PWL Seattle #23 - A Few Useful Things to Know About Machine Learning

A Few Useful Things to Know about Machine Learning
Pedro Domingos
Prepared by Tristan Penman
Papers We Love @ Seattle, September 2016


About The Paper

A Few Useful Things to Know about Machine Learning

Pedro Domingos
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350, U.S.A.
[email protected]

ABSTRACT
Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

1. INTRODUCTION
Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation [16]. Several fine textbooks are available to interested practitioners and researchers (e.g., [17, 25]). However, much of the “folk knowledge” that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.

Many different types of machine learning exist, but for illustration purposes I will focus on the most mature and widely used one: classification. Nevertheless, the issues I will discuss apply across all of machine learning. A classifier is a system that inputs (typically) a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. For example, a spam filter classifies email messages into “spam” or “not spam,” and its input may be a Boolean vector x = (x1, ..., xj, ..., xd), where xj = 1 if the jth word in the dictionary appears in the email and xj = 0 otherwise. A learner inputs a training set of examples (xi, yi), where xi = (xi,1, ..., xi,d) is an observed input and yi is the corresponding output, and outputs a classifier. The test of the learner is whether this classifier produces the correct output yt for future examples xt (e.g., whether the spam filter correctly classifies previously unseen emails as spam or not spam).
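To make the spam-filter example concrete, here is a minimal sketch in Python of the Boolean bag-of-words feature vector described above. The dictionary and example emails are invented for illustration; they are not from the paper.

# Sketch of the Boolean feature vector x = (x1, ..., xd) described above,
# where xj = 1 iff the j-th dictionary word appears in the email.
# The dictionary and emails are invented for illustration.
DICTIONARY = ["free", "money", "meeting", "viagra", "report"]

def to_feature_vector(email_text):
    words = set(email_text.lower().split())
    return [1 if word in words else 0 for word in DICTIONARY]

# A tiny training set of (xi, yi) pairs, with y = 1 meaning spam.
emails = [("free money now", 1), ("quarterly report meeting", 0)]
training_set = [(to_feature_vector(text), label) for text, label in emails]
print(training_set)  # [([1, 1, 0, 0, 0], 1), ([0, 0, 1, 0, 1], 0)]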

2. LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION

Suppose you have an application that you think machine learning might be good for. The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year. The key to not getting lost in this huge space is to realize that it consists of combinations of just three components. The components are:

Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question, which we will address in a later section, is how to represent the input, i.e., what features to use.

Evaluation. An evaluation function (also called objective function or scoring function) is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimize, for ease of optimization (see below) and due to the issues discussed in the next section.

Optimization. Finally, we need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.

Table 1 shows common examples of each of these three components. For example, k-nearest neighbor classifies a test example by finding the k most similar training examples and predicting the majority class among them. Hyperplane-based methods form a linear combination of the features per class and predict the class with the highest-valued combination. Decision trees test one feature at each internal node, …
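To illustrate how the three components combine, here is a minimal k-nearest-neighbor sketch in Python: the representation is the stored training examples plus a distance function, the evaluation is a majority vote, and the optimization is just a sort. The toy points and the choice of Euclidean distance are assumptions for illustration.

import math
from collections import Counter

def knn_predict(training_set, x, k=3):
    # Sort training examples by Euclidean distance to x,
    # then predict the majority class among the k nearest.
    by_distance = sorted(training_set, key=lambda ex: math.dist(ex[0], x))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D training set: two clusters labeled 0 and 1.
train = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([1.0, 0.9], 1), ([0.9, 1.1], 1)]
print(knn_predict(train, [0.1, 0.2], k=3))  # -> 0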

• First published in 2012 in CACM

• Recognizes that successful Machine Learning is something of a black art, communicated as folk knowledge

• Summarizes twelve key lessons about Machine Learning that should be internalized by students and practitioners alike

• Does not contain all the answers, but points you in the right direction and challenges assumptions

About The Author

• Pedro Domingos
  – Professor at The University of Washington
  – Prolific author and researcher in the fields of Machine Learning and Data Mining
  – Author of The Master Algorithm

• His website contains a plethora of interesting material: http://homes.cs.washington.edu/~pedrod/

• The paper discussed in this talk can be found at: http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

Talk Overview

• Explore a sample of five lessons included in the paper
  – Motivate students to read the paper and to gain an appreciation for practical Machine Learning
  – Motivate practitioners to read the paper, and to plan to re-read it multiple times throughout their career

• Intersperse with some useful resources
  – Books covering important background material
  – Some online resources

• And to reinforce a mindset required for the successful application of Machine Learning

Key Lessons

• Learning = Representation + Evaluation + Optimization
• It's Generalization That Counts
• Data Alone is Not Enough
• Overfitting Has Many Faces
• Intuition Fails In High Dimensions
• Theoretical Guarantees Are Not What They Seem
• Feature Engineering is the Key
• More Data Beats a Cleverer Algorithm
• Learn Many Models, Not Just One
• Simplicity Does Not Imply Accuracy
• Representable Does Not Imply Learnable
• Correlation Does Not Imply Causation


Learning = Representation + Evaluation + Optimization

– Representation: “A classifier must be represented in some formal language that the computer can handle”

– Evaluation: “An evaluation function is needed to distinguish good classifiers from bad ones.”

– Optimization: “we need a method to search among the classifiers in the language for the highest-scoring one”


Overfitting Has Many Faces

• “What if the knowledge and data we have are not sufficient to completely determine the correct classifier?”

• “Then we run the risk of just hallucinating a classifier … that is not grounded in reality”

• Be diligent about maintaining separate training and test data

• Study and understand your tools (a cross-validation sketch follows below):
  – Cross-validation
  – Bootstrapping
  – Regularization
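As a minimal sketch of keeping training and test data separate, with cross-validation as one of those tools, here is an example using scikit-learn. The synthetic dataset and the choice of logistic regression are illustrative assumptions, not prescriptions from the talk.

# Train/test split plus k-fold cross-validation with scikit-learn.
# The synthetic dataset and model choice are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training data only.
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Fit on all training data, then evaluate once on the held-out test set.
model.fit(X_train, y_train)
print("Test accuracy: %.3f" % model.score(X_test, y_test))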

[Image: late-stage DeepDream-processed photograph of three men in a pool, showing that there are some useful analogies that can be drawn between hallucination and generative models.]

Overfitting Has Many Faces

• “One way to understand overfitting is by decomposing generalization error into bias and variance” (see the empirical sketch below)

• Recommended reading
  – For students: An Introduction to Statistical Learning with Applications in R (ISLR)
  – For practitioners: The Elements of Statistical Learning (ESLII)
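To make the bias/variance decomposition concrete, one can refit the same model class on many resampled training sets and measure how its predictions at a fixed point deviate systematically (bias) and fluctuate (variance). Below is a minimal sketch; the sine-wave data generator and the straight-line fit are invented for illustration.

# Empirical bias/variance sketch: fit the same model class on many
# resampled training sets and inspect predictions at a fixed point.
# The data-generating function is an invented example.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(x)
x0 = 1.5  # fixed query point at which we study bias and variance

preds = []
for _ in range(200):
    x = rng.uniform(0, np.pi, 20)
    y = true_f(x) + rng.normal(0, 0.3, x.size)
    coeffs = np.polyfit(x, y, deg=1)  # a deliberately simple (high-bias) model
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias = preds.mean() - true_f(x0)  # systematic error of the model class
variance = preds.var()            # sensitivity to the training sample
print("bias^2 = %.4f, variance = %.4f" % (bias**2, variance))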

ISLR and ESLII

• ISLR is a nice introduction to many statistical learning methods that are often considered to form the basis of Machine Learning

• ESLII is a more in-depth treatment of this material, better suited to practitioners

• PDFs of both books are freely available online

• Also consider Stanford Online's free Statistical Learning course, which uses ISLR as source material: http://tinyurl.com/z3f3scm

Feature Engineering is the Key

• “Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are.”

• “This is typically where most of the effort in a machine learning project goes.”

• Effort will generally be required to get decent results (see the sketch below):
  – Collect your data
  – Analyze and understand it in depth
  – Filter, cleanse, purge
  – Aggregate, segregate, etc.
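As a small illustration of constructing features from raw data, here is a pandas sketch in the spirit of the Titanic competition discussed next. The column names follow Kaggle's Titanic schema, but the rows are invented.

# Feature-engineering sketch in the spirit of Kaggle's Titanic data.
# Column names follow the competition's schema; the rows are invented.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "SibSp": [1, 1, 0],   # siblings/spouses aboard
    "Parch": [0, 0, 0],   # parents/children aboard
})

# Construct features that the raw columns only imply.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)  # Mr / Mrs / Miss
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["Age"] = df["Age"].fillna(df["Age"].median())  # cleanse missing values
df["IsFemale"] = (df["Sex"] == "female").astype(int)

print(df[["Title", "FamilySize", "Age", "IsFemale"]])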

Titanic Competition on Kaggle

• For a hands-on introduction to Feature Engineering, try out the introductory Titanic competition on Kaggle

• Titanic: Machine Learning From Disaster: http://www.kaggle.com/c/titanic

• Trevor Stephens' Titanic: Getting Started with R: http://trevorstephens.com/kaggle-titanic-tutorial/getting-started-with-r

• Track your progress on the leaderboard, and see how other teams have achieved some very impressive results given limited data.

Learn Many Models, Not Just One

• “then researchers noticed that, if instead of selecting the best variation found, we combine many variations, the results are better—often much better”

• In relation to the Netflix Prize: “The winner and runner-up were both stacked ensembles of over 100 learners, and combining the two ensembles further improved the results.”

• The key idea here is that if you can afford the loss of interpretability, embracing ensembles yields great results (a stacking sketch follows below)
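Here is a minimal sketch of a stacked ensemble using scikit-learn, far smaller than the 100-learner Netflix ensembles; the synthetic data and the particular base learners are illustrative assumptions.

# Minimal stacked ensemble: base learners combined by a meta-learner.
# Data and model choices are illustrative, not the Netflix Prize setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=100, random_state=1)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-learner
)

print("stacked CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())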

Representable Does Not Imply Learnable

• “…just because a function can be represented does not mean it can be learned…”
  – Lack of data (relative to the number of predictors)
  – Representation is too large
  – Model availability given other requirements
  – Continuous vs. discrete functions

• “Therefore the key question is not ‘Can it be represented?’, to which the answer is often trivial, but ‘Can it be learned?’”

Figure 8.7 from ISLR. Top Row: A two-dimensional classification example in which the true decision boundary is linear, and is indicated by the shaded regions. A classical approach that assumes a linear boundary (left) will outperform a decision tree that performs splits parallel to the axes (right). Bottom Row: Here the true decision boundary is non-linear. Here a linear model is unable to capture the true decision boundary (left), whereas a decision tree is successful (right).
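A numeric version of the figure's top-row comparison: on data whose true boundary is linear, a linear model should beat an axis-parallel decision tree. The synthetic data generation below is an assumption for illustration.

# When the true boundary is a diagonal line, a linear model recovers it
# almost exactly, while a tree must approximate it with axis-parallel
# splits. The synthetic data is an illustrative assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # true boundary: the diagonal

linear = LogisticRegression(max_iter=1000)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)

print("linear model:", cross_val_score(linear, X, y, cv=5).mean())
print("decision tree:", cross_val_score(tree, X, y, cv=5).mean())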


Mindset for Success

• Meditate on the data you collect and deeply understand how it is, or can be, represented

• Ensure you have enough good data, and be prepared to admit the limitations of the data you have

• Cleanse the data and massage it into a form that is appropriate for machine learning

• Test different models and hypotheses

• Combine models for greater effect

• Accept the possibility of failure

The Master Algorithm

• Pedro's book, The Master Algorithm, links ideas from neuroscience, evolution, psychology, physics and statistics to “assemble a blueprint for the future universal learner”: http://a.co/34CmC2B

• This similarly titled talk serves as a nice introduction to the book, covering some basic ideas such as how the schools of thought differ, yet are still linked by the pursuit of a Universal Learner: https://youtu.be/B8J4uefCQMc

Thanks for listening!

And please feel free to ask questions.

