Post on 11-Jun-2018
transcript
Complexity vs. Performance: Empirical Analysis of Machine Learning as a Service
Yuanshun Yao, Zhujun Xiao, Bolun Wang*, Bimal Viswanath, Haitao Zheng and Ben Y. Zhao
The University of Chicago, *University of California, Santa Barbara
ysyao@cs.uchicago.edu
ML in Network Research
Congestion control protocols
• Sivaraman et al., SIGCOMM '14
• Winstein & Balakrishnan, SIGCOMM '13
Network link prediction
• Liu et al., IMC '16
• Zhao et al., IMC '12
User behavior analysis
• Wang et al., IMC '14
• Zannettou et al., IMC '17
…
Is my model good enough?
Why Study ML-as-a-Service?
Q: How well do they perform?
Q: How much does the amount of user control impact ML performance?
ML-as-a-Service Platforms
• Google Prediction
• Amazon ML
• Microsoft ML
• PIO, ABM, BigML
less ← amount of user input → more
Control in ML
training data → trained model
• Data Cleaning: invalid/duplicate/missing data
• Feature Selection: mutual information, Pearson, chi-square, …
• Classifier Choice: logistic regression, decision tree, kNN, …
• Parameter Tuning: e.g. logistic regression: L1, L2, max_iter, …
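The feature-selection criteria named above can be sketched in a few lines. This is a minimal illustration using scikit-learn on a synthetic dataset, not any platform's actual API; the dataset sizes and `k=3` are arbitrary choices for the example.

```python
# Sketch: scoring features with mutual information and chi-square,
# two of the feature-selection criteria named on the slide.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# chi-square requires non-negative inputs, so shift the features.
X_pos = X - X.min(axis=0)

for name, score_fn in [("mutual info", mutual_info_classif), ("chi-square", chi2)]:
    selector = SelectKBest(score_fn, k=3).fit(X_pos, y)
    X_sel = selector.transform(X_pos)
    print(name, "kept features:", selector.get_support(indices=True), X_sel.shape)
```

Pearson correlation would slot in the same way (e.g. via `f_classif` or a correlation score); the point is that each criterion ranks features and keeps the top k.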
Control in ML-as-a-Service

                    ABM   Google   Amazon   PIO   BigML   Microsoft
Data Cleaning        ✖      ✖        ✔       ✔      ✔        ✔
Feature Selection    ✖      ✖        ✖       ✖      ✖        ✖
Classifier Choice    ✖      ✖        ✖       ✖      ✖        ✔
Parameter Tuning     ✖      ✖        ✖       ✔      ✔        ✔

low ← user control/complexity → high
Complexity vs. Performance?
Characterizing Performance
• Theoretical modeling is hard
  • Output of an ML model depends on the dataset
  • No access to implementation details
• Empirical data-driven analysis
  • Simulate a real-world scenario from end to end
  • Need a large number of diverse datasets
• Focus on binary classification
Dataset
• 119 datasets
• From diverse application domains: Life Science 37%, Computer Applications 15%, Artificial Test 14%, Social Science 9%, Physical Science 8%, Financial & Business 6%, Other 11%
• Sample size: 15 – 245K; number of features: 1 – 4K
• 79% of them are from the UCI ML Repository
Methodology
• Tune all available control dimensions (training data → trained model)
  • Feature Selection
  • Classifier Choice: Logistic Regression, kNN, SVM, …
  • Parameter Tuning: L1_reg, L2_reg, max_iter, …
• All tuning is done through each platform's API
• Evaluate the trained model on testing data via the same API
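The tuning loop above can be sketched as an exhaustive search over classifier choice and parameters. The paper drives each platform's API; here scikit-learn's `GridSearchCV` stands in locally, and the classifier list and parameter grids are illustrative, not the paper's exact search space.

```python
# Sketch: exhaustively tune classifier choice and parameters,
# then evaluate the best model on held-out testing data (F-score).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One (classifier, parameter grid) pair per control dimension to explore.
search_space = [
    (LogisticRegression(max_iter=1000),
     {"penalty": ["l1", "l2"], "C": [0.1, 1, 10], "solver": ["liblinear"]}),
    (KNeighborsClassifier(), {"n_neighbors": [1, 5, 15]}),
]

# Keep the classifier/parameter combination with the best cross-validated F-score.
best = max(
    (GridSearchCV(clf, grid, scoring="f1").fit(X_tr, y_tr)
     for clf, grid in search_space),
    key=lambda gs: gs.best_score_,
)
print(best.best_estimator_, "test F1 =", f1_score(y_te, best.predict(X_te)))
```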
Complexity vs. Performance
• Q: How does complexity correlate with performance?
• High complexity → high performance
[Bar chart: average optimized F-score (0.5 – 1.0) per platform, ordered by complexity from low to high: ABM, Google, Amazon, BigML, PIO, Microsoft, Scikit]
Complexity vs. Risk
• Q: How does risk correlate with complexity?
• High complexity → high risk
[Bar chart: performance variance in F-score (0 – 0.5) per platform, ordered by complexity from low to high: ABM, Google, Amazon, BigML, PIO, Microsoft, Scikit]
Reverse-engineering Optimization
• Q: Does the server side adapt to different datasets?
• Reverse-engineering using synthetic datasets
  • Create synthetic datasets (a circularly separable and a linearly separable one, two features each)
  • Use prediction results to infer classifier information
[Scatter plots: the circular and linear synthetic datasets, Feature #1 vs. Feature #2, Class 0 vs. Class 1]
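The probing step can be sketched as follows. A local scikit-learn model simulates the black-box service (the paper queries the real platform's prediction API instead), and the grid resolution and dataset parameters are arbitrary choices for the example.

```python
# Sketch: build the two synthetic datasets (circularly and linearly
# separable), train a black-box classifier on each, then read off its
# decision boundary from a grid of prediction queries.
import numpy as np
from sklearn.datasets import make_circles, make_blobs
from sklearn.linear_model import LogisticRegression

datasets = {
    "circular": make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0),
    "linear": make_blobs(n_samples=200, centers=2, random_state=0),
}

for name, (X, y) in datasets.items():
    black_box = LogisticRegression().fit(X, y)  # stand-in for a platform API
    # Query a grid of probe points, one prediction call per point.
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 50),
                         np.linspace(X[:, 1].min(), X[:, 1].max(), 50))
    probes = np.c_[xx.ravel(), yy.ravel()]
    labels = black_box.predict(probes)
    print(name, "probe label counts:", np.bincount(labels))
```

Plotting `labels` over the grid reveals the shape of the boundary the service learned: linear for a linear classifier, curved for something more flexible.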
Understanding Optimization: Google decision boundaries
• Google switches between classifiers based on the dataset
• Use supervised learning to infer the classifier family used
[Scatter plots: Google's decision boundaries on the circular and linear datasets, Feature #1 vs. Feature #2, Class 0 vs. Class 1]
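One simple way to infer the classifier family, in the spirit of the approach above: train local candidates from each family and see whose probe-point predictions agree most with the black box. The "black box" below is simulated with scikit-learn (the paper queries the real service), and the candidate list is illustrative.

```python
# Sketch: guess the classifier family behind a black box by comparing its
# predictions on random probe points against locally trained candidates.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)
black_box = KNeighborsClassifier().fit(X, y)   # pretend we don't know this

rng = np.random.default_rng(0)
probes = rng.uniform(X.min(), X.max(), size=(1000, 2))
observed = black_box.predict(probes)

candidates = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
}
# Fraction of probe points where each candidate agrees with the black box.
agreement = {name: (clf.fit(X, y).predict(probes) == observed).mean()
             for name, clf in candidates.items()}
guess = max(agreement, key=agreement.get)
print(agreement, "->", guess)
```

On the circular dataset, a linear candidate agrees on roughly half the probes while the matching family agrees almost everywhere, which is what makes the inference work.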
Takeaways
• ML-as-a-Service is an attractive tool to reduce workload
• But user control still has a large impact on performance
• Fully automated systems are less risky