Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, Dominik Moritz, Brandon Myers, Jennifer Ortiz, Dan Suciu, Andrew Whitaker, Shengliang Xu
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
UNIVERSITY OF WASHINGTON
http://myria.cs.washington.edu
The Myria Big Data Management and Analytics System and Cloud Service
Acknowledgments
The Myria Team!
Our science collaborators!!
• Andrew Connolly, Tom Quinn, Sarah Loebman, Ariel Rokem, Ginger Armbrust, Yejin Choi
Our sponsors!!!
• National Science Foundation, Moore & Sloan Foundations, Washington Research Foundation, eScience Institute, ISTC Big Data, Petrobras, EMC, Amazon, and Facebook
Magdalena Balazinska - University of Washington
[Diagram: Big Data Management and Analytics; Efficient; Easy; Science Apps]
Goals of the Myria stack
• Advance the state of the art in big data systems
• Focus on efficiency and productivity
• Test on real applications and support real users
Deliverables:
• Built a new big data management & analytics system
• Deployed and now operate Myria as a service
• Source code and demo service: http://myria.cs.washington.edu
Myria has been developed and is operated by
• Database Group in the Computer Science & Engineering Department at UW
• UW eScience Institute
Co-PIs: Dan Suciu and Bill Howe
Myria Demo
Myria Cloud Service
Service available through project website
Analysis in the Browser with Myria
Declarative-imperative analysis with MyriaL and Python
Myria Operates Directly on Data in S3
For efficient processing, caches query results internally in the cluster
MyriaL is Imperative + Declarative with Iterations
Myria Provides Details of Query Execution
Myria Service includes Jupyter Notebook
Jupyter notebook available directly with Myria service
Myria Supports Python User-Defined Functions
Data from the Human Connectome project
MRI data analysis
Python UDFs enable running legacy code and complex analytics beyond SQL/MyriaL
Users Can Deploy Own Service
pip install myria-cluster
myria-cluster create [OPTIONS] CLUSTER_NAME
myria-cluster stop/start/destroy […]
Example Myria Applications
Neuroscience, Astronomy, Natural Language Processing, Oceanography, Bibliometrics
Figure credits: picture from Leila Zilles; MyMergerTree screenshot; data from the Human Connectome project
[Oceanography figure: scatter plots (panels "ps3.fcs…subset" and "P35-surf"; axes include FSC and 692-40 RED fluorescence) with populations labeled Picoplankton, Nanoplankton, Ultraplankton, Phytoplankton, and Prochlorococcus]
Myria Internals
Myria Polystore Stack
• Interfaces: Browser, Python/Jupyter, Specialized Services (MyMergerTree)
• RACO: Query Translation, Optimization, and Orchestration
• MyriaX: Parallel, Iterative, and Elastic Query Execution
• Other systems in the stack: MPI, SciDB, Graphs, NoSQL
Myria’s Data Model and Query Interface
• Relational Algebra Compiler (RACO)
  – Myria’s query optimizer and federator
• RACO core: relational algebra extended with
  – Iterations for multi-pass algorithms
  – Flatmap to explode non-1NF attribute values into many tuples
  – Stateful apply for windowed and neighborhood functions
• Query language: MyriaL (Imperative + Declarative)
  – Each statement is declarative (SQL, comprehensions, function calls)
  – Statements are combined with imperative constructs
    • Variable assignment
    • Iteration
• Python UDFs/UDAs
  – Minimize barriers to adoption and run legacy code
• Python API
  – Fluent API with Python lambda functions
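The flatmap and stateful apply extensions listed above can be illustrated in plain Python. This is a minimal sketch of the two operator ideas, not RACO or MyriaX code: the function names `flatmap` and `stateful_apply`, the `moving_avg` step function, and the sample data are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")
S = TypeVar("S")


def flatmap(rows: Iterable[T], explode: Callable[[T], Iterable[U]]) -> Iterator[U]:
    """Emit zero or more output tuples for every input tuple."""
    for row in rows:
        yield from explode(row)


def stateful_apply(
    rows: Iterable[T], init: S, step: Callable[[S, T], Tuple[S, U]]
) -> Iterator[U]:
    """Thread a state value through the rows; step returns (new_state, output)."""
    state = init
    for row in rows:
        state, out = step(state, row)
        yield out


# Hypothetical non-1NF relation: (doc_id, list_of_words).
docs = [(1, ["big", "data"]), (2, ["myria"])]
words = list(flatmap(docs, lambda r: ((r[0], w) for w in r[1])))


def moving_avg(window, x):
    """Windowed function: average of the last two readings."""
    window = (window + [x])[-2:]
    return window, sum(window) / len(window)


readings = [10.0, 20.0, 40.0]
averages = list(stateful_apply(readings, [], moving_avg))

print(words)     # [(1, 'big'), (1, 'data'), (2, 'myria')]
print(averages)  # [10.0, 15.0, 30.0]
```

Flatmap turns one tuple with a list-valued attribute into several flat tuples, while stateful apply carries a small state (here, a two-element window) from one tuple to the next.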
Polystore Optimization
• Rule-based optimization with three types of rules
  – Optimize logical Myria algebra plans
  – Translate logical plans into back-end specific physical plans
  – Optimize back-end specific physical plans
• To add a new back-end, developer must specify
  – Tree representation of query language
  – Rules that translate Myria algebra into this representation
  – Administrative functions including one to submit queries
• Data model independence
  – Myria hides the existence of various back-ends
  – Users write MyriaL scripts assuming relational model
  – Back-ends include select array, graph, and key-value systems
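The three rule types can be pictured as three rewrite passes over a plan tree. The sketch below is a toy illustration under stated assumptions, not RACO's actual classes or rules: the `Op` node type, the operator names (`Select`, `Scan`, `SqlFilter`, `SqlScan`, `SqlScanWhere`), and the SQL-like back-end are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Op:
    """A plan node: operator name, arguments, and child operators."""
    name: str
    args: Tuple[str, ...] = ()
    children: Tuple["Op", ...] = ()


Rule = Callable[[Op], Optional[Op]]  # returns a rewritten node, or None


def rewrite(plan: Op, rules: List[Rule]) -> Op:
    """Rewrite children first, then apply rules at this node until none fires."""
    plan = Op(plan.name, plan.args, tuple(rewrite(c, rules) for c in plan.children))
    for rule in rules:
        new_plan = rule(plan)
        if new_plan is not None:
            return rewrite(new_plan, rules)
    return plan


# 1. Logical rule: merge two stacked selections into one.
def merge_selects(op: Op) -> Optional[Op]:
    if op.name == "Select" and op.children and op.children[0].name == "Select":
        child = op.children[0]
        return Op("Select", op.args + child.args, child.children)
    return None


# 2. Translation rule: map logical operators to a hypothetical SQL-like back-end.
def to_sql_backend(op: Op) -> Optional[Op]:
    mapping = {"Scan": "SqlScan", "Select": "SqlFilter"}
    if op.name in mapping:
        return Op(mapping[op.name], op.args, op.children)
    return None


# 3. Physical rule: fold a filter into the scan directly beneath it.
def push_filter_into_scan(op: Op) -> Optional[Op]:
    if op.name == "SqlFilter" and op.children and op.children[0].name == "SqlScan":
        return Op("SqlScanWhere", op.children[0].args + op.args)
    return None


logical = Op("Select", ("a > 1",), (Op("Select", ("b < 2",), (Op("Scan", ("T",)),)),))
optimized = rewrite(logical, [merge_selects])
translated = rewrite(optimized, [to_sql_backend])
physical = rewrite(translated, [push_filter_into_scan])
print(physical)
```

Keeping the passes separate mirrors the slide: only the second and third rule sets mention back-end operators, so supporting another back-end in this sketch would mean supplying new translation and physical rules while reusing the logical ones.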
Federated Query Execution
Federated plans require fast data movement
[Diagram: User (or optimizer), Source DBMS, Target DBMS, and a Worker Directory; steps labeled [1] to [4]]
Source DBMS (Worker 1 … Worker n):
t = scan(data)
x = distances(t,t)
export(x,'db://Target')
Target DBMS (Worker 1 … Worker m):
x = import('db://Source')
u = cluster(x)
Worker Directory:
source.w1 → target.wm
source.wn → target.w1
Data Movement with PipeGen
[Diagram: DBMS bytecode and unit tests go into PipeGen, which produces a PipeGen-enabled DBMS (DBMS with optimized data pipe)]
• IORedirect: I/O Redirector
  – Identify File Open Expressions
  – Inject Conditional Redirection
  – Instrument Unit Tests
• FormOpt: Format Optimizer
  – Instrument Unit Tests
  – Data Flow Analysis
  – Type Substitution
  – Data Pipe Type, Augmented Types
• PipeVerify: Verification
PipeGen: Data Pipe Generator for Hybrid Analytics. Brandon Haynes, Alvin Cheung, and Magdalena Balazinska. SOCC 2016.
PipeGen’s Performance
16-node cluster with 16 workers/tasks
Transfer 10^9 tuples with 4 ints and 3 doubles
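For a rough sense of scale, the transfer size in this experiment can be estimated with a little arithmetic. The slide does not state field widths or how tuples are distributed, so the sketch assumes 4-byte ints, 8-byte doubles, an even split across the 16 workers, and no serialization overhead.

```python
INT_BYTES = 4      # assumption: 32-bit integers
DOUBLE_BYTES = 8   # assumption: 64-bit doubles
NUM_TUPLES = 10**9
NUM_WORKERS = 16

bytes_per_tuple = 4 * INT_BYTES + 3 * DOUBLE_BYTES
total_gb = NUM_TUPLES * bytes_per_tuple / 10**9
per_worker_gb = total_gb / NUM_WORKERS  # assumption: even split across workers

print(bytes_per_tuple)  # 40
print(total_gb)         # 40.0
print(per_worker_gb)    # 2.5
```

Under these assumptions the raw binary payload is about 40 GB in total. Text formats such as CSV would typically be larger.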
MyriaX Engine and Cloud Deployment
[Architecture diagram]
• Coordinator with REST interface; accepts JSON query plans & API calls
• Amazon EC2 instances, each hosting Workers in YARN containers and an RDBMS
• Storage: HDFS, Amazon EBS volumes and/or local storage, Amazon S3
MyriaX Overview
• Data storage
  – Read data from S3, HDFS, local files
  – Parse CSV, TSV, and various scientific file formats
  – Store data in local relational DBMS instances
    • Fast storage with physical tuning (indexing, hash-partitioning)
• Query execution
  – Fundamentally a parallel DBMS
    • Fast, pipelined query execution
  – But scheduling more flexible to support elasticity
  – Novel features: Multiway joins and iterations
• Resource management
  – Executes on top of the YARN resource manager
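Hash-partitioning, mentioned above as a physical tuning option, matters in a shared-nothing engine because two relations partitioned on the join key can be joined locally on each worker. The sketch below illustrates the idea in plain Python. It is not MyriaX code: the helper names, the CRC32-based hash, the worker count, and the sample relations are assumptions chosen for the example.

```python
import zlib
from collections import defaultdict


def worker_for(key, num_workers):
    """Deterministically map a partitioning key to a worker id."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_workers


def hash_partition(rows, key_index, num_workers):
    """Group rows by the worker that owns their key."""
    parts = defaultdict(list)
    for row in rows:
        parts[worker_for(row[key_index], num_workers)].append(row)
    return parts


def local_join(left, right):
    """Equi-join on the first column of both inputs."""
    return [(l[0], l[1], r[1]) for l in left for r in right if l[0] == r[0]]


NUM_WORKERS = 4  # hypothetical cluster size
users = [(1, "ann"), (2, "bob"), (3, "cy")]
visits = [(1, "/home"), (3, "/docs"), (3, "/faq"), (4, "/x")]

users_parts = hash_partition(users, 0, NUM_WORKERS)
visits_parts = hash_partition(visits, 0, NUM_WORKERS)

# Each worker joins only its own partitions; no rows move between workers.
distributed = sorted(
    row
    for w in range(NUM_WORKERS)
    for row in local_join(users_parts[w], visits_parts[w])
)
print(distributed)
```

Because both relations use the same hash function on the join key, matching rows always land on the same worker, so the union of the local joins equals the join over the full relations.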
Efficient Iterative Processing
• User specifies query declaratively
  – Subset of Datalog with aggregation
• Generate efficient, shared-nothing query plan
  – Small extensions to existing shared-nothing systems
• Plan amenable to runtime optimizations
  – Synchronous vs asynchronous
  – Different processing priorities
• Optimizations significantly affect performance
Asynchronous and Fault-Tolerant Recursive Datalog Evaluation in Shared-Nothing Engines. Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. PVLDB 8(12):1542-1553 (2015)
Myria’s Optimized Iterations Example
Declarative Query
E = scan(jwang:cc:graph);
V = select distinct E.$0 from E;
do
  CC := [$0, MIN($1)] <-
    [from V emit V.$0 as x, V.$0 as y] +
    [from E, CC where E.$0 = CC.$0 emit E.$1, CC.$1];
until convergence;
store(CC, CC);
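To make the semantics of the query above concrete, here is a single-machine Python sketch of the same fixpoint. Every source vertex starts labeled with its own id, labels flow along edges, and each vertex keeps the minimum label it has seen. This mirrors only the synchronous, round-by-round reading of the query. It is not how MyriaX executes it, and the function name and sample graph are hypothetical.

```python
def connected_components(edges):
    """Iterate CC(x, MIN(y)) to a fixpoint over a list of (src, dst) edges."""
    vertices = {src for src, _ in edges}   # V = select distinct E.$0 from E
    cc = {v: v for v in vertices}          # base case: each vertex labels itself
    while True:
        # Recompute MIN over (base tuples) + (E join CC on E.$0 = CC.$0).
        candidates = {v: v for v in vertices}
        for src, dst in edges:
            if src in cc:
                candidates[dst] = min(candidates.get(dst, cc[src]), cc[src])
        if candidates == cc:               # until convergence
            return cc
        cc = candidates


# Hypothetical undirected graph, stored with both edge directions.
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
labels = connected_components(edges)
print(labels)
```

For this sample graph, vertices 1, 2, and 3 converge to label 1, and vertices 4 and 5 converge to label 4. Labels only ever decrease from one round to the next, which is why the loop terminates.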
Note: can have multiple relations with recursive dependencies
Compiled to a Distributed Query Plan
[Plan diagram: IDBController(CC), Join, Scan(Edges), Scan(Edges)]
Performance Comparison with Spark
Declarative Query (subset of Datalog with agg.)
Shared-Nothing Query Plan
In-Memory Processing
• Synchronous or Asynchronous
• Prioritize New Data or Prioritize Base Data
[Chart: Query Time (Seconds) vs. # of Workers (8, 16, 32, 64) for Spark (GraphX), Myria Sync, and Myria Async]
Connected Components: Twitter subgraph, 221 million edges and 5 million vertices
Cloud Operation in Myria
Or point to data in Amazon S3
Myria’s Personalized Service Level Agreements
Myria’s SLA generation
[Pipeline diagram labels: Schema, Workload Generation, Runtime Prediction, Workload Compression into PSLA (Query Clustering, Template Generation, Cross-Tier Pruning), PSLA]
Changing the Face of Database Cloud Services with Personalized Service Level Agreements. Jennifer Ortiz, Victor T. Almeida, and Magdalena Balazinska. CIDR 2015
Myria’s PerfEnforce Subsystem
Cluster size changes during query session
PerfEnforce Demonstration: Data Analytics with Performance Guarantees. Jennifer Ortiz, Brendan Lee, and Magdalena Balazinska. SIGMOD 2016.
Myria’s Innovations Summary
Areas: Myria Polystore; Efficient Processing & Complex Analytics with MyriaX; Myria Cloud Operation
Topics shown: Federated Analytics, Automatic Data Pipes, Efficient Multi-Join, Iterative Queries, Image Processing, Perf. Debugging, Data Summaries, Cloud PSLAs, Performance Guarantees, Elastic Memory
Conclusion
• Highly expressive
  – MyriaL (RA + iterations) & Python
• Polystore with hybrid analytics
• High performance on variety of queries
• Available as a service
  – Focus on low barrier to entry
  – And turning users into self-sufficient experts
  – Also focus on the service provider: Operate Myria
• Source code and more info (includes videos) http://myria.cs.washington.edu/