The MADlib Analytics Library or MAD Skills, the SQL
Caleb Welton, Florian Schoppmann, Christopher Ré
MADlib
Scalable Machine Learning for Big Data
Traditional analytics pipeline
[Figure: traditional analytics pipeline. Stage labels: Data Prep, DB Extract, DB Import. File labels: sample.csv, spec.docx, scores.csv. Axis: Time-to-Insights.]
The MAD approach
[Figure: the MAD approach. Labels: Enterprise Data; RDBMS (four boxes); stages Data Prep, Model, Score. Axis: Time-to-Insights.]
• Reduced data movement
• Billions of rows in minutes
MADlib in Action
Hospital Admittance Case Study
MADlib in Action
Step 1:
• Identify high-risk patients
Goal:
• High-risk patients will be eligible for early admittance and be administered preemptive antibiotics
MADlib in Action
Step 2:
• Build cost model for treatment
[Figure: plot with axes x and y]
Goal:
• Predict expected cost of treatment
• With and without early admittance
MADlib in Action
Step 3:
• Optimize early admittance based on risk and cost model
Goal:
• Overall hospital costs will be minimized and patients will receive better care
MADlib cycle of success
[Figure: cycle diagram. Labels: Identify Problem, Use Math, Insights, Value!]
"Why didn't I think of that before?"
The MADlib Vision
• Academic and industry contributions
• Think of "CRAN for databases"
– Repository of open-source ML algorithms
– This time with data parallelism in mind
• Open-Source Framework
[Logos: Eigen, BSD License]
Simple Example:
Ordinary Least Squares
# SELECT (linregr(y, x)).* FROM data;
-[ RECORD 1 ]+------------------------
coef         | {1.7307,2.2428}
r2           | 0.9475
std_err      | {0.3258,0.0533}
t_stats      | {5.3127,42.0640}
p_values     | {6.7681e-07,4.4409e-16}
condition_no | 169.5093
# SELECT y, x[1] AS x1, x[2] AS x2 FROM data;
   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8
[Figure: data labeled X and y; plot of y against x]
Linear Algebra in the Database
$$\hat{\beta} = (X^T X)^{-1} X^T y, \quad \text{where} \quad X^T X = \sum_{i=1}^{n} x_i x_i^T \quad \text{and} \quad X^T y = \sum_{i=1}^{n} x_i y_i$$
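As a brief worked step (standard linear algebra, not taken from the slides), writing X row by row shows why both factors reduce to sums over rows. The index sets S_1, ..., S_m below are an assumed notation for any partition of the rows, for example across database segments.

```latex
X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}
\;\Longrightarrow\;
X^T X
  = \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix}
    \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}
  = \sum_{i=1}^{n} x_i x_i^T ,
\qquad
X^T y = \sum_{i=1}^{n} x_i y_i .

% For any partition S_1, \dots, S_m of the row indices \{1, \dots, n\}:
\sum_{i=1}^{n} x_i x_i^T = \sum_{k=1}^{m} \sum_{i \in S_k} x_i x_i^T ,
\qquad
\sum_{i=1}^{n} x_i y_i = \sum_{k=1}^{m} \sum_{i \in S_k} x_i y_i .
```

Each inner sum depends only on the rows in its own part, so the parts can be accumulated independently and added afterwards.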
Basic Building Block:
User-Defined Aggregates
      x       | y
--------------+---
 (1,0,3,…,5)  | 3
 (-2,4,5,…,2) | 2
 …            | …
Aggregation phase 1 on each node (map):
1. Initialize: (A, b) = (0, 0)
2. Transition for all rows: (A, b) = (A, b) + (x ⋅ xᵀ, x ⋅ y)
3. Send (A, b)
Aggregation phase 2 on master node (reduce):
1. Merge: (A, b) = (A, b) + (A′, b′) for each received state (A′, b′)
2. Finalize: β̂ = solve(A, b) = A⁻¹ ⋅ b
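The two aggregation phases can be sketched in plain Python. This is a minimal illustration, not MADlib's implementation: the function names (`init`, `transition`, `merge`, `final`, `aggregate_on_node`) are hypothetical, and `final` uses a small Gaussian elimination in place of a library solver.

```python
def init(k):
    """Initialize: (A, b) = (0, 0) for k variables."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def transition(state, x, y):
    """Transition: (A, b) = (A, b) + (x * x^T, x * y) for one row."""
    A, b = state
    for i in range(len(x)):
        b[i] += x[i] * y
        for j in range(len(x)):
            A[i][j] += x[i] * x[j]
    return A, b


def merge(s1, s2):
    """Merge: add two partial states component-wise."""
    (A1, b1), (A2, b2) = s1, s2
    k = len(b1)
    A = [[A1[i][j] + A2[i][j] for j in range(k)] for i in range(k)]
    return A, [b1[i] + b2[i] for i in range(k)]


def final(state):
    """Finalize: solve A * beta = b by Gaussian elimination with pivoting.

    Raises ZeroDivisionError if A is singular; a real implementation
    would handle that case explicitly.
    """
    A, b = state
    k = len(b)
    M = [A[i][:] + [b[i]] for i in range(k)]  # augmented matrix [A | b]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - tail) / M[i][i]
    return beta


def aggregate_on_node(rows, k):
    """Phase 1 on one node: initialize, then one transition per row."""
    state = init(k)
    for x, y in rows:
        state = transition(state, x, y)
    return state
```

In use, each node would call `aggregate_on_node` on its own rows, and the master would fold the returned states with `merge` before calling `final` once.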
Problem solved?
No, not yet.
ML Algorithms Based on SQL?
• Four Representative Challenges
1. Lack of portable multi-pass iterations
2. Roots in first-order logic
3. Lack of language support for linear algebra
4. Extensible SQL limited to small working sets
Need:
• Abstraction Layers
• A few compromises for user interface
1. Lack of portable multi-pass iterations
• WITH RECURSIVE is not a reliable basis for portability
• User-defined driver functions in Python
– Outer loops are not performance-critical
• Compromise: Different user interface
[Flowchart:]
CREATE TEMP TABLE temp
INSERT INTO temp SELECT step(...) FROM ...
SELECT converged(...) FROM temp, ...
    false: back to the INSERT step
    true: continue
SELECT result(...) FROM temp
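The driver pattern above can be sketched with Python's standard `sqlite3` module standing in for the RDBMS. Everything specific here is an assumption made for illustration: the table names `data` and `iter_state`, the column names, and the toy "step" (a damped update that moves the state toward the mean of `data.v`). Only the structure, an outer Python loop issuing one SQL statement per pass, mirrors the slide.

```python
import sqlite3


def run_driver(conn, step_size=0.5, tol=1e-9, max_iter=100):
    """Outer loop in Python; each pass over the data is a single SQL statement."""
    # Iteration state lives in a temporary table, not in Python variables.
    conn.execute(
        "CREATE TEMP TABLE iter_state (iteration INTEGER PRIMARY KEY, state REAL)"
    )
    conn.execute("INSERT INTO iter_state VALUES (0, 0.0)")
    for it in range(1, max_iter + 1):
        # "step": aggregate over all rows, reading the previous state.
        conn.execute(
            "INSERT INTO iter_state "
            "SELECT ?, MAX(s.state) - ? * AVG(s.state - d.v) "
            "FROM data AS d, iter_state AS s WHERE s.iteration = ?",
            (it, step_size, it - 1),
        )
        # "converged": compare the two most recent states inside the database.
        (done,) = conn.execute(
            "SELECT ABS(new_s.state - old_s.state) < ? "
            "FROM iter_state AS new_s, iter_state AS old_s "
            "WHERE new_s.iteration = ? AND old_s.iteration = ?",
            (tol, it, it - 1),
        ).fetchone()
        if done:
            break
    # "result": read the final state.
    (result,) = conn.execute(
        "SELECT state FROM iter_state ORDER BY iteration DESC LIMIT 1"
    ).fetchone()
    return result
```

The loop itself runs only a handful of times, which is why the slide treats the outer loop as not performance-critical: the expensive work happens inside the aggregate query.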
2. Roots in first-order logic
• Queries need to be cognizant of database objects
• Emulate higher-order logic by:
– dynamic execution of templated SQL
– abstraction-layer support
• Example: Distance or kernel functions
• On PostgreSQL, use of type REGPROC
FunctionHandle dist = args[0].getAs<FunctionHandle>();
return dist(x, y);
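A minimal sketch of the "templated SQL" route, again using `sqlite3` as a stand-in database. The function names `dist_l1` and `dist_sq_l2`, the `points` table, and the helper names are hypothetical. The distance function is chosen by name and substituted into the query text, so the name is checked against a whitelist first because it cannot be passed as a bound parameter.

```python
import sqlite3

# Hypothetical distance functions, keyed by the SQL name they are registered under.
DISTANCES = {
    "dist_l1": lambda x1, y1, x2, y2: abs(x1 - x2) + abs(y1 - y2),
    "dist_sq_l2": lambda x1, y1, x2, y2: (x1 - x2) ** 2 + (y1 - y2) ** 2,
}


def register_distances(conn):
    """Expose each Python function as a 4-argument SQL function."""
    for name, fn in DISTANCES.items():
        conn.create_function(name, 4, fn)


def closest_point(conn, dist_name, qx, qy):
    """Return the id of the point nearest to (qx, qy) under the named distance."""
    if dist_name not in DISTANCES:
        raise ValueError("unknown distance function: %r" % (dist_name,))
    # The function name is spliced into the template; values stay bound parameters.
    sql = (
        "SELECT id FROM points "
        "ORDER BY {dist}(x, y, ?, ?), id LIMIT 1"
    ).format(dist=dist_name)
    return conn.execute(sql, (qx, qy)).fetchone()[0]
```

The same caller can then switch metrics without any change to the query logic, which is the higher-order behavior that plain SQL does not offer directly.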
3. Lack of language support for linear algebra
• C++ Abstraction Layer uses Eigen
• (Dense) Vectors and matrices: DOUBLE PRECISION[]
• Example:
AnyType
solve::run(AnyType& args) {
    MappedMatrix A = args[0].getAs<MappedMatrix>();
    MappedColumnVector b = args[1].getAs<MappedColumnVector>();

    MutableMappedColumnVector x = allocateArray<double>(A.cols());
    x = A.colPivHouseholderQr().solve(b);
    return x;
}
Performance:
• No unnecessary copying
• No internal type conversion
4. Extensible SQL limited to small working sets
• Tables are the only portable option for large states
• Access from UDAs is slow or impossible
• Example: k-means benefits from explicit point-to-centroid assignments
– Problematic:
UPDATE points SET centroid_id = closest(state, coords)
– Requires own pass
– Not allowed in subqueries
– PostgreSQL legacy
MADlib Architecture
[Figure: layered architecture diagram. Components shown:]
• User Interface
• "Driver" Functions (outer loops of iterative algorithms, optimizer invocations)
• High-level Abstraction Layer (iteration controller, convex optimizers, ...)
• Row-level Functions (inner loops of streaming algorithms, convex optimization callbacks, ...)
• Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, ...)
• RDBMS Built-in Functions
• RDBMS Query Processing (Greenplum, PostgreSQL, …)
[Implementation-language labels in the figure: C++; Python; Python with templated SQL; SQL, generated from specification]
Anatomy of an iterative MADlib module
interState = Start(args)
Repeat
    In parallel for each segment:
        intraState = Initialize(interState)
        For each row
            intraState = Transit(intraState, row)
    For each intraState:
        intraState = Merge(oldIntraState, intraState)
    interState = Finalize(intraState)
Until Converged(interState)
Return End(interState)
[Figure annotations on the pseudocode: User-defined Aggregate; User-defined Function (twice); Python Driver Function]
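The pseudocode above maps onto ordinary functions. The sketch below follows the same shape in plain Python for a deliberately small problem: gradient descent on a one-coefficient model y ≈ w·x. The model, the state layout (dictionaries with keys `w`, `lr`, `delta`, `grad`, `n`), and the `fit` helper are assumptions chosen for illustration; segments are simulated as a list of row lists and processed sequentially rather than in parallel.

```python
from functools import reduce


def start(args):
    # Inter-iteration state: coefficient, learning rate, size of the last update.
    return {"w": args["w0"], "lr": args["lr"], "delta": float("inf")}


def initialize(inter):
    # Intra-iteration state: model to evaluate, gradient sum, and row count.
    return {"w": inter["w"], "lr": inter["lr"], "grad": 0.0, "n": 0}


def transit(intra, row):
    x, y = row
    intra["grad"] += (intra["w"] * x - y) * x  # gradient of 0.5 * (w*x - y)^2
    intra["n"] += 1
    return intra


def merge(left, right):
    return {
        "w": left["w"],
        "lr": left["lr"],
        "grad": left["grad"] + right["grad"],
        "n": left["n"] + right["n"],
    }


def finalize(intra):
    step = intra["lr"] * intra["grad"] / intra["n"]
    return {"w": intra["w"] - step, "lr": intra["lr"], "delta": abs(step)}


def converged(inter, tol=1e-10):
    return inter["delta"] < tol


def end(inter):
    return inter["w"]


def fit(segments, args, max_iter=10000):
    """Driver loop; assumes at least one segment and at least one row overall."""
    inter = start(args)
    for _ in range(max_iter):
        partials = []
        for segment in segments:  # conceptually one task per segment
            intra = initialize(inter)
            for row in segment:
                intra = transit(intra, row)
            partials.append(intra)
        inter = finalize(reduce(merge, partials))
        if converged(inter):
            break
    return end(inter)
```

In this arrangement `initialize`, `transit`, `merge`, and `finalize` play the role of the aggregate, while `fit` plays the role of the driver function.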
Performance Trends
• Disk I/O is not always the bottleneck
• Performance tuning is essential
• Overhead for single query very low (fraction of a second)
• Greenplum achieves nearly perfect speedup
[Figure: OLS on 10 million rows (in seconds). x-axis: # segments (6, 12, 18, 24). Series: # variables (20, 40, 80, 160). y-axis ticks: 0 to 40.]
Current Modules
Data Modeling
Supervised Learning
• Naive Bayes Classification
• Linear Regression
• Logistic Regression
• Decision Tree
• Random Forest
• Support Vector Machines
Unsupervised Learning
• Association Rules
• k-Means Clustering
• SVD Matrix Factorization
• Parallel Latent Dirichlet Allocation
Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Profile
Quantile
Support
Array Operations
Conjugate Gradient
Sparse Vectors
Probability Functions
Feature Extraction
Inferential Statistics
Hypothesis tests
My MADlib Experience:
A Testimonial.
Christopher Ré, Wisconsin
[Paper excerpt shown on slide:]
Towards a Unified Architecture for in-RDBMS Analytics
Xixuan Feng, Arun Kumar, Benjamin Recht, Christopher Ré
Department of Computer Sciences, University of Wisconsin-Madison
{xfeng, arun, brecht, chrisre}@cs.wisc.edu
ABSTRACT
The increasing use of statistical data analysis in enterprise applications has created an arms race among database vendors to offer ever more sophisticated in-database analytics.
[…] late 1990s and early 2000s, this brought a wave of data mining toolkits into the RDBMS. Several major vendors are again making an effort toward sophisticated in-database analytics with both open source efforts, e.g., the MADlib platform from Greenplum [18], and several projects at majo […]
Refining Ideas and Code
QA from GP helped us transition from paper to deployed code.
Conversations with GP (and Oracle) led us to better position our SIGMOD 12 paper.
MADlib is Open Source
Learning & Inference run on (GP or Postgres) + MADlib
Critical: it's free, open, and we can modify it
hazy.cs.wisc.edu & www.youtube.com/HazyResearch
Enhance Wikipedia with extracted facts from the Web (50+ TB of data)
Testimonial Summary
MADlib is open to contributions and open source
Questions?
http://madlib.net
Caleb Welton
Florian Schoppmann
Christopher Ré