DOE ASCR Applied Mathematics Principal Investigators' (PI) Meeting, Rockville, MD, September 11-12, 2017
Abstract
Productive and extreme-scale graph computations enabled by GraphBLAS
Ariful Azad and Aydın Buluç, Computational Research Division, Lawrence Berkeley National Laboratory
✓ The GraphBLAS-based approach separates computational kernels from high-level algorithms & applications.
✓ Boosts the productivity of applications significantly.
✓ Resulted in highly scalable algorithms for matching, ordering, connected components, maximal independent set and graph clustering.
Future directions and impacts:
❄ Short term: A GraphBLAS-compliant library with optimized in-node performance. Explore new domains.
❄ Medium term: Communication-avoiding algorithms for GraphBLAS primitives targeting future exascale systems. Enable new high-level algorithms.
❄ Long term: Provide easy-to-use graph libraries for biology, scientific computing & machine learning.
Areas needing high-performance graph computation:
✓ Machine Learning
✓ Computational Biology
✓ Scientific computing
✓ Quantum computing
We develop several classes of graph algorithms using linear-algebraic (GraphBLAS) primitives. Our in-house Combinatorial BLAS library enabled rapid development of bipartite graph matching, reverse Cuthill-McKee ordering, triangle counting, connected components and Markov clustering algorithms that scale to thousands of cores on modern supercomputers. These algorithms in turn empower key science applications including protein family detection and sparse linear solvers.
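The linear-algebraic formulation is easy to sketch. For instance, triangle counting (reference 5) reduces to one SpGEMM and an elementwise mask. A minimal serial illustration with scipy.sparse as a stand-in for the actual Combinatorial BLAS kernels (function and variable names are ours):

```python
import numpy as np
import scipy.sparse as sp

def count_triangles(A: sp.csr_matrix) -> int:
    """Count triangles of an undirected graph from its symmetric 0/1
    adjacency matrix A, via #triangles = sum(A .* (A @ A)) / 6."""
    B = (A @ A).multiply(A)   # B[i,j] = common neighbors of i and j, masked to edges
    return int(B.sum()) // 6  # each triangle is counted 6 times

# A 4-cycle 0-1-2-3-0 plus the chord 0-2 contains two triangles.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rows = [u for u, v in edges] + [v for u, v in edges]
cols = [v for u, v in edges] + [u for u, v in edges]
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))
print(count_triangles(A))  # 2
```

The mask `(A @ A).multiply(A)` is the key GraphBLAS idiom: the SpGEMM output is restricted to the nonzero pattern of A, which the parallel algorithm exploits to avoid materializing the full product.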
[Figure: strong scaling of three matching algorithms (approximate-weight perfect matching, maximum cardinality matching (MCM), maximal cardinality matching). Time (sec) vs. number of cores (16-2048) on Edison for ljournal, cage15, road_usa, nlpkkt200, hugetrace, delaunay_n24 and HV15R: 12x-18x speedups from an ~80x increase in cores.]
Fig 1. Algorithmic chain to be developed.
[Figure: run time of HipMCL vs. MCL (van Dongen). Time (s) vs. number of nodes (24 cores/node).]
(1) Distributed-memory graph matching
✓ A suite of parallel algorithms developed; sparse matrix-sparse vector multiply and an inverted index used.
✓ Scales to several thousands of cores.
✓ Impact: Removes the sequential-ordering bottleneck from SuperLU and STRUMPACK.
❄ Why graphs? Graph computation drives many applications in biology and scientific computing.
❄ Why GraphBLAS? The diversity and rapid evolution of applications, architectures & algorithms motivates us to isolate a small number of graph kernels entrusted with delivering high performance.
❄ Expected impacts: (a) better understanding of data and computational patterns, (b) rapid development of high-performance applications.
GraphBLAS primitives in increasing arithmetic intensity:
- Sparse Matrix-Sparse Vector (SpMSpV)
- Sparse Matrix-Dense Vector (SpMV)
- Sparse Matrix Times Multiple Dense Vectors (SpMM)
- Sparse-Sparse Matrix Product (SpGEMM)
- Sparse-Dense Matrix Product (SpDM3)
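To illustrate the lowest-intensity primitive: SpMSpV over a Boolean semiring is exactly the frontier-expansion step of breadth-first search. A sketch with scipy.sparse, which lacks true sparse vectors and semirings, so the frontier is stored densely here purely for illustration:

```python
import numpy as np
import scipy.sparse as sp

def bfs_levels(A: sp.csr_matrix, source: int) -> np.ndarray:
    """Level-synchronous BFS as repeated matrix-vector products over a
    Boolean (OR, AND) semiring: next frontier = A^T x, masked by the
    set of unvisited vertices."""
    n = A.shape[0]
    levels = np.full(n, -1)
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    level = 0
    while frontier.any():
        levels[frontier] = level
        reached = (A.T @ frontier.astype(np.int8)) > 0  # one-hop expansion
        frontier = reached & (levels == -1)             # mask out visited vertices
        level += 1
    return levels

# Path graph 0-1-2-3 (undirected).
edges = [(0, 1), (1, 2), (2, 3)]
rows = [u for u, v in edges] + [v for u, v in edges]
cols = [v for u, v in edges] + [u for u, v in edges]
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))
print(bfs_levels(A, 0))  # [0 1 2 3]
```

In GraphBLAS proper, both the frontier vector and the masking are sparse, which is what makes the primitive work-efficient when frontiers are small.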
Higher-level combinatorial and machine learning algorithms:
- Shortest paths (all-pairs, single-source, temporal)
- Graph clustering (Markov cluster, peer pressure, spectral, local)
- Centrality (PageRank, betweenness, closeness)
- Miscellaneous: connectivity, traversal (BFS), independent sets (MIS), graph matching
- Classification (support vector machines, logistic regression)
- Dimensionality reduction (NMF, PCA)
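As one concrete mapping from this list onto the primitives: PageRank is a sequence of SpMV operations. A minimal serial power-iteration sketch (ignoring the dangling-node correction; function names are ours):

```python
import numpy as np
import scipy.sparse as sp

def pagerank(A: sp.csr_matrix, d: float = 0.85, tol: float = 1e-10) -> np.ndarray:
    """PageRank by power iteration; the inner kernel of every step is a
    single SpMV with the row-stochastic transition matrix."""
    n = A.shape[0]
    out_deg = np.asarray(A.sum(axis=1)).ravel()
    P = sp.diags(np.where(out_deg > 0, 1.0 / np.maximum(out_deg, 1), 0.0)) @ A
    x = np.full(n, 1.0 / n)
    while True:
        x_new = d * (P.T @ x) + (1.0 - d) / n   # SpMV plus the teleport term
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new

# A directed 3-cycle 0 -> 1 -> 2 -> 0; by symmetry every rank is 1/3.
A = sp.csr_matrix((np.ones(3), ([0, 1, 2], [1, 2, 0])), shape=(3, 3))
print(np.round(pagerank(A), 3))  # [0.333 0.333 0.333]
```

Since each iteration touches the matrix exactly once, the scalability of the whole algorithm is inherited directly from the SpMV primitive.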
We develop graph and machine learning algorithms using linear-algebraic primitives. Two thrusts:
1. Develop communication-avoiding and work-efficient primitives (Aydin's poster).
2. Design algorithms using optimized primitives.
Results

Conclusions and Future Work
(2) Distributed-memory Markov clustering (HipMCL)
✓ Sparse matrix-matrix multiply, connected components, and k-select algorithms used.
✓ Scalable up to 136K cores on NERSC/Cori.
✓ Impact: Reduced a 45-day clustering job to just an hour. Clustered massive networks with 70B edges.
|V|  | |E| | HipMCL (cores) | MCL (shared memory)
69M  | 12B | 1.66 hr (24K)  | 45 days
70M  | 68B | 2.41 hr (136K) | X
282M | 37B | 3.23 hr (136K) | X
Fig. Performance of HipMCL w.r.t. a previous approach.
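The Markov cluster iteration that HipMCL distributes alternates expansion (one SpGEMM) with inflation (an elementwise power followed by column normalization) and pruning. A minimal serial sketch with scipy.sparse, purely illustrative of the kernels involved and not of HipMCL's distributed implementation:

```python
import numpy as np
import scipy.sparse as sp

def mcl(A: sp.csr_matrix, inflation: float = 2.0, iters: int = 50) -> sp.csr_matrix:
    """Serial Markov clustering: expansion is a sparse matrix-matrix
    product (SpGEMM); inflation raises entries elementwise and
    column-normalizes; tiny entries are pruned each round."""
    M = sp.csr_matrix(A + sp.eye(A.shape[0]))           # add self-loops
    M = sp.csr_matrix(M.multiply(1.0 / M.sum(axis=0)))  # column-stochastic
    for _ in range(iters):
        M = M @ M                                       # expansion (SpGEMM)
        M = M.power(inflation)                          # inflation
        M = sp.csr_matrix(M.multiply(1.0 / M.sum(axis=0)))
        M.data[M.data < 1e-8] = 0.0                     # prune tiny entries
        M.eliminate_zeros()
    return M

# Two triangles joined by one edge; MCL should find two clusters.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
rows = [u for u, v in edges] + [v for u, v in edges]
cols = [v for u, v in edges] + [u for u, v in edges]
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(6, 6))
attractor = mcl(A).toarray().argmax(axis=0)  # column j's attractor row
```

Columns that converge to the same attractor row form one cluster; here vertices 0-2 and 3-5 end up in separate clusters. The SpGEMM in the expansion step dominates the run time, which is why HipMCL's scalability rests on the distributed SpGEMM primitive.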
Fig. (a) Classes of algorithms, (b) performance of MCM.
Areas in which we can help
Areas in which we need help
References
✓ Biology & other domains to understand applications
✓ Programming languages and libraries (UPC, GASNet)
✓ Efficient file I/O (HDF5)
1. Azad, Buluç, and Pothen. Computing maximum cardinality matchings in parallel on bipartite graphs via tree-grafting. TPDS, 2017.
2. Azad, Jacquelin, Buluç, and Ng. The reverse Cuthill-McKee algorithm in distributed-memory. IPDPS, 2017.
3. Azad, Ballard, Buluç, Demmel, Grigori, Schwartz, Toledo, and Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing, 2016.
4. Azad and Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. IPDPS, 2016.
5. Azad, Buluç, and Gilbert. Parallel triangle counting and enumeration using matrix algebra. IPDPSW, 2015.
Motivation
Approach