• Distributed-memory graph applications exhibit irregular communication patterns, which makes them challenging to parallelize
• We study distributed-memory implementations of Community Detection (using the Louvain method) and Maximum Weight Matching (half-approximate method)
• Community detection partitions a graph into clusters (or communities) such that each cluster consists of vertices that are densely connected within the cluster and sparsely connected to the rest of the graph
• A matching in a graph is a subset of edges such that no two “matched” edges are incident on the same vertex
Distributed-memory Graph Algorithms: Case studies with Community Detection and Weighted Matching
Sayan Ghosh*, Mahantesh Halappanavar+, Ananth Kalyanaraman*, Assefaw Gebremedhin*, Antonino Tumeo+
*Washington State University, Pullman, WA  +Pacific Northwest National Laboratory, Richland, WA
About
Acknowledgements
Objective: To devise heuristics that improve execution time performance and/or quality.
• Goodness of partitioning is measured using a global metric called modularity (Q), which depends on the sums of intra- and inter-community edge weights
• In 2008, Blondel et al. introduced a multi-phase, iterative heuristic for modularity optimization, called the Louvain method
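To make the modularity metric concrete, here is a minimal serial sketch that evaluates the standard Newman-Girvan form, Q = Σ_c [in_c / 2m − (tot_c / 2m)²], where in_c counts intra-community edge weight in both directions and tot_c is the total weighted degree of community c. The edge-list input, the `community` dictionary, and the function name are assumptions made for illustration; the sketch also assumes an undirected graph without self-loops and is not the distributed implementation.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of an undirected weighted graph.

    edges:     iterable of (u, v, weight), each undirected edge listed once
    community: dict mapping vertex -> community id
    Assumes no self-loops (illustrative simplification).
    """
    edges = list(edges)
    m = sum(w for _, _, w in edges)      # total edge weight
    intra = defaultdict(float)           # intra-community weight, both directions
    total = defaultdict(float)           # sum of weighted degrees per community
    for u, v, w in edges:
        total[community[u]] += w
        total[community[v]] += w
        if community[u] == community[v]:
            intra[community[u]] += 2.0 * w
    return sum(intra[c] / (2.0 * m) - (total[c] / (2.0 * m)) ** 2 for c in total)

# Two triangles joined by one edge, unit weights.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in range(6)}
q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
```

Splitting the example graph into its two triangles scores higher than placing every vertex in a single community, which is the behaviour the metric is meant to reward.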
Louvain method for graph clustering
Contig Generation
Experiments conducted on NERSC Cori and Edison supercomputers
Performance: Community Detection
The research is in part supported by the U.S. DOE ExaGraph project, a collaborative effort of U.S. DOE SC and NNSA at DOE PNNL.
• S. Ghosh, M. Halappanavar, A. Tumeo, A. Kalyanaraman, H. Lu, D. Chavarría-Miranda, A. Khan, A. Gebremedhin, “Distributed Louvain Algorithm for Graph Community Detection”, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
• S. Ghosh, M. Halappanavar, A. Kalyanaraman, A. Tumeo, A. Gebremedhin, “miniVite: A Graph Analytics Benchmarking Tool for Massively Parallel Systems”, 2019 Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
• S. Ghosh, M. Halappanavar, A. Kalyanaraman, A. Khan, A. Gebremedhin, “Exploring MPI Communication Models for Graph Applications Using Graph Matching as a Case Study” [under review]
References
Heuristics for Community Detection
Objective: Implemented half-approx matching using MPI Send-Recv (NSR), Neighborhood collectives (NCL) and RMA.
Observed 2-3.5x speedup on 4-16K processes for both NCL and RMA versions relative to NSR
NCL/RMA is not efficient for this input
RMA performs at least 25-35% better than NSR and NCL
Large neighborhood results in poor performance
Performance: Half-approximate matching
Energy/Memory for matching on Cori
Within each iteration
• Compute ΔQ when a vertex migrates
• Move the vertex from its current community to the one that yields max ΔQ
At the end of a phase, the graph is rebuilt
Phase continues until ΔQ between successive iterations is below a threshold
Initially, each vertex is assigned to a separate community
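The steps listed above for a single phase can be sketched serially as follows. This is an illustrative reading, not the distributed implementation: the adjacency-dictionary input, the function names, and the default threshold value are assumptions, and the graph rebuild at the end of the phase is omitted. The gain used to pick a target community is the standard expression for inserting an isolated vertex v into community c, which is proportional to k_v,in(c) − tot_c · k_v / 2m.

```python
from collections import defaultdict

def modularity(adj, comm):
    """Q for adj = {v: {u: weight}} (symmetric) and comm = {v: community id}."""
    two_m = sum(w for v in adj for w in adj[v].values())
    intra, total = defaultdict(float), defaultdict(float)
    for v in adj:
        for u, w in adj[v].items():
            total[comm[v]] += w
            if comm[u] == comm[v]:
                intra[comm[v]] += w
    return sum(intra[c] / two_m - (total[c] / two_m) ** 2 for c in total)

def louvain_phase(adj, threshold=1e-6):
    """One Louvain phase (no graph rebuild). threshold is a hypothetical default."""
    comm = {v: v for v in adj}                    # each vertex starts alone
    k = {v: sum(adj[v].values()) for v in adj}    # weighted degrees
    tot = dict(k)                                 # degree sum per community
    two_m = sum(k.values())
    prev_q = modularity(adj, comm)
    while True:
        for v in adj:
            old = comm[v]
            tot[old] -= k[v]                      # take v out of its community
            links = defaultdict(float)            # weight from v to each neighbor community
            for u, w in adj[v].items():
                if u != v:
                    links[comm[u]] += w
            best = old
            best_gain = links.get(old, 0.0) - tot[old] * k[v] / two_m
            for c, w_in in links.items():
                gain = w_in - tot[c] * k[v] / two_m
                if gain > best_gain:              # strict: ties keep v where it is
                    best, best_gain = c, gain
            comm[v] = best
            tot[best] += k[v]
        q = modularity(adj, comm)
        if q - prev_q < threshold:                # gain between iterations too small
            return comm, q
        prev_q = q

# Two unit-weight triangles joined by the edge (2, 3).
adj = {0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1, 3: 1},
       3: {2: 1, 4: 1, 5: 1}, 4: {3: 1, 5: 1}, 5: {3: 1, 4: 1}}
comm, q = louvain_phase(adj)
```

In a distributed setting each process would run the inner loop on its own vertices with possibly stale community information for remote neighbors, which is where the communication cost arises.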
• In the first phase, the initial set of locally dominant edges is identified and added to the matching set M
• The next phase is iterative: for each vertex in M, its unmatched neighboring vertices are matched
N’v represents unmatched vertices in v’s neighborhood
The vertex with the heaviest unmatched edge incident on v is referred to as v’s mate
The mate of a vertex can change, as it may try to match with multiple vertices in its neighborhood
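The two phases described above can be sketched serially with a pointer-based, locally dominant matching. This is a minimal sketch under stated assumptions, not the MPI implementation: the adjacency-dictionary input and all function names are hypothetical, and ties between equal weights are broken by vertex ids so that a locally dominant edge always exists.

```python
from collections import deque

def half_approx_matching(adj):
    """Locally dominant edge matching for adj = {v: {u: weight}} (symmetric).

    Returns a dict mapping each matched vertex to its partner.
    Vertex ids must be mutually comparable (used only to break ties).
    """
    mate = {}         # final partners
    candidate = {}    # current preferred unmatched neighbor of each vertex
    queue = deque()   # newly matched vertices whose neighbors need a re-check

    def edge_key(u, v):
        return (adj[u][v], min(u, v), max(u, v))   # total order on edges

    def try_match(v):
        free = [u for u in adj[v] if u != v and u not in mate]
        u = max(free, key=lambda x: edge_key(v, x)) if free else None
        candidate[v] = u
        if u is not None and candidate.get(u) == v:   # both ends prefer this edge
            mate[u], mate[v] = v, u
            queue.extend((u, v))

    for v in adj:                 # phase 1: initial locally dominant edges
        try_match(v)
    while queue:                  # phase 2: neighbors of matched vertices retry
        v = queue.popleft()
        for x in adj[v]:
            if x not in mate and candidate.get(x) == v:
                try_match(x)
    return mate

# Path 0-1-2-3 with a heavy middle edge.
path = {0: {1: 2}, 1: {0: 2, 2: 3}, 2: {1: 3, 3: 2}, 3: {2: 2}}
matching = half_approx_matching(path)
matched_weight = sum(path[u][v] for u, v in matching.items()) / 2
```

On this path the sketch selects only the middle edge (weight 3), while the optimum picks the two outer edges (weight 4), which illustrates the half-approximation guarantee.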
[Figure: vertices i, j, k and their communities C(i), C(j), C(k)]
Penalize a vertex in every iteration in which it stays in the same community; it eventually becomes immobile if the cumulative penalty falls below a cutoff (another variant requires global communication).
When α is close to 1, this scheme is more aggressive in terminating vertices, whereas α close to 0 approaches the baseline case.
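One plausible reading of the early-termination penalty is a per-vertex score that starts at 1 and is multiplied by (1 − α) whenever the vertex stays put. The sketch below follows that reading; the multiplicative update, the 0.02 cutoff, the choice to leave the score unchanged when a vertex moves, and the function names are all assumptions made for illustration.

```python
def penalize(score, stayed, alpha):
    """Decay the score of a vertex that stayed in its community (assumed rule)."""
    return score * (1.0 - alpha) if stayed else score

def iterations_until_immobile(alpha, cutoff=0.02, max_iters=1000):
    """Iterations a never-moving vertex survives before its score drops below
    the cutoff; returns None if that does not happen within max_iters."""
    score = 1.0
    for iteration in range(1, max_iters + 1):
        score = penalize(score, stayed=True, alpha=alpha)
        if score < cutoff:
            return iteration
    return None

aggressive = iterations_until_immobile(alpha=0.75)
gentle = iterations_until_immobile(alpha=0.25)
baseline = iterations_until_immobile(alpha=0.0)
```

Under these assumptions a larger α freezes an idle vertex after fewer iterations, and α = 0 never freezes it, matching the baseline behaviour described above.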
Without coloring, a negative-gain scenario is possible in parallel (since processes work with outdated information). We color a fraction of the vertices with a preselected number of color classes using the Jones-Plassmann algorithm.
Coloring is expensive in distributed memory, as an entire Louvain iteration needs to be invoked per color class, increasing the number of communication calls.
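For readers unfamiliar with the Jones-Plassmann algorithm, the following serial simulation shows its round structure: every vertex gets a random priority, and in each round the uncolored vertices that outrank all of their uncolored neighbors take the smallest color not used by a neighbor. This sketch colors the whole graph rather than a fraction of it, and the input format, seed, and function name are assumptions.

```python
import random

def jones_plassmann(adj, seed=0):
    """Color adj = {v: iterable of neighbors} so adjacent vertices differ."""
    rng = random.Random(seed)
    # Random priority per vertex; the vertex id breaks (unlikely) ties.
    priority = {v: (rng.random(), v) for v in adj}
    color = {}
    uncolored = set(adj)
    while uncolored:
        # Vertices that dominate all uncolored neighbors form an independent
        # set, so in a parallel setting they can be colored concurrently.
        winners = [v for v in uncolored
                   if all(priority[v] > priority[u]
                          for u in adj[v] if u in uncolored and u != v)]
        for v in winners:
            used = {color[u] for u in adj[v] if u in color}
            c = 0
            while c in used:      # smallest color not used by a neighbor
                c += 1
            color[v] = c
        uncolored.difference_update(winners)
    return color

# A cycle on five vertices needs three colors.
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
coloring = jones_plassmann(cycle)
```

Vertices sharing a color have no edges between them, so a Louvain iteration restricted to one color class never moves two adjacent vertices at the same time.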
Initially, when the graph is large, increasing τ leads to a quicker exit per phase.
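A threshold-cycling schedule could look like the sketch below, which starts with a large τ while the graph is large and steps through smaller values before wrapping around. The specific threshold values, the number of phases each value is held, and the function name are hypothetical and chosen only to illustrate the idea.

```python
def threshold_for_phase(phase, schedule=(1e-3, 1e-4, 1e-5, 1e-6), hold=2):
    """Return the threshold tau for a given phase number.

    schedule and hold are hypothetical: each value is used for `hold`
    consecutive phases, then the schedule cycles back to the start.
    """
    return schedule[(phase // hold) % len(schedule)]

taus = [threshold_for_phase(p) for p in range(10)]
```

Early phases then exit quickly with a coarse threshold, and later phases on the smaller, rebuilt graph use a tighter one.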
Early Termination    Threshold Cycling    Incomplete coloring
Communication characteristics on NERSC Cori (1K processes): Matching, Community detection, Graph500 BFS
Friendster (1.8B edges)    R-MAT graph (2.1B edges)
Execution time    Modularity
Performance of Graph Challenge input graphs on 16 nodes of Edison (except the 200K case, which was run on 1 node). Coloring performance is ~8-10x worse, modularity improves by an equal factor!
Observed 2-46x speedup relative to a parallel baseline version on real-world graphs!
Implemented communication-intensive parts using MPI collectives (COLL), blocking Send-Recv (SR), nonblocking Send-Recv (NBSR) and RMA. Observed 4-18% divergence in performance across versions.
Versions yielding the best performance over the Send-Recv baseline version (run on 512-16K processes) for input graphs.
• Average memory consumption for NCL is the least, ~1.03-2.3x less than NSR, ~9-27% less than RMA
• Overall node energy consumption of NSR is about 4x that of NCL and RMA for Friendster
With color    W/O color
Maximum weight matching