High-Performance Graph Processing Programming Model on the GPU
Yangzihao Wang
January 29, 2015
University of California, Davis
Topics
Context: Related work on parallel graph processing
Current: Design and implementation of Gunrock
Future: Research problems and next steps
Parallel Large Graph Analytics
Single-node CPU | Distributed CPU | GPU Hardwired | GPU Library
The trade-off between programmability and performance
Challenges for Graph Processing on the GPU
∙ The irregularity of data access/control flow
∙ The complexity of programming GPUs
Goal of Gunrock
Deliver the performance of GPU hardwired graph primitives with a high-level programming model that allows programmers to quickly develop new graph primitives.
Preliminaries
A graph is a tuple G = (V, E, w_e, w_v) comprising a set of vertices V together with a set of edges E, where E ⊆ V × V, with optional edge weights w_e and vertex weights w_v.
Our primary working set is a frontier. A vertex frontier is a subset of vertices U ⊆ V and an edge frontier a subset of edges I ⊆ E.
Methods: Build High-Level Abstraction
Most graph algorithms have two major operations:
Traverse: Update a frontier by traversing the graph or subsetting the current frontier.
Compute: Perform computation on edges or vertices.
Methods: Build High-Level Abstraction
Two ways to traverse:
Advance: Generate a new frontier by visiting the neighbors of the current frontier.
Filter: Choose a subset of the current frontier based on programmer-specified criteria.
[Figure: example graph with vertices 1–7 and edges e1–e9, illustrating Advance and Filter steps]
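The two operators can be sketched in Python as sequential loops standing in for Gunrock's data-parallel GPU kernels; the graph layout and the per-edge/per-vertex predicates below are illustrative, not Gunrock's actual API:

```python
# Illustrative sketch (not Gunrock's API): the two traversal operators
# written sequentially. On the GPU, each loop body runs in parallel.

def advance(graph, frontier, apply_edge):
    """Generate a new frontier by visiting neighbors of the current frontier."""
    out = []
    for v in frontier:
        for u in graph[v]:          # one work item per outgoing edge
            if apply_edge(v, u):    # per-edge functor decides inclusion
                out.append(u)
    return out

def filter_(frontier, keep):
    """Choose a subset of the current frontier by a programmer-specified test."""
    return [v for v in frontier if keep(v)]

# Hypothetical graph with vertices 1..7 (edges are not the figure's exact e1-e9).
graph = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [6, 7], 6: [], 7: []}
frontier = advance(graph, [1], lambda src, dst: True)   # -> [2, 3]
frontier = filter_(frontier, lambda v: v % 2 == 1)      # keep odd ids -> [3]
```

Every Gunrock primitive in the following slides is assembled from these two traversal steps plus per-element compute.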
Methods: Build High-Level Abstraction
Operator flows of five graph primitives (traversal steps interleaved with computation):
∙ BFS: Advance → Filter (update label value; remove redundant)
∙ SSSP: Advance → Filter (update label value; remove redundant; near/far pile)
∙ BC: Advance (accumulate sigma value) → Filter (remove redundant); then Advance (compute BC value)
∙ CC: Filter (for e = (v1, v2), assign c[v1] to c[v2]; remove e when c[v1] == c[v2]) → Filter (for v, assign c[v] to c[c[v]]; remove v when c[v] == c[c[v]])
∙ PR: Advance (distribute PR value to neighbors) → Filter (update PR value; remove when PR value converges)
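As a concrete instance of the BFS flow above, here is a sequential Python sketch (not Gunrock code) that alternates an Advance that expands neighbors with a Filter that updates labels and removes redundant vertices:

```python
def bfs(graph, source):
    """BFS as repeated Advance (expand neighbors) + Filter
    (set label value, drop already-visited i.e. redundant vertices)."""
    labels = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        # Advance: visit all neighbors of the current frontier
        candidates = [u for v in frontier for u in graph[v]]
        # Filter: keep each unvisited vertex once, updating its label
        next_frontier = []
        for u in candidates:
            if u not in labels:          # remove redundant visits
                labels[u] = depth
                next_frontier.append(u)
        frontier = next_frontier
    return labels

graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
# bfs(graph, 0) -> {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
```

SSSP, BC, CC, and PR follow the same pattern with different per-edge and per-vertex functors.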
Methods: Develop Low-Level Optimization
∙ Workload Mapping for Advance
∙ Improving Work Efficiency
∙ Primitive-Specific Optimization
Methods: Develop Low-Level Optimization
Figure: workload mapping strategies (Merrill et al., PPoPP ’12):
∙ Per-thread Advance of small neighbor lists;
∙ Warp-cooperative Advance of medium neighbor lists;
∙ Block-cooperative Advance of large neighbor lists.
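The idea behind this mapping can be sketched by bucketing frontier vertices by neighbor-list length; the thresholds below mirror common CUDA launch parameters (warp of 32 threads, block of 256) and are illustrative, not Gunrock's actual tuning values:

```python
# Illustrative bucketing: assign each frontier vertex's neighbor list to a
# per-thread, per-warp, or per-block expansion strategy by its length.

WARP_SIZE = 32    # threads in a CUDA warp
BLOCK_SIZE = 256  # a typical CUDA block size (assumption)

def map_workload(graph, frontier):
    small, medium, large = [], [], []
    for v in frontier:
        degree = len(graph[v])
        if degree < WARP_SIZE:
            small.append(v)    # a single thread expands the list
        elif degree < BLOCK_SIZE:
            medium.append(v)   # a whole warp cooperates on the list
        else:
            large.append(v)    # a whole block cooperates on the list
    return small, medium, large

graph = {0: list(range(4)), 1: list(range(100)), 2: list(range(1000))}
# map_workload(graph, [0, 1, 2]) -> ([0], [1], [2])
```

Grouping lists of similar size keeps threads within a warp or block doing similar amounts of work, which is the point of the mapping.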
Methods: Develop Low-Level Optimization
[Figure: push-based vs. pull-based Advance]
∙ Push-based Advance: each vertex in the input frontier (v1, v2, v3) visits its neighbors and writes the unvisited ones to the output frontier; explored edges that reach already-labeled vertices are failures (wasted work).
∙ Pull-based Advance: each unvisited vertex (v4–v7) checks its incoming neighbors against a frontier bitmap; an edge from a frontier vertex is valid and puts the vertex into the output frontier.
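The two strategies, plus a heuristic for switching between them in the style of direction-optimizing BFS, can be sketched as follows; the 5% switch threshold is an illustrative assumption, not Gunrock's actual heuristic:

```python
def push_advance(graph, frontier, visited):
    """Push: each frontier vertex writes its unvisited neighbors."""
    out = set()
    for v in frontier:
        for u in graph[v]:
            if u not in visited:    # edges into visited vertices are wasted work
                out.add(u)
    return out

def incoming(graph, v):
    """Incoming neighbors of v (a CSC column in a real implementation)."""
    return [u for u, nbrs in graph.items() if v in nbrs]

def pull_advance(graph, frontier, visited):
    """Pull: each unvisited vertex scans incoming neighbors for a frontier hit."""
    in_frontier = set(frontier)     # stands in for the frontier bitmap
    out = set()
    for v in graph:
        if v in visited:
            continue
        if any(u in in_frontier for u in incoming(graph, v)):
            out.add(v)              # may stop at the first hit
    return out

def direction_optimizing_advance(graph, frontier, visited, threshold=0.05):
    # Pull tends to win once the frontier covers a large fraction of the graph.
    if len(frontier) > threshold * len(graph):
        return pull_advance(graph, frontier, visited)
    return push_advance(graph, frontier, visited)
```

Push does work proportional to frontier out-degree; pull can terminate each vertex's scan early, which pays off on large frontiers.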
Methods: Develop Low-Level Optimization
Other optimizations: priority queues, idempotent operations, filter optimizations, output frontier storage, etc.
Software Framework
Library Code:
∙ Utilities
∙ Enactor (CUDA kernel entry)
∙ Operators: Advance, Filter, Priority Queue
∙ Traversal optimizations: dynamic cooperative, load-balanced partitioning, direction-optimal
Application Code:
∙ Functor: Cond, Apply
∙ Problem: GraphData, AppData
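The split between Problem (data) and Functor (Cond/Apply) might look like this; the following is a schematic Python stand-in for Gunrock's C++/CUDA interface, with class and method names chosen for illustration:

```python
# Schematic of the framework split: a Problem holds graph + per-app data;
# a Functor supplies Cond (test) and Apply (update); the Enactor would
# launch Advance/Filter kernels over them on the GPU.

class BFSProblem:
    def __init__(self, graph, source):
        self.graph = graph             # GraphData
        self.labels = {source: 0}      # AppData

class BFSFunctor:
    @staticmethod
    def cond(problem, src, dst):
        """Cond: keep the edge only if dst is unvisited."""
        return dst not in problem.labels

    @staticmethod
    def apply(problem, src, dst):
        """Apply: set dst's label from src's label."""
        problem.labels[dst] = problem.labels[src] + 1

def advance(problem, frontier, functor):
    """Sequential stand-in for the Advance operator kernel."""
    out = []
    for src in frontier:
        for dst in problem.graph[src]:
            if functor.cond(problem, src, dst):
                functor.apply(problem, src, dst)
                out.append(dst)
    return out

problem = BFSProblem({0: [1, 2], 1: [2], 2: []}, source=0)
frontier = advance(problem, [0], BFSFunctor)   # -> [1, 2]
```

The point of the split is that application writers supply only the Problem data and the Cond/Apply pair, while the library owns traversal and load balancing.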
Performance Evaluation
∙ 10x better than BGL and PowerGraph;
∙ Always better than other programmable GPU libraries;
∙ On par with Ligra and hardwired GPU implementations.
Open Questions:
∙ What is the right programming model? (expressivity, mutable graphs, and a performance model)
∙ How to expand the current framework? (to Graph BLAS, multi-node GPUs, and out-of-core)
Open Questions:
What is the right programming model for graph processing on the GPU?
∙ Gunrock: Traversal:Advance (enumerate neighbors, load balancing, compute new frontier) → Compute (update label values) → Traversal:Filter (mark valid, compact input frontier to output frontier)
∙ PowerGraph: Gather + Apply, Scatter, with vertex-cut
∙ Pregel: GetValue, MutableValue, SendMsgTo, GetOutEdgeIterator, VoteToHalt
∙ Ligra: EdgeMap (including Update), VertexMap (including Reset)
∙ Medusa: ELIST, Combiner, VERTEX
Open Questions:
Expressivity: Does the current model cover all operations? How difficult is it to express an operation using the current model?
Primitive | Library    | Description                            | Gunrock Implementation
Filter    | HelP       | Remove elements from frontier          | Filtering
FS        | HelP       | Form Supervertex                       | Filtering+Sort+Reduce+Scan+Advance+Filtering+Sort
ANV       | HelP       | Aggregating Neighbor Values            | Advance
LUV       | HelP       | Local Update of Vertices               | Filtering
UVUOV     | HelP       | Update Vertices Using One Other Vertex | Filtering
AGV       | HelP       | Aggregate Global Value                 | Filtering
Gather    | PowerGraph | Aggregating Neighbor Values            | Advance
Apply     | PowerGraph | Local Update of Vertices               | Filtering
Scatter   | PowerGraph | Update Neighbor Values                 | Advance
Open Questions:
Expressivity: How to support mutable graphs? Two sources of mutability:
Algorithm: MST, clustering, mesh refinement, etc. need new operators such as mergeEdge, formSuperVertices, reshapeSubgraph, ...
Data: Incremental computation of node ranking or betweenness centrality in time-series graphs. New users and new links in Twitter’s social graph appear constantly. How do we support that?
Open Questions:
Performance Model: How to build a performance model that helps us improve the current programming model?
∙ Runtime as a function of # iterations;
∙ Runtime as a function of {edges, vertices, etc.} touched/traversed;
∙ Runtime as a function of graph parameters.
Open Questions:
Performance Model: How to build a performance model that helps us improve the current programming model?
∙ How efficient is the programming model at exploiting parallelism?
∙ How does performance change as parallelism increases?
∙ How well do we do on low-level GPU performance metrics (e.g., degree of memory coalescing, branch coherence)?
∙ How much computational work/memory bandwidth do we incur?
∙ How to shift workload from memory access to computation?
Open Questions:
Graph BLAS style: How to fit in?
Sparse matrix–sparse vector multiplication (SpMSpV): linear combination of the columns of the matrix specified by the nonzero elements of the sparse vector.
[Figure: SpMSpV (C = A × B) with a sparse accumulator (SPA): gather, then scatter/accumulate. Buluc and Gilbert, arXiv:1109.3739]
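The column-combination view of SpMSpV can be sketched with a dictionary playing the role of the sparse accumulator (SPA); column-major (CSC-style) storage of A is assumed:

```python
# SpMSpV sketch: y = A * x where A is stored by columns ({col: {row: value}})
# and x is sparse ({index: value}). Only the columns of A selected by the
# nonzeros of x are touched; a dict serves as the sparse accumulator (SPA).

def spmspv(A_cols, x):
    spa = {}
    for j, xj in x.items():                       # gather the selected columns
        for i, aij in A_cols.get(j, {}).items():
            spa[i] = spa.get(i, 0) + aij * xj     # scatter/accumulate
    return spa

# In BFS-as-linear-algebra, A is the (transposed) adjacency matrix and x the
# frontier; the nonzeros of y form the next frontier.
A_cols = {0: {1: 1, 2: 1}, 1: {3: 1}, 2: {3: 1}}
x = {0: 1}
# spmspv(A_cols, x) -> {1: 1, 2: 1}
```

Fitting Gunrock in would mean mapping Advance to this gather/scatter-accumulate pattern and Filter to masking the output vector.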
Open Questions:
Beyond a single GPU: How to utilize more computing power and larger memory space?
∙ Single-node multi-GPU
∙ Out-of-core
∙ Multi-node GPUs
Open Questions:
Single-Node Multi-GPU
Approach:
∙ Partition into subgraphs; duplicate remote nodes
∙ Multithreaded host program with multi-stream support
∙ Reuse single-node code; partitioner as a plug-in
Issues:
∙ Best partitioner yet to be found
∙ Runtime bounded by diameter and # of iterations
∙ Scaling factor drops?
Open Questions:
Single-Node Multi-GPU
Results:
∙ BFS scales across multiple GPUs when #iterations is small (<10), #edges is high (>~100M), and average degree is high (>80). For the best case, 3.7x speedup using 6 GPUs (kron_n21).
∙ SSSP: 1.3x speedup using 2 GPUs; BC: 2.7x speedup using 4 GPUs.
Open Questions:
Out-of-Core
Approach:
∙ Borrow the partitioner + single-node Gunrock pattern
∙ Overlap data movement with computation
Issue:
∙ Communication cost between data chunks would still take up the majority of the time.
Open Questions:
Multi-Node Multi-GPU
∙ Target: a scalable multi-node layer using MPI
∙ Approach: add network communication operations to the single-node multi-GPU version
∙ Issue: global partitioner and auxiliary arrays; make good use of NVLink
Acknowledgment
Gunrock team: Yuechao Pan, Yuduo Wu, Carl Yang, Andrew Davidson, Andy Riffel, and John D. Owens
Royal Caliber team: Erich Elsen and Vishal Vaidyanathan
NVIDIA: Duane Merrill, Sean Baxter, the amazing GPU cluster, and all other general technical support
DARPA: XDATA Program, Eric Whyne of Data Tactics Corporation, and Dr. Christopher White
Questions?