High-Performance Graph Processing Programming Model on the GPU
Yangzihao Wang
January 29, 2015
University of California, Davis
Topics
Context: Related work on parallel graph processing
Current: Design and implementation of Gunrock
Future: Research problems and next steps
Parallel Large Graph Analytics
Single-node CPU | Distributed CPU | GPU Hardwired | GPU Library
The trade-off between programmability and performance
Challenges for Graph Processing on the GPU
∙ The irregularity of data access/control flow
∙ The complexity of programming GPUs
Goal of Gunrock
Deliver the performance of GPU hardwired graph primitives with a high-level programming model that allows programmers to quickly develop new graph primitives.
Preliminaries
A graph is a tuple G = (V, E, w_e, w_v) comprising a set of vertices V together with a set of edges E, where E ⊆ V × V, with optional edge weights w_e and vertex weights w_v.
Our primary working set is a frontier. A vertex frontier is a subset of vertices U ⊆ V and an edge frontier a subset of edges I ⊆ E.
Methods: Build High-Level Abstraction
Most graph algorithms have two major operations:
Traverse: Update a frontier by traversing the graph or subsetting the current frontier.
Compute: Perform computation on edges or vertices.
Methods: Build High-Level Abstraction
Two ways to traverse:
Advance: Generate a new frontier by visiting the neighbors of the current frontier.
Filter: Choose a subset of the current frontier based on programmer-specified criteria.
[Figure: example graph with vertices 1–7 and edges e1–e9, illustrating Advance and Filter steps]
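The two operators can be sketched in Python as sequential loops standing in for Gunrock's data-parallel GPU kernels; the graph layout and the per-edge/per-vertex predicates below are illustrative, not Gunrock's actual API:

```python
# Illustrative sketch (not Gunrock's API): the two traversal operators
# written sequentially. On the GPU, each loop body runs in parallel.

def advance(graph, frontier, apply_edge):
    """Generate a new frontier by visiting neighbors of the current frontier."""
    out = []
    for v in frontier:
        for u in graph[v]:          # one work item per outgoing edge
            if apply_edge(v, u):    # per-edge functor decides inclusion
                out.append(u)
    return out

def filter_(frontier, keep):
    """Choose a subset of the current frontier by a programmer-specified test."""
    return [v for v in frontier if keep(v)]

# Hypothetical graph with vertices 1..7 (edges are not the figure's exact e1-e9).
graph = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [6, 7], 6: [], 7: []}
frontier = advance(graph, [1], lambda src, dst: True)   # -> [2, 3]
frontier = filter_(frontier, lambda v: v % 2 == 1)      # keep odd ids -> [3]
```

Every Gunrock primitive in the following slides is assembled from these two traversal steps plus per-element compute.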
Methods: Build High-Level Abstraction
Operator flows of five graph primitives (traversal steps interleaved with computation):
∙ BFS: Advance → Filter (update label value; remove redundant)
∙ SSSP: Advance → Filter (update label value; remove redundant; near/far pile)
∙ BC: Advance (accumulate sigma value) → Filter (remove redundant); then Advance (compute BC value)
∙ CC: Filter (for e = (v1, v2), assign c[v1] to c[v2]; remove e when c[v1] == c[v2]) → Filter (for v, assign c[v] to c[c[v]]; remove v when c[v] == c[c[v]])
∙ PR: Advance (distribute PR value to neighbors) → Filter (update PR value; remove when PR value converges)
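As a concrete instance of the BFS flow above, here is a sequential Python sketch (not Gunrock code) that alternates an Advance that expands neighbors with a Filter that updates labels and removes redundant vertices:

```python
def bfs(graph, source):
    """BFS as repeated Advance (expand neighbors) + Filter
    (set label value, drop already-visited i.e. redundant vertices)."""
    labels = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        # Advance: visit all neighbors of the current frontier
        candidates = [u for v in frontier for u in graph[v]]
        # Filter: keep each unvisited vertex once, updating its label
        next_frontier = []
        for u in candidates:
            if u not in labels:          # remove redundant visits
                labels[u] = depth
                next_frontier.append(u)
        frontier = next_frontier
    return labels

graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
# bfs(graph, 0) -> {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
```

SSSP, BC, CC, and PR follow the same pattern with different per-edge and per-vertex functors.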
Methods: Develop Low-Level Optimization
∙ Workload Mapping for Advance
∙ Improving Work Efficiency
∙ Primitive-Specific Optimization
Methods: Develop Low-Level Optimization
Figure: workload mapping strategies (Merrill et al., PPoPP ’12):
∙ Per-thread Advance of small neighbor lists;
∙ Warp-cooperative Advance of medium neighbor lists;
∙ Block-cooperative Advance of large neighbor lists.
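The idea behind this mapping can be sketched by bucketing frontier vertices by neighbor-list length; the thresholds below mirror common CUDA launch parameters (warp of 32 threads, block of 256) and are illustrative, not Gunrock's actual tuning values:

```python
# Illustrative bucketing: assign each frontier vertex's neighbor list to a
# per-thread, per-warp, or per-block expansion strategy by its length.

WARP_SIZE = 32    # threads in a CUDA warp
BLOCK_SIZE = 256  # a typical CUDA block size (assumption)

def map_workload(graph, frontier):
    small, medium, large = [], [], []
    for v in frontier:
        degree = len(graph[v])
        if degree < WARP_SIZE:
            small.append(v)    # a single thread expands the list
        elif degree < BLOCK_SIZE:
            medium.append(v)   # a whole warp cooperates on the list
        else:
            large.append(v)    # a whole block cooperates on the list
    return small, medium, large

graph = {0: list(range(4)), 1: list(range(100)), 2: list(range(1000))}
# map_workload(graph, [0, 1, 2]) -> ([0], [1], [2])
```

Grouping lists of similar size keeps threads within a warp or block doing similar amounts of work, which is the point of the mapping.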
Methods: Develop Low-Level Optimization
[Figure: push-based vs. pull-based Advance]
∙ Push-based Advance: each vertex in the input frontier (v1, v2, v3) visits its neighbors and writes the unvisited ones to the output frontier; explored edges that reach already-labeled vertices are failures (wasted work).
∙ Pull-based Advance: each unvisited vertex (v4–v7) checks its incoming neighbors against a frontier bitmap; an edge from a frontier vertex is valid and puts the vertex into the output frontier.
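The two strategies, plus a heuristic for switching between them in the style of direction-optimizing BFS, can be sketched as follows; the 5% switch threshold is an illustrative assumption, not Gunrock's actual heuristic:

```python
def push_advance(graph, frontier, visited):
    """Push: each frontier vertex writes its unvisited neighbors."""
    out = set()
    for v in frontier:
        for u in graph[v]:
            if u not in visited:    # edges into visited vertices are wasted work
                out.add(u)
    return out

def incoming(graph, v):
    """Incoming neighbors of v (a CSC column in a real implementation)."""
    return [u for u, nbrs in graph.items() if v in nbrs]

def pull_advance(graph, frontier, visited):
    """Pull: each unvisited vertex scans incoming neighbors for a frontier hit."""
    in_frontier = set(frontier)     # stands in for the frontier bitmap
    out = set()
    for v in graph:
        if v in visited:
            continue
        if any(u in in_frontier for u in incoming(graph, v)):
            out.add(v)              # may stop at the first hit
    return out

def direction_optimizing_advance(graph, frontier, visited, threshold=0.05):
    # Pull tends to win once the frontier covers a large fraction of the graph.
    if len(frontier) > threshold * len(graph):
        return pull_advance(graph, frontier, visited)
    return push_advance(graph, frontier, visited)
```

Push does work proportional to frontier out-degree; pull can terminate each vertex's scan early, which pays off on large frontiers.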
Methods: Develop Low-Level Optimization
Other optimizations: priority queues, idempotent operations, filter optimizations, output frontier storage, etc.
Software Framework
Library Code:
∙ Utilities
∙ Enactor (CUDA kernel entry)
∙ Operators: Advance, Filter, Priority Queue
∙ Traversal optimizations: dynamic cooperative, load-balanced partitioning, direction-optimal
Application Code:
∙ Functor: Cond, Apply
∙ Problem: GraphData, AppData
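The split between Problem (data) and Functor (Cond/Apply) might look like this; the following is a schematic Python stand-in for Gunrock's C++/CUDA interface, with class and method names chosen for illustration:

```python
# Schematic of the framework split: a Problem holds graph + per-app data;
# a Functor supplies Cond (test) and Apply (update); the Enactor would
# launch Advance/Filter kernels over them on the GPU.

class BFSProblem:
    def __init__(self, graph, source):
        self.graph = graph             # GraphData
        self.labels = {source: 0}      # AppData

class BFSFunctor:
    @staticmethod
    def cond(problem, src, dst):
        """Cond: keep the edge only if dst is unvisited."""
        return dst not in problem.labels

    @staticmethod
    def apply(problem, src, dst):
        """Apply: set dst's label from src's label."""
        problem.labels[dst] = problem.labels[src] + 1

def advance(problem, frontier, functor):
    """Sequential stand-in for the Advance operator kernel."""
    out = []
    for src in frontier:
        for dst in problem.graph[src]:
            if functor.cond(problem, src, dst):
                functor.apply(problem, src, dst)
                out.append(dst)
    return out

problem = BFSProblem({0: [1, 2], 1: [2], 2: []}, source=0)
frontier = advance(problem, [0], BFSFunctor)   # -> [1, 2]
```

The point of the split is that application writers supply only the Problem data and the Cond/Apply pair, while the library owns traversal and load balancing.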
Performance Evaluation
∙ 10x better than BGL and PowerGraph;
∙ Always better than other programmable GPU libraries;
∙ On par with Ligra and hardwired GPU implementations.
Open Questions:
∙ What is the right programming model? (expressivity, mutable graphs, and a performance model)
∙ How to expand the current framework? (to Graph BLAS, multi-node GPUs, and out-of-core)
Open Questions:
What is the right programming model for graph processing on the GPU?
∙ Gunrock: Traversal:Advance (enumerate neighbors, load balancing, compute new frontier) → Compute (update label values) → Traversal:Filter (mark valid, compact input frontier to output frontier)
∙ PowerGraph: Gather + Apply, Scatter, with vertex-cut
∙ Pregel: GetValue, MutableValue, SendMsgTo, GetOutEdgeIterator, VoteToHalt
∙ Ligra: EdgeMap (including Update), VertexMap (including Reset)
∙ Medusa: ELIST, Combiner, VERTEX
Open Questions:
Expressivity: Does the current model cover all operations? How difficult is it to express an operation using the current model?
Primitive | Library    | Description                            | Gunrock Implementation
Filter    | HelP       | Remove elements from frontier          | Filtering
FS        | HelP       | Form Supervertex                       | Filtering+Sort+Reduce+Scan+Advance+Filtering+Sort
ANV       | HelP       | Aggregating Neighbor Values            | Advance
LUV       | HelP       | Local Update of Vertices               | Filtering
UVUOV     | HelP       | Update Vertices Using One Other Vertex | Filtering
AGV       | HelP       | Aggregate Global Value                 | Filtering
Gather    | PowerGraph | Aggregating Neighbor Values            | Advance
Apply     | PowerGraph | Local Update of Vertices               | Filtering
Scatter   | PowerGraph | Update Neighbor Values                 | Advance
Open Questions:
Expressivity: How to support mutable graphs? Two sources of mutability:
Algorithm: MST, clustering, mesh refinement, etc. need new operators such as mergeEdge, formSuperVertices, reshapeSubgraph, ...
Data: Incremental computation of node ranking or betweenness centrality in time-series graphs. New users and new links in Twitter’s social graph appear constantly. How do we support that?
Open Questions:
Performance Model: How to build a performance model that helps us improve the current programming model?
∙ Runtime as a function of # iterations;
∙ Runtime as a function of {edges, vertices, etc.} touched/traversed;
∙ Runtime as a function of graph parameters.
Open Questions:
Performance Model: How to build a performance model that helps us improve the current programming model?
∙ How efficient is the programming model at exploiting parallelism?
∙ How does performance change as parallelism increases?
∙ How well do we do on low-level GPU performance metrics (e.g., degree of memory coalescing, branch coherence)?
∙ How much computational work/memory bandwidth do we incur?
∙ How to shift workload from memory access to computation?
Open Questions:
Graph BLAS style: How to fit in?
Sparse matrix–sparse vector multiplication (SpMSpV): linear combination of the columns of the matrix specified by the nonzero elements of the sparse vector.
[Figure: SpMSpV (C = A × B) with a sparse accumulator (SPA): gather, then scatter/accumulate. Buluc and Gilbert, arXiv:1109.3739]
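The column-combination view of SpMSpV can be sketched with a dictionary playing the role of the sparse accumulator (SPA); column-major (CSC-style) storage of A is assumed:

```python
# SpMSpV sketch: y = A * x where A is stored by columns ({col: {row: value}})
# and x is sparse ({index: value}). Only the columns of A selected by the
# nonzeros of x are touched; a dict serves as the sparse accumulator (SPA).

def spmspv(A_cols, x):
    spa = {}
    for j, xj in x.items():                       # gather the selected columns
        for i, aij in A_cols.get(j, {}).items():
            spa[i] = spa.get(i, 0) + aij * xj     # scatter/accumulate
    return spa

# In BFS-as-linear-algebra, A is the (transposed) adjacency matrix and x the
# frontier; the nonzeros of y form the next frontier.
A_cols = {0: {1: 1, 2: 1}, 1: {3: 1}, 2: {3: 1}}
x = {0: 1}
# spmspv(A_cols, x) -> {1: 1, 2: 1}
```

Fitting Gunrock in would mean mapping Advance to this gather/scatter-accumulate pattern and Filter to masking the output vector.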
Open Questions:
Beyond a single GPU: How to utilize more computing power and larger memory space?
∙ Single-node multi-GPU
∙ Out-of-core
∙ Multi-node GPUs
Open Questions:
Single-Node Multi-GPU
Approach:
∙ Partition into subgraphs; duplicate remote nodes
∙ Multithreaded host program with multi-stream support
∙ Reuse single-node code; partitioner as a plug-in
Issues:
∙ Best partitioner yet to be found
∙ Runtime bounded by diameter and # of iterations
∙ Scaling factor drops?
Open Questions:
Single-Node Multi-GPU
Results:
∙ BFS scales across multiple GPUs when #iterations is small (<10), #edges is high (>~100M), and average degree is high (>80). For the best case, 3.7x speedup using 6 GPUs (kron_n21).
∙ SSSP: 1.3x speedup using 2 GPUs; BC: 2.7x speedup using 4 GPUs.
Open Questions:
Out-of-Core
Approach:
∙ Borrow the partitioner + single-node Gunrock pattern
∙ Overlap data movement with computation
Issue:
∙ Communication cost between data chunks would still take up the majority of the time.
Open Questions:
Multi-Node Multi-GPU
∙ Target: a scalable multi-node layer using MPI
∙ Approach: add network communication operations to the single-node multi-GPU version
∙ Issue: global partitioner and auxiliary arrays; make good use of NVLink
Acknowledgment
Gunrock team: Yuechao Pan, Yuduo Wu, Carl Yang, Andrew Davidson, Andy Riffel, and John D. Owens
Royal Caliber team: Erich Elsen and Vishal Vaidyanathan
NVIDIA: Duane Merrill, Sean Baxter, the amazing GPU cluster, and all other general technical support
DARPA: XDATA Program, Eric Whyne of Data Tactics Corporation, and Dr. Christopher White
Questions?