Date post: 14-Oct-2020
high-performance graph processing programming model on the gpu Yangzihao Wang January 29, 2015 University of California, Davis
Transcript
Page 1: High-Performance Graph Processing Programming Model on the … · high-performance graph processing programming model on the gpu Yangzihao Wang January 29, 2015 University of California, Davis

High-Performance Graph Processing Programming Model on the GPU

Yangzihao Wang
January 29, 2015

University of California, Davis

Topics

Context: Related work on parallel graph processing

Current: Design and implementation of Gunrock

Future: Research problems and next steps

1

Parallel Large Graph Analytics

Single-node CPU | Distributed CPU | GPU Hardwired | GPU Library

The trade-off between programmability and performance

2

Challenges for Graph Processing on the GPU

∙ The irregularity of data access/control flow

∙ The complexity of programming GPUs

3

Goal of Gunrock

Deliver the performance of GPU hardwired graph primitives with a high-level programming model that allows programmers to quickly develop new graph primitives.

4

Preliminaries

A graph is a tuple G = (V, E, w_e, w_v) comprising a set of vertices V together with a set of edges E, where E ⊆ V × V; w_e and w_v are the weights attached to edges and vertices.

Our primary working set is a frontier. A vertex frontier is a subset of vertices U ⊆ V, and an edge frontier is a subset of edges I ⊆ E.

5
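The definitions above can be made concrete with a small data structure. The sketch below assumes a compressed sparse row (CSR) layout, which the slide does not specify; the names CsrGraph and build_csr are hypothetical and the vertex weights are placeholders.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class CsrGraph:
    """Directed graph G = (V, E, w_e, w_v) in CSR form (an assumed layout)."""
    row_offsets: List[int]       # length |V| + 1
    col_indices: List[int]       # length |E|, destination of each edge
    edge_weights: List[float]    # w_e, one value per edge
    vertex_weights: List[float]  # w_v, one value per vertex

    def out_edges(self, v: int) -> Iterator[Tuple[int, int]]:
        """Yield (edge_id, destination) for every edge leaving v."""
        for e in range(self.row_offsets[v], self.row_offsets[v + 1]):
            yield e, self.col_indices[e]


def build_csr(num_vertices: int,
              edges: List[Tuple[int, int, float]]) -> CsrGraph:
    """Build a CSR graph from (src, dst, weight) triples."""
    ordered = sorted(edges, key=lambda edge: edge[0])  # group edges by source
    row_offsets = [0] * (num_vertices + 1)
    for src, _, _ in ordered:
        row_offsets[src + 1] += 1
    for v in range(num_vertices):                      # prefix sum of degrees
        row_offsets[v + 1] += row_offsets[v]
    return CsrGraph(row_offsets,
                    [dst for _, dst, _ in ordered],
                    [w for _, _, w in ordered],
                    [1.0] * num_vertices)              # placeholder w_v


graph = build_csr(4, [(0, 1, 2.0), (0, 2, 5.0), (1, 3, 1.0), (2, 3, 1.0)])
vertex_frontier = [0]                                  # U, a subset of V
edge_frontier = [e for v in vertex_frontier            # I, a subset of E
                 for e, _ in graph.out_edges(v)]
```

Here the vertex frontier holds vertex ids and the edge frontier holds edge ids, so both are plain index arrays over the same graph.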

Methods: Build High-Level Abstraction

Most graph algorithms have two major operations:

Traverse: Updating a frontier by traversing the graph or subsetting the current frontier.

Compute: Doing computation on edges or nodes.

6

Methods: Build High-Level Abstraction

Two ways to traverse:

Advance: Generates a new frontier by visiting the neighbors of the current frontier.

Filter: Chooses a subset of the current frontier based on programmer-specified criteria.

[Figure: example graph with vertices 1 to 6 and edges e1 to e9]

7
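A minimal sequential sketch of the two traversal operators follows. The adjacency lists are hypothetical (the edges of the slide's figure are not recoverable from the transcript), and the function names advance and filter_frontier are chosen for illustration only.

```python
from typing import Callable, Dict, List

# Hypothetical 6-vertex graph; not the one drawn on the slide.
graph: Dict[int, List[int]] = {
    1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [6], 6: [],
}


def advance(adjacency: Dict[int, List[int]], frontier: List[int]) -> List[int]:
    """Visit the neighbors of every frontier vertex; duplicates are kept."""
    return [dst for src in frontier for dst in adjacency.get(src, [])]


def filter_frontier(frontier: List[int],
                    keep: Callable[[int], bool]) -> List[int]:
    """Keep only the frontier elements that satisfy the given criterion."""
    return [v for v in frontier if keep(v)]


first_hop = advance(graph, [1])
second_hop = advance(graph, first_hop)   # vertex 4 is reached twice

seen = set()


def first_visit(v: int) -> bool:
    """Criterion: accept a vertex only the first time it appears."""
    if v in seen:
        return False
    seen.add(v)
    return True


unique_second_hop = filter_frontier(second_hop, first_visit)
```

Advance may emit the same vertex several times, which is why a Filter step with a "first visit" criterion commonly follows it.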

Methods: Build High-Level Abstraction

BFS: Advance (update label value), then Filter (remove redundant).

SSSP: Advance (update label value), then Filter (remove redundant), then Near/Far Pile.

BC: Advance (accumulate sigma value), then Filter (remove redundant), then Advance (compute BC value).

CC: Filter (for e=(v1,v2), assign c[v1] to c[v2]; remove e when c[v1]==c[v2]), then Filter (for v, assign c[v] to c[c[v]]; remove v when c[v]==c[c[v]]).

PR: Advance (distribute PR value to neighbors), then Filter (update PR value; remove when PR value converges).

Legend: operator names are traversal steps; the text in parentheses is the computation attached to each step.

8
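The CC entry can be sketched as two nested filters: an edge filter that hooks component labels and a vertex filter that performs pointer jumping. This is a sequential illustration, not the GPU implementation; hooking the larger label onto the smaller one is an assumption added here to keep the loop monotone, and connected_components is a hypothetical name.

```python
from typing import List, Tuple


def connected_components(num_vertices: int,
                         edges: List[Tuple[int, int]]) -> List[int]:
    """Label each vertex with the smallest vertex id in its component."""
    c = list(range(num_vertices))          # c[v]: current component label
    edge_frontier = list(edges)
    while edge_frontier:
        # Edge filter (hooking): merge the labels at both ends of each edge.
        for v1, v2 in edge_frontier:
            lo, hi = sorted((c[v1], c[v2]))
            if lo != hi:
                c[hi] = min(c[hi], lo)     # assumption: hook toward smaller id
        # Vertex filter (pointer jumping): assign c[c[v]] to c[v] and
        # remove v from the frontier once c[v] == c[c[v]].
        vertex_frontier = list(range(num_vertices))
        while vertex_frontier:
            survivors = []
            for v in vertex_frontier:
                if c[v] != c[c[v]]:
                    c[v] = c[c[v]]
                    survivors.append(v)
            vertex_frontier = survivors
        # Remove e when c[v1] == c[v2].
        edge_frontier = [(v1, v2) for v1, v2 in edge_frontier
                         if c[v1] != c[v2]]
    return c


labels_a = connected_components(6, [(0, 1), (1, 2), (3, 4)])
labels_b = connected_components(5, [(4, 3), (3, 2), (2, 1)])
```

The algorithm stops when the edge frontier is empty, that is, when every edge connects two vertices that already carry the same label.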

Methods: Develop Low-Level Optimization

∙ Workload Mapping for Advance

∙ Improving Work Efficiency

∙ Primitive-Specific Optimization

9

Methods: Develop Low-Level Optimization

Figure: Merrill et al. PPoPP’12

[Figure: assignment of neighbor lists to single threads (t0 ... tn), warps (warp0 ... warp31), and blocks (Block0 ... Block255)]

∙ Per-thread Advance of small neighbor lists;

∙ Warp cooperative Advance of medium neighbor lists;

∙ Block cooperative Advance of large neighbor lists.

10
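The three granularities can be illustrated by binning frontier vertices by neighbor-list length. The thresholds below are assumptions made for this sketch: 32 is the warp size on NVIDIA GPUs, while the block size of 256 threads and the exact cutoffs are hypothetical, as is the name classify_frontier.

```python
from typing import Dict, List

WARP_SIZE = 32     # threads per warp on NVIDIA GPUs
BLOCK_SIZE = 256   # hypothetical threads per block


def classify_frontier(row_offsets: List[int],
                      frontier: List[int]) -> Dict[str, List[int]]:
    """Bin frontier vertices by out-degree, read from CSR row offsets."""
    bins: Dict[str, List[int]] = {"thread": [], "warp": [], "block": []}
    for v in frontier:
        degree = row_offsets[v + 1] - row_offsets[v]
        if degree > BLOCK_SIZE:
            bins["block"].append(v)    # large list: whole block cooperates
        elif degree > WARP_SIZE:
            bins["warp"].append(v)     # medium list: one warp cooperates
        else:
            bins["thread"].append(v)   # small list: a single thread
    return bins


# Three vertices with degrees 3, 100, and 1000.
offsets = [0, 3, 103, 1103]
bins = classify_frontier(offsets, [0, 1, 2])
```

On a GPU the binning and the expansion happen inside one kernel; the host-side loop here only shows which strategy each vertex would receive.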

Methods: Develop Low-Level Optimization

[Figure: push-based vs. pull-based Advance on an example graph whose vertices are marked label=0, label=1, or label=? (not yet labeled)]

Push-based Advance: Input frontier: v1, v2, v3. Explored edges (gray ones are failures): 0, 4, 4, 3, 4, 5, from v1, v2, v3. Final output frontier: 4, 5.

Pull-based Advance: Input queue: v4, v5, v6, v7. Frontier bitmap: 01110000. Explored edges (blue ones are valid ones): 1, 3, 4, 5, 5, from v4, v5, v6, v7. Final output frontier: 4, 5.

11
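The difference between the two directions can be sketched on a small undirected graph. The adjacency lists below are hypothetical and only loosely follow the slide's example; the function names are illustrative, and the code is sequential rather than a GPU kernel.

```python
from typing import List, Optional

# Hypothetical undirected graph: vertex 0 is the source, 1 to 3 are at depth 1.
neighbors = [[1, 2, 3], [0], [0, 4], [0, 4, 5], [2, 3], [3]]


def push_advance(adjacency: List[List[int]], labels: List[Optional[int]],
                 frontier: List[int], depth: int) -> List[int]:
    """Each frontier vertex pushes to its unlabeled neighbors."""
    output = []
    for src in frontier:
        for dst in adjacency[src]:
            if labels[dst] is None:      # otherwise the edge is a failure
                labels[dst] = depth
                output.append(dst)
    return output


def pull_advance(adjacency: List[List[int]], labels: List[Optional[int]],
                 frontier_bitmap: List[bool], depth: int) -> List[int]:
    """Each unlabeled vertex looks for any neighbor in the frontier bitmap."""
    output = []
    for v in range(len(labels)):
        if labels[v] is not None:
            continue
        for parent in adjacency[v]:
            if frontier_bitmap[parent]:  # first valid edge ends the search
                labels[v] = depth
                output.append(v)
                break
    return output


push_labels = [0, 1, 1, 1, None, None]
push_result = push_advance(neighbors, push_labels, [1, 2, 3], depth=2)

pull_labels = [0, 1, 1, 1, None, None]
bitmap = [False, True, True, True, False, False]
pull_result = pull_advance(neighbors, pull_labels, bitmap, depth=2)
```

Both directions produce the same output frontier. Push explores every edge leaving the frontier, while pull stops at the first valid parent, which tends to pay off when the frontier is large.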

Methods: Develop Low-Level Optimization

Other optimizations: Priority Queues, Idempotent Operation, Optimizing Filter, Output Frontier Storage, etc.

12

Software Framework

Application Code: Enactor (CUDA Kernel Entry); Functor (Cond, Apply); Problem (GraphData, AppData)

Library Code: Operators (Advance, Filter, Priority Queue); Traversal Optimizations (Dynamic Cooperative, Load-balanced Partitioning, Direction Optimal); Utilities

13
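The split between application code and library code can be sketched for BFS as follows. This is a sequential Python illustration of the roles named on the slide (Problem, Functor with Cond and Apply, Enactor, Advance); every class, method, and function name is hypothetical and does not reproduce Gunrock's actual CUDA interface.

```python
from typing import List, Optional


class BfsProblem:
    """Problem: graph data plus application data (the BFS labels)."""

    def __init__(self, neighbors: List[List[int]], source: int) -> None:
        self.neighbors = neighbors
        self.labels: List[Optional[int]] = [None] * len(neighbors)
        self.labels[source] = 0


class BfsFunctor:
    """Functor: Cond decides whether an edge passes, Apply does the update."""

    @staticmethod
    def cond_edge(problem: BfsProblem, src: int, dst: int) -> bool:
        return problem.labels[dst] is None

    @staticmethod
    def apply_edge(problem: BfsProblem, src: int, dst: int) -> None:
        problem.labels[dst] = problem.labels[src] + 1


def advance(problem: BfsProblem, functor, frontier: List[int]) -> List[int]:
    """Library operator: expand the frontier, calling the functor per edge."""
    output = []
    for src in frontier:
        for dst in problem.neighbors[src]:
            if functor.cond_edge(problem, src, dst):
                functor.apply_edge(problem, src, dst)
                output.append(dst)
    return output


def enact(problem: BfsProblem, functor, source: int) -> List[Optional[int]]:
    """Enactor: run operators until the frontier is empty."""
    frontier = [source]
    while frontier:
        frontier = advance(problem, functor, frontier)
    return problem.labels


problem = BfsProblem([[1, 2, 3], [0], [0, 4], [0, 4, 5], [2, 3], [3]], source=0)
depths = enact(problem, BfsFunctor, source=0)
```

Only BfsProblem, BfsFunctor, and enact belong to the application in this sketch; advance stands in for the library operator, so a new primitive would swap the functor without touching the traversal code.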

Performance Evaluation

∙ 10x better than BGL and PowerGraph;
∙ Always better than other programmable GPU libraries;
∙ On par with Ligra and hardwired GPU implementations.

14

Open Questions:

∙ What is the right programming model? (Expressivity, Mutable Graph, and Performance Model)

∙ How to expand the current framework? (To Graph BLAS, Multi-Node GPUs, and Out-of-Core)

15

Open Questions:

What is the right programming model for graph processing on the GPU?

[Figure: pipeline from input frontier to output frontier with the steps Enumerate Neighbors, Compute New Frontier, Load Balancing, Update Label Values, Mark Valid, and Compact, with each framework's operations mapped onto these steps]

Gunrock: Traversal:Advance, Compute, Traversal:Filter

PowerGraph: Scatter, Vertex-Cut, Gather+Apply, Scatter

Pregel: GetValue, MutableValue, SendMsgTo, GetOutEdgeIterator, VoteToHalt

Ligra: EdgeMap (including Update), VertexMap (including Reset)

Medusa: ELIST, Combiner, VERTEX

16

Open Questions:

Expressivity: Does the current model cover all operations? How difficult is it to express an operation using the current model?

Primitive | Library | Description | Gunrock Implementation
Filter | Help | Remove elements from filter | Filtering
FS | Help | Form Supervertex | Filtering+Sort+Reduce+Scan+Advance+Filtering+Sort
ANV | Help | Aggregating Neighbor Values | Advance
LUV | Help | Local Update of Vertices | Filtering
UVUOV | Help | Update Vertices Using One Other Vertex | Filtering
AGV | Help | Aggregate Global Value | Filtering
Gather | PowerGraph | Aggregating Neighbor Values | Advance
Apply | PowerGraph | Local Update of Vertices | Filtering
Scatter | PowerGraph | Update Neighbor Values | Advance

17

Open Questions:

Expressivity: How to support mutable graphs? Two sources of mutability:

Algorithm: MST, Clustering, Mesh Refinement, etc. need new operators such as mergeEdge, formSuperVertices, reshapeSubgraph, ...

Data: Incremental computation of node ranking or betweenness centrality in time-series graphs. New users and new links in Twitter’s social graph appear constantly. How do we support that?

18

Open Questions:

Performance Model: How to build a performance model to help us improve the current programming model?

∙ Runtime as a function of # iterations;
∙ Runtime as a function of {edges, vertices, etc.} touched/traversed;
∙ Runtime as a function of graph parameters.

19

Open Questions:

Performance Model: How to build a performance model to help us improve the current programming model?

∙ How efficient is the programming model at exploiting parallelism?
∙ How does performance change as parallelism increases?
∙ How well do we do on low-level GPU performance metrics (such as degree of memory coalescing, branch coherence, etc.)?
∙ How much computational work/memory bandwidth do we incur?
∙ How to shift workload from memory access to computation?

20

Open Questions:

Graph BLAS style: How to fit in?

Sparse matrix sparse vector multiplication (SpMSpV): Linear combination of columns specified by nonzero elements of the sparse vector.

[Figure: C = A x B computed through a sparse accumulator (SPA), with gather and scatter/accumulate steps]

Figure: Buluc and Gilbert, arXiv:1109.3739

21
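SpMSpV as described above can be sketched with a dictionary acting as the sparse accumulator (SPA). The column-major dictionary layout and the name spmspv are assumptions made for this illustration.

```python
from typing import Dict

SparseVector = Dict[int, float]              # index -> nonzero value
SparseColumns = Dict[int, Dict[int, float]]  # column -> {row: value}


def spmspv(columns: SparseColumns, x: SparseVector) -> SparseVector:
    """Compute y = A x as a linear combination of the columns of A
    selected by the nonzero entries of x."""
    spa: SparseVector = {}                             # sparse accumulator
    for j, x_j in x.items():
        for i, a_ij in columns.get(j, {}).items():     # gather column j
            spa[i] = spa.get(i, 0.0) + a_ij * x_j      # scatter/accumulate
    return spa


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]] stored by column, and x = (1, 0, 2).
a_columns = {0: {0: 1.0, 2: 4.0}, 1: {1: 3.0}, 2: {0: 2.0, 2: 5.0}}
x = {0: 1.0, 2: 2.0}
y = spmspv(a_columns, x)
```

Only the columns matching nonzeros of x are touched. If x is read as a frontier, that access pattern resembles an Advance step; this reading is an interpretation and not something the slide states.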

Open Questions:

Beyond A Single GPU: How to utilize more computing power and larger memory space?

∙ Single-Node Multi-GPUs
∙ Out-of-Core
∙ Multi-Node GPUs

22

Open Questions:

Single-Node Multi-GPUs

Approach:

∙ Partition into subgraphs, duplicate remote nodes
∙ Multithreaded host program with multistream support
∙ Reuse single-node code, partitioner as a plug-in

Issues:

∙ Best partitioner yet to be found
∙ Runtime bounded by diameter and # of iterations
∙ Scaling factor drops?

23
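The partitioning step in the approach above can be sketched as follows. The modulo partitioner is a placeholder, since the slide itself notes that the best partitioner is yet to be found; partition_graph and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def partition_graph(neighbors: List[List[int]],
                    num_parts: int) -> List[Dict[str, object]]:
    """Split vertices across parts; each part keeps copies of the remote
    vertices that its local vertices point to."""
    def owner(v: int) -> int:
        return v % num_parts            # placeholder partitioner (plug-in)

    parts = []
    for p in range(num_parts):
        local = [v for v in range(len(neighbors)) if owner(v) == p]
        remote = sorted({dst for v in local for dst in neighbors[v]
                         if owner(dst) != p})   # duplicated remote nodes
        adjacency = {v: list(neighbors[v]) for v in local}
        parts.append({"local": local, "remote": remote,
                      "adjacency": adjacency})
    return parts


graph_neighbors = [[1, 2, 3], [0], [0, 4], [0, 4, 5], [2, 3], [3]]
parts = partition_graph(graph_neighbors, num_parts=2)
```

Because owner is an ordinary function, a different partitioner can be swapped in without changing the rest of the code, which mirrors the "partitioner as a plug-in" bullet.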

Open Questions:

Single-Node Multi-GPUs

Results:

∙ BFS scales across multiple GPUs when # of iterations is small (<10), # of edges is high (>~100M) and average degree is high (>80). For the best case, 3.7x speedup using 6 GPUs (kron_n21)

∙ SSSP 1.3x speedup using 2 GPUs, BC 2.7x speedup using 4 GPUs.

24
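The reported speedups can be turned into parallel efficiency, defined here as speedup divided by the number of GPUs. That definition is a standard one brought in for this worked example; the slides report only the speedups.

```python
# (speedup, number of GPUs) as reported on the slide.
results = {"BFS": (3.7, 6), "SSSP": (1.3, 2), "BC": (2.7, 4)}

# Parallel efficiency = speedup / number of GPUs (1.0 would be ideal scaling).
efficiency = {name: speedup / gpus for name, (speedup, gpus) in results.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.3f}")
```

Under this definition every primitive stays below 0.7, that is, each added GPU contributes well under a full GPU's worth of speedup, which is consistent with the "Scaling factor drops?" concern raised earlier.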

Open Questions:

Out-of-Core

Approach:

∙ Borrow the Partitioner + Single-Node Gunrock pattern
∙ Overlap data movement with computation

Issue:

∙ Communication cost between each data chunk would still take up the majority of the time.

25

Open Questions:

Multi-Node Multi-GPUs

∙ Target: A scalable multi-node layer using MPI
∙ Approach: Add network communication operation into the single-node multi-GPU version
∙ Issue: Global partitioner and auxiliary arrays, make good use of NVLink.

26

Acknowledgment

Gunrock team: Yuechao Pan, Yuduo Wu, Carl Yang, Andrew Davidson, Andy Riffel and John D. Owens

Royal Caliber team: Erich Elsen and Vishal Vaidyanathan

NVIDIA: Duane Merrill, Sean Baxter, the amazing GPU cluster and all other general technical support

DARPA: XDATA Program, Eric Whyne of Data Tactics Corporation, and Dr. Christopher White.

27

Questions?

28

