Page 1: If you were plowing a field,  which would you rather use?

If you were plowing a field, which would you rather use?

Two oxen, or 1024 chickens?

(Attributed to S. Cray)

Page 2: If you were plowing a field,  which would you rather use?

Our ‘field’ to plow: graph processing

|V| = 1.4B, |E| = 6.6B

Page 3: If you were plowing a field,  which would you rather use?

Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, Matei Ripeanu

NetSysLab, The University of British Columbia

http://netsyslab.ece.ubc.ca

Page 4: If you were plowing a field,  which would you rather use?

Graph Processing: The Challenges

- Poor locality
- Data-dependent memory access patterns
- Low compute-to-memory-access ratio
- Large memory footprint (>128GB)
- Varying degrees of parallelism (both intra- and inter-stage)

CPUs: caches + summary data structures

Page 5: If you were plowing a field,  which would you rather use?

Graph Processing: The GPU Opportunity

Same challenges: poor locality, data-dependent memory access patterns, low compute-to-memory-access ratio, large memory footprint (>128GB), varying degrees of parallelism (both intra- and inter-stage).

- CPUs: caches + summary data structures, large memory (>128GB)
- GPUs: massive hardware multithreading, but only 6GB of memory!

Opportunity: assemble a heterogeneous platform.

Page 6: If you were plowing a field,  which would you rather use?

Motivating Question

Can we efficiently use hybrid systems for large-scale graph processing?

YES WE CAN! 2x speedup (8 billion edges)

Page 7: If you were plowing a field,  which would you rather use?

Methodology

- Performance model: predicts speedup; intuitive
- Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations
- Evaluation: predicted vs. achieved; hybrid vs. symmetric

Page 8: If you were plowing a field,  which would you rather use?

The Performance Model (I)

[Diagram: a host (cores Core0 .. Corem with system memory) attached to GPUs (each with its own cores and device memory); the graph is split into a host partition Gcpu and a GPU partition Ggpu.]

Goal: predict the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host).

  α = |Ecpu| / |E|          fraction of edges assigned to the host
  β = |Eboundary| / |E|     fraction of boundary (cross-partition) edges
  rcpu = |E| / T(G)         host processing rate, in edges per second
  c = b / sizeof(m)         communication rate: bus bandwidth b over per-edge state m

  Speedup = 1 / (α + β · rcpu / c)
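The model can be sketched in a few lines of Python (a hedged illustration: it assumes the hybrid runtime is the host's compute on its α-fraction plus the β-fraction communicated at rate c, with the GPU partition never the bottleneck; the function and argument names are mine, not Totem's):

```python
def predicted_speedup(alpha, beta, r_cpu, c):
    """Predicted hybrid-over-host-only speedup.

    alpha: fraction of edges kept on the host (|Ecpu| / |E|)
    beta:  fraction of boundary edges (|Eboundary| / |E|)
    r_cpu: host processing rate, edges per second
    c:     communication rate, edges per second (bus bandwidth / per-edge state)

    Host-only time ~ |E|/r_cpu; hybrid time ~ alpha*|E|/r_cpu + beta*|E|/c.
    """
    return 1.0 / (alpha + beta * r_cpu / c)

# Numbers used on the next slide: beta = 20%, r_cpu = 0.5 BEPS, c = 1 BEPS.
# Offloading half of the edges (alpha = 0.5) predicts:
print(round(predicted_speedup(0.5, 0.2, 0.5e9, 1.0e9), 2))   # 1.67
```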

Page 9: If you were plowing a field,  which would you rather use?

The Performance Model (II)

[Same host + GPU diagram; the graph is split into a host partition Gcpu and a GPU partition Ggpu.]

  α = |Ecpu| / |E|
  β = |Eboundary| / |E|
  rcpu = |E| / T(G)
  c = b / sizeof(m)

  Speedup = 1 / (α + β · rcpu / c)

Plugging in representative numbers:
- Assume a PCI-E bus, b ≈ 4 GB/sec, and per-edge state m = 4 bytes => c = 1 billion EPS
- rcpu = 0.5 BEPS: best reported single-node BFS performance [Agarwal, V. 2010] (|V| = 32M, |E| = 1B)
- β = 20%: worst case (e.g., bipartite graph)

It is beneficial to process the graph on a hybrid system if communication overhead is low.

Page 10: If you were plowing a field,  which would you rather use?

10

Totem: Programming Model

Bulk Synchronous Parallel:
- Rounds of computation and communication phases
- Updates to remote vertices are delivered in the next round
- Partitions vote to terminate execution
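The three bullets above can be sketched as a toy BSP loop (my own illustration of the model, not Totem's actual API; it propagates a maximum value between two partitions):

```python
class Partition:
    """One BSP partition: local vertex values plus message buffers."""
    def __init__(self, values, remote_edges):
        self.values = values              # vertex id -> value (local state)
        self.remote_edges = remote_edges  # (local vid, target partition, remote vid)
        self.inbox = []                   # updates delivered last round
        self.outbox = {}                  # target partition -> [(vid, value)]
        self.started = False

    def compute(self):
        """Apply inbox updates, stage outgoing ones; True = vote to terminate."""
        changed = set()
        for vid, val in self.inbox:
            if val > self.values.get(vid, float("-inf")):
                self.values[vid] = val
                changed.add(vid)
        self.inbox = []
        for vid, tgt, rvid in self.remote_edges:
            if vid in changed or not self.started:
                self.outbox.setdefault(tgt, []).append((rvid, self.values[vid]))
        self.started = True
        return not self.outbox

def bsp_run(partitions):
    """Rounds of compute + communicate; updates arrive the *next* round."""
    while True:
        votes = [p.compute() for p in partitions]
        if all(votes):                    # partitions vote to terminate
            break
        for p in partitions:              # communication phase
            for tgt, msgs in p.outbox.items():
                partitions[tgt].inbox.extend(msgs)
            p.outbox = {}

p0 = Partition({0: 5}, [(0, 1, 1)])   # vertex 0 has a boundary edge to vertex 1
p1 = Partition({1: 3}, [])
bsp_run([p0, p1])
print(p1.values[1])   # 5 — sent in round 1, applied in round 2
```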

Page 11: If you were plowing a field,  which would you rather use?

Totem: A BSP-based Engine

[Diagram: an example six-vertex graph (vertices 0-5) split into a CPU partition and a GPU partition; each partition stores its subgraph as vertex (V) and edge (E) arrays in compressed sparse row form, plus per-remote-partition outbox buffers and an inbox buffer.]

- Compressed sparse row representation
- Computation: kernel manipulates local state
- Updates to remote vertices aggregated locally
- Comm1: transfer outbox buffer to remote input buffer
- Comm2: merge with local state
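For readers unfamiliar with compressed sparse row (CSR), a minimal sketch of the V/E layout each partition uses (my own illustration, not Totem's exact data structures):

```python
def build_csr(num_vertices, edges):
    """Build CSR adjacency: E[V[i]:V[i+1]] holds vertex i's neighbors."""
    V = [0] * (num_vertices + 1)
    for src, _ in edges:
        V[src + 1] += 1
    for i in range(num_vertices):
        V[i + 1] += V[i]                 # prefix sum: per-vertex edge offsets
    E = [0] * len(edges)
    cursor = V[:]                        # next free slot for each vertex
    for src, dst in edges:
        E[cursor[src]] = dst
        cursor[src] += 1
    return V, E

def neighbors(V, E, v):
    return E[V[v]:V[v + 1]]

V, E = build_csr(4, [(0, 1), (0, 2), (1, 3), (3, 0)])
print(neighbors(V, E, 0))   # [1, 2]
```

Two flat arrays give the sequential, coalesced-friendly layout that both CPU caches and GPU memory controllers prefer over pointer-based adjacency lists.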

Page 12: If you were plowing a field,  which would you rather use?

The Aggregation Opportunity

[Chart: boundary edges before and after aggregation, random partitioning, |E| = 512 million.]

Real-world graphs are mostly scale-free: skewed degree distribution.

- Sparse graph: ~5x reduction
- Denser graph has better opportunity for aggregation: ~50x reduction

Page 13: If you were plowing a field,  which would you rather use?

Evaluation Setup

- Workload: R-MAT graphs, |V| = 32M, |E| = 1B, unless otherwise noted
- Algorithms: Breadth-first Search, PageRank
- Metrics: speedup compared to processing on the host only
- Testbed: host: dual-socket Intel Xeon with 16GB; GPU: Nvidia Tesla C2050 with 3GB

Page 14: If you were plowing a field,  which would you rather use?

Predicted vs. Achieved Speedup

[Chart: predicted vs. achieved speedup as a function of the portion of the graph offloaded to the GPU.]

- Linear speedup with respect to the offloaded part
- GPU partition fills GPU memory
- After aggregation, β = 2%; a low value is critical for BFS

Page 15: If you were plowing a field,  which would you rather use?

Breakdown of Execution Time

- PageRank is dominated by the compute phase
- Aggregation significantly reduced communication overhead
- GPU is > 5x faster than the host

Page 16: If you were plowing a field,  which would you rather use?

So far …

- Performance modeling: simple; useful for initial system provisioning
- Totem: generic graph processing framework; algorithm-agnostic optimizations
- Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS (traversed edges per second) on a dual-socket, dual-GPU system

But, random partitioning! Can we do better?

Page 17: If you were plowing a field,  which would you rather use?

Better partitioning strategies

The search space:
- Handles large (billion-edge scale) graphs
  o Low space and time complexity
  o Ideally, quasi-linear!
- Handles scale-free graphs well
- Minimizes the algorithm's execution time by reducing computation time
  o (rather than communication)

Page 18: If you were plowing a field,  which would you rather use?

The strategies we explore:
- HIGH: vertices with high degree left on the host
- LOW: vertices with low degree left on the host
- RAND: random

[Chart: the percentage of vertices placed on the CPU for a scale-28 R-MAT graph (|V|=256M, |E|=4B).]
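The three strategies amount to sorting (or shuffling) vertices by degree and cutting at the desired host fraction; a hedged sketch (the function and its signature are mine, not Totem's code):

```python
import random

def partition(degrees, cpu_fraction, strategy, seed=0):
    """Split vertex ids into (cpu, gpu) sets by one of HIGH/LOW/RAND.

    degrees: degrees[v] is the degree of vertex v
    cpu_fraction: fraction of vertices kept on the host
    strategy: 'HIGH' keeps highest-degree vertices on the host,
              'LOW' keeps lowest-degree, 'RAND' picks randomly.
    """
    order = list(range(len(degrees)))
    if strategy == "HIGH":
        order.sort(key=lambda v: -degrees[v])
    elif strategy == "LOW":
        order.sort(key=lambda v: degrees[v])
    else:  # RAND
        random.Random(seed).shuffle(order)
    k = int(len(order) * cpu_fraction)
    return set(order[:k]), set(order[k:])

cpu, gpu = partition([9, 1, 5, 3], 0.5, "HIGH")
print(sorted(cpu), sorted(gpu))   # [0, 2] [1, 3]
```

The sort keeps the complexity quasi-linear (O(|V| log |V|)), matching the search-space requirement on the previous slide.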

Page 19: If you were plowing a field,  which would you rather use?

Evaluation platform

                       Intel Nehalem        Fermi GPU
                       Xeon X5650           Tesla C2075
                       (2x sockets)         (2x GPUs)
  Core Frequency       2.67GHz              1.15GHz
  Num Cores (SMs)      6                    14
  HW-threads/Core      2                    48 warps (x32/warp)
  Last Level Cache     12MB                 2MB
  Main Memory          144GB                6GB
  Memory Bandwidth     32GB/sec             144GB/sec
  Total Power (TDP)    95W                  225W

Page 20: If you were plowing a field,  which would you rather use?

BFS performance

[Chart: BFS traversal rate for a scale-28 R-MAT graph (|V|=256M, |E|=4B).]

- 2x performance gain!
- LOW: no gain over random!
- Exploring the 75% data point

Page 21: If you were plowing a field,  which would you rather use?

BFS performance – more details

Host is the bottleneck in all cases!

Page 22: If you were plowing a field,  which would you rather use?

PageRank performance

[Chart: PageRank processing rate for a scale-28 R-MAT graph (|V|=256M, |E|=4B).]

- 25% performance gain!
- LOW: minimal gain over random!
- Better packing

Page 23: If you were plowing a field,  which would you rather use?

Small graphs (scale-25 R-MAT graphs: |V|=32M, |E|=512M)

[Charts: BFS and PageRank on the small graphs.]

- Intelligent partitioning provides benefits
- Key for performance: load balancing

Page 24: If you were plowing a field,  which would you rather use?

Uniform graphs (not scale-free)

[Charts: BFS on a scale-25 uniform graph (|V|=32M, |E|=512M) and on a scale-28 uniform graph.]

Hybrid techniques are not useful for uniform graphs.

Page 25: If you were plowing a field,  which would you rather use?

Scalability

- Graph size: R-MAT graphs, scale 25 to 29 (scale-29: |V|=512M, |E|=8B)
- Platform size: 1, 2, 4 sockets; 2 sockets + 2 GPUs

[Charts: BFS and PageRank scalability.]

Page 26: If you were plowing a field,  which would you rather use?

Power

- Normalizing by power (TDP – thermal design power)
- Metric: million TEPS / watt

[Charts: million TEPS/watt for BFS and PageRank across the platform configurations.]

Page 27: If you were plowing a field,  which would you rather use?

Conclusions

Q: Does it make sense to use a hybrid system?
A: Yes! (for large scale-free graphs)

Q: Can one design a processing engine for hybrid platforms that is both generic and efficient?
A: Yes.

Q: Are there near-linear-complexity partitioning strategies that enable higher performance?
A: Yes, partitioning strategies based on vertex connectivity provide better performance than random in all cases.

Q: Should one search for partitioning strategies that reduce the communication overheads (and hope for higher performance)?
A: No (for scale-free graphs).

Q: Which strategies work best?
A: It depends! Large graphs: shape the load. Small graphs: load balancing.

Page 28: If you were plowing a field,  which would you rather use?

If you were plowing a field, which would you rather use?
- Two oxen, or 1024 chickens?
- Both!

Page 29: If you were plowing a field,  which would you rather use?

Code available at: netsyslab.ece.ubc.ca

Papers:
• A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, PACT 2012
• On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, IPDPS 2013

Page 30: If you were plowing a field,  which would you rather use?

A golf course …

… a (nudist) beach

(… and 199 days of rain each year)

Networked Systems Laboratory (NetSysLab), University of British Columbia
