
Large-Scale Graph Processing on Emerging Storage Devices

Nima Elyasi¹, Changho Choi², Anand Sivasubramaniam¹

¹Pennsylvania State University
²Samsung Semiconductor Inc.

Graph Processing is Commonplace

Search Engines, Social Media, Recommendations and Ads, Maps and Navigation


Large-Scale Graph Processing Challenges

• Huge datasets and irregular accesses
• High cost of DRAM ($$$$) makes fully in-memory processing expensive
• External graph processing on cheap NVMe SSDs ($) is desirable
• But the SSD then receives fine-grained and random accesses


Fine-Grained Access in External Graph Processing

SSD page size and vertex accesses don't match!

- An SSD page is several kilobytes (4KB ~ 16KB), while a vertex value is several bytes, e.g., 4 bytes
- Irregular, fine-grained vertex updates are therefore detrimental to both performance and device endurance
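To put a number on the mismatch, a quick back-of-envelope check (plain arithmetic, not from the slides):

```python
page_bytes = 4096     # typical SSD page (slides: 4KB ~ 16KB)
vertex_bytes = 4      # one vertex value (slides: e.g., 4 bytes)

# Updating a single vertex value in place still forces a whole-page write,
# so the write amplification of one fine-grained update is roughly:
print(page_bytes / vertex_bytes)   # 1024.0 -> ~1000x more bytes written than needed
```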


Providing Perfect Sequentiality as a Remedy

• If vertex data could be stored in DRAM, fine-grained accesses would be less of an issue
• Instead, the prior external graph processing framework (GraFBoost, ISCA'18) maintains vertex data on the SSD
• It achieves perfect sequentiality by coalescing fine-grained accesses

Programming Model

Vertex-centric programming model:

- Iterative programming model
- Each vertex runs a user-defined program
- Sends updates to its neighbors along outgoing edges
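For illustration, a minimal vertex-centric kernel (a sketch with a hypothetical `Vertex` type and a PageRank-style update, not the authors' code):

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vid: int
    value: float = 0.0
    out_edges: list = field(default_factory=list)   # IDs of out-neighbors

def pagerank_vertex_program(vertex, incoming_updates, damping=0.85):
    """One vertex-centric step: combine the updates received along in-edges,
    recompute the vertex value, and emit (destination, update) pairs along
    the out-edges."""
    vertex.value = (1.0 - damping) + damping * sum(incoming_updates)
    contribution = vertex.value / max(len(vertex.out_edges), 1)
    return [(dst, contribution) for dst in vertex.out_edges]

v = Vertex(vid=0, out_edges=[3, 7])
print(pagerank_vertex_program(v, [0.2, 0.3]))   # [(3, 0.2875), (7, 0.2875)]
```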


Prior External Graph Processing -- GraFBoost

On-SSD layout: vertex data, index file, and edge file.

Processing the out-edges of V0 (V0 → Vx, V0 → Vy, V0 → Vz) generates key-value pairs with keys {Vx, Vy, Vz} and value {V0 value}: <Vx, V0 value>, <Vy, V0 value>, <Vz, V0 value>

GraFBoost sorts the key-value pairs in memory, logs them to the SSD, merges them, and updates the vertex list on the SSD.

Sang-Woo Jun, et al. GraFBoost: Using accelerated flash storage for external graph analytics, ISCA'18.
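A rough sketch of that sort-and-merge flow in Python (in-memory lists stand in for the SSD log; the function names are hypothetical, not GraFBoost's API):

```python
from itertools import groupby
from operator import itemgetter

def sort_reduce(update_pairs, reduce_fn):
    """GraFBoost-style sort-reduce sketch: sort <dst_vertex, value> pairs
    by destination, then merge all updates to the same vertex with a
    user-supplied reduction (e.g., sum for PageRank, min for BFS).
    The sorted output can then be applied to the on-SSD vertex list
    with purely sequential I/O."""
    update_pairs.sort(key=itemgetter(0))        # the O(|E| log |E|) step
    merged = []
    for dst, group in groupby(update_pairs, key=itemgetter(0)):
        acc = None
        for _, v in group:
            acc = v if acc is None else reduce_fn(acc, v)
        merged.append((dst, acc))
    return merged

# Updates generated from the out-edges of two source vertices:
updates = [(7, 0.2), (3, 0.2), (7, 0.5), (1, 0.5)]
print(sort_reduce(updates, lambda a, b: a + b))
# [(1, 0.5), (3, 0.2), (7, 0.7)]
```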

Computation Overhead of Sort!

• Up to 60% of execution time is spent sorting (web graph)
• Sort overhead is even higher for PageRank, which processes all vertices in each iteration and generates more updates

Current external graph processing pipeline:
Read from SSD (linear time, O(|E|)) → Sort in memory (O(|E|·log|E|)) → Write to SSD (linear time, O(|E|))


Scalability Issue

Current external graph processing pipeline:
Read from SSD (linear time, O(|E|)) → Sort in memory (O(|E|·log|E|)) → Write to SSD (linear time, O(|E|))

Assuming DRAM is "k" times faster than the SSD (e.g., k = 30):
when k < log(|E|), sorting becomes the bottleneck.

Instead, we propose a vertex partitioning that eliminates the sorting.
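To make the k < log(|E|) crossover above concrete, a back-of-envelope model (illustrative cost units, not from the slides):

```python
import math

# Cost model from the slide, in DRAM-operation units: moving |E| edge
# updates between SSD and memory costs |E| * k (one SSD access costs as
# much as k DRAM operations), while sorting them in memory costs
# |E| * log2(|E|) DRAM operations. Sorting dominates once log2(|E|) > k.
def sort_is_bottleneck(num_edges, k=30):
    return math.log2(num_edges) > k

for exp in (28, 30, 32, 34):   # graphs with 2^28 .. 2^34 edges
    print(f"|E| = 2^{exp}: sort dominates -> {sort_is_bottleneck(2 ** exp)}")
# 2^28, 2^30: False;  2^32, 2^34: True
```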

Partitioning Graph Data

Extensive prior efforts on partitioning graph data (FlashGraph, FAST'15; GraphChi, OSDI'12; Mosaic, EuroSys'17; PowerGraph, OSDI'12; GridGraph, USENIX ATC'15; GraphP, HPCA'18) are not well suited for fully external graph processing:

- They require all vertices to be present in main memory
- They do not decouple vertices and edges
- They need each partition to be completely present in cache or memory
- Otherwise they dramatically increase the number of partitions and incur high cross-partition communication


Instead, We Propose a Partitioning for Vertex Data

Reorganizing graph data so that the vertices associated with each partition fit in main memory.

[Figure: destination vertices are divided into partitions. For each partition (e.g., Partition 0), the source vertex data is an index of <vertex ID & value, offset> entries (Vertex A/Offset A, Vertex B/Offset B, Vertex C/Offset C), sorted by vertex ID, whose offsets point into the edge data holding each vertex's out-edges.]
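A minimal sketch of this per-partition layout (hypothetical structure, inferred from the figure): an index of source-vertex entries sorted by vertex ID, each holding the vertex value and an offset into a packed out-edge array.

```python
from dataclasses import dataclass

@dataclass
class SourceEntry:
    vid: int       # source vertex ID; entries are kept sorted by vid
    value: float   # "vertex ID & value" from the index
    offset: int    # offset into the packed out-edge array

@dataclass
class Partition:
    index: list       # list[SourceEntry], sorted by vertex ID
    edge_dst: list    # packed destination IDs of all out-edges

    def out_edges(self, i):
        """Out-edges of the i-th index entry, delimited by neighboring offsets."""
        end = self.index[i + 1].offset if i + 1 < len(self.index) else len(self.edge_dst)
        return self.edge_dst[self.index[i].offset:end]

# Vertex A (id 0) -> {5, 9}; Vertex B (id 1) -> {5}; Vertex C (id 2) -> {7}
p0 = Partition(
    index=[SourceEntry(0, 1.0, 0), SourceEntry(1, 1.0, 2), SourceEntry(2, 1.0, 3)],
    edge_dst=[5, 9, 5, 7],
)
print(p0.out_edges(0))   # [5, 9]
```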


Execution Flow

In each iteration, for each partition:

- Load the destination vertex data for the partition from SSD into memory
- Stream chunks of source vertex data (32MB each) from SSD, reading neighboring information via the partition's index
- Update the destination vertices in memory
- Generate mirror updates for other partitions, using the metadata for the current partition
- Write all updated vertex data back to SSD
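A simplified, single-threaded rendering of one partition's work within an iteration (hypothetical function and data shapes, not the paper's implementation):

```python
def process_partition(dst_values, source_chunks, out_edges, reduce_fn):
    """One partition of one iteration: the partition's destination vertex
    values stay resident in memory; source vertices arrive in chunks
    (32MB each on the slide) streamed sequentially from the SSD. Updates
    aimed at vertices owned by other partitions become mirror updates."""
    mirror_updates = []
    for chunk in source_chunks:                    # streamed from SSD
        for vid, value in chunk:                   # each source vertex in the chunk
            for dst in out_edges.get(vid, []):     # neighboring information
                if dst in dst_values:              # destination owned here
                    dst_values[dst] = reduce_fn(dst_values[dst], value)
                else:                              # destination mastered elsewhere
                    mirror_updates.append((dst, value))
    return dst_values, mirror_updates              # then written back / forwarded

# Toy run: this partition owns vertices {0, 1}; vertex 2 lives elsewhere.
dst = {0: 0.0, 1: 0.0}
chunks = [[(10, 1.0)], [(11, 2.0)]]
edges = {10: [0, 2], 11: [1]}
print(process_partition(dst, chunks, edges, lambda a, b: a + b))
# ({0: 1.0, 1: 2.0}, [(2, 1.0)])
```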

Updating Vertices in Memory

How to update the vertex list in main memory?

- Multiple threads update elements of the same vertex list, incurring high synchronization cost
- Remedy: updates are first staged in buffers (e.g., 1MB) before being applied to the vertex list, amortizing the synchronization (see the sketch below)
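A minimal sketch of the buffering idea, assuming per-thread staging buffers (the slide only shows a buffer in front of the vertex list; class and method names are hypothetical):

```python
import threading

class BufferedVertexUpdater:
    """Threads append updates to thread-local buffers and only take the
    shared lock when a buffer fills, amortizing synchronization cost."""

    def __init__(self, num_vertices, buffer_entries=128 * 1024):  # ~1MB of pairs
        self.values = [0.0] * num_vertices   # the shared vertex list
        self.lock = threading.Lock()
        self.buffer_entries = buffer_entries
        self.local = threading.local()       # per-thread staging buffer

    def update(self, vid, value):
        buf = getattr(self.local, "buf", None)
        if buf is None:
            buf = self.local.buf = []
        buf.append((vid, value))
        if len(buf) >= self.buffer_entries:
            self.flush()

    def flush(self):
        with self.lock:                      # one acquisition per full buffer
            for vid, value in getattr(self.local, "buf", []):
                self.values[vid] += value    # sum reduction assumed
        self.local.buf = []
```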

Updating Vertex Mirrors on Different Partitions

Required metadata for mirror updates, giving O(|V|) running time for updating mirrors:

- For each vertex: a source vertex table entry recording the partition ID(s) that hold mirrors of that vertex
- For each partition: a start index and an end index delimiting its mirror entries

[Figure: vertex values in partition i are propagated to the mirrors kept for partition 0 using this metadata.]
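One plausible encoding of that metadata and the O(|V|) propagation pass (a sketch; the exact layout is not specified on the slide):

```python
def propagate_mirrors(values, part_ids, mirror_buffers):
    """O(|V|) mirror propagation: one pass over the vertex list; each
    vertex's value is staged for every partition holding a mirror of it."""
    for v, val in enumerate(values):
        for p in part_ids[v]:                  # source vertex table: Part ID(s) per vertex
            mirror_buffers[p].append((v, val))

values = [0.3, 0.5, 0.9]                       # current vertex values in partition i
part_ids = [[1], [], [0, 1]]                   # v0 mirrored in partition 1; v2 in 0 and 1
buffers = {0: [], 1: []}                       # per-partition mirror regions (the slide's
                                               # start/end indices delimit these on SSD)
propagate_mirrors(values, part_ids, buffers)
print(buffers)   # {0: [(2, 0.9)], 1: [(0, 0.3), (2, 0.9)]}
```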

Experimental Setup

• Processor: Intel Xeon, 48 cores
• Memory: 256 GB DRAM
• SSD: two Samsung NVMe SSDs, 3.2 TB capacity in total, 6.4 GB/s sequential read speed
• Graph algorithms: PageRank and Breadth-First Search (BFS)
• Input graphs: Web, Twitter, Synthetic (Kron)

Performance Evaluation

• More than 2X improvement compared to GraFBoost
• Higher benefits for larger graphs (Web, Kron32)
• Around 10% space overhead for partitioning

Execution Time Breakdown (PageRank)

• Mirror updates account for 8-12% of execution time
• I/O is no longer the main contributor to total execution time

Concluding Remarks

• Large-scale graph processing suffers from random updates to vertices
• The state of the art provides perfect sequentiality by sorting all updates, at high computation cost
• We propose a partitioning for vertex data that eliminates the need for perfect sequentiality
• Future work: addressing time-evolving graphs

Thanks to the GraFBoost authors (Sang-Woo Jun)!

Thanks!