SC 2019
GraphM: An Efficient Storage System for High Throughput of Concurrent Graph Processing
Jin Zhao1, Yu Zhang1, Xiaofei Liao1, Ligang He2, Bingsheng He3, Hai Jin1, Haikun Liu1, Yicheng Chen1
1 SCTS/CGCL, Huazhong University of Science and Technology, China
2 University of Warwick, UK  3 National University of Singapore, Singapore
• Background and Challenges
• GraphM
• Experimental Results
• Conclusion
Outline
Social network Internet Road network
• Graph processing → rapidly growing demand in the real world
…
Graph Processing
• Graphs → ubiquitously preferred data representation
[Figure: number of concurrent jobs (y-axis) over time in hours (x-axis)]
Number of jobs traced on a social network
More than 30 jobs are
concurrently executed on the
same platform at the peak time
Concurrent Graph Processing
• Many concurrent graph processing jobs are often handled on the same underlying
graph (or its different snapshots) to provide various information for different products
K-means SSSP
…
PageRank
Graph Processing Framework
Shared Graph Data
Concurrent iterative graph processing jobs
Chaos, PowerGraph, GraphChi, GridGraph, …
Existing Graph Processing Systems
• Higher sequential memory bandwidth
• Better data locality
• Smaller memory consumption
• Fewer redundant data accesses
…
Mainly designed to efficiently handle a single graph processing job
[Figure: Jobs 1–3 each keep a private graph copy and job-specific data in memory and cache, while the graph data resides in secondary storage]
Execution of concurrent graph processing jobs on existing systems
Reason 1: A lot of redundant consumption of
memory resources and data access channels
[Figure (a): total memory usage (GB) for 1, 2, 4, and 8 concurrent jobs of PageRank, WCC, BFS, and SSSP]
[Figure (b): total last-level cache (LLC) misses (billions) for 1, 2, 4, and 8 concurrent jobs of PageRank, WCC, BFS, and SSSP]
Challenges: Redundant Data Access Overhead
The memory usage and the total amount of graph data loaded into the LLC
increase as more concurrent jobs are executed on the same platform
The LLC misses per instruction (LPI) and the average execution time of each job
increase as more concurrent jobs are executed on the same platform
Challenges: Redundant Data Access Overhead
[Figure (c): average LLC misses per instruction (LPI) for 1, 2, 4, and 8 concurrent jobs of PageRank, WCC, BFS, and SSSP]
[Figure (d): average execution time (seconds) for 1, 2, 4, and 8 concurrent jobs of PageRank, WCC, BFS, and SSSP]
Reason 2: Serious contention for storage
resources and data access channels
Observations
[Figure (a): percentage of the graph shared by more than 1, 2, 4, and 8 jobs over six hours]
[Figure (b): average data access times of the graph data over six hours]
Spatial Similarity
Most of the same graph can be shared by multiple
concurrent jobs during their traversals
Temporal Similarity
The same graph data may be repeatedly accessed
by different concurrent jobs over a period of time
Information traced on the social network
Motivations
➢ How to utilize the spatial/temporal similarities?
Spatial Similarity
Maintain a single copy of the same graph structure data in the storage to
serve the concurrent jobs
Temporal Similarity
Consolidate the accesses to the same graph structure data for
concurrent jobs
Motivations
➢ How to utilize the spatial/temporal similarities in a practical way?
Option #1: Design a graph processing framework
• New programming model for graph algorithms
• Requires changes in user-level applications
Option #2: Develop a graph storage system
• Several APIs for existing graph processing frameworks
• Can be transparent to application programmers
• Background and Challenges
• GraphM
• Experimental Results
• Conclusion
Outline
GraphM Explores…
• Traditional graph storage approach: each job j maintains its own data
D1 = (V1, E1, W1, S1)
D2 = (V2, E2, W2, S2)
…
DJ = (VJ, EJ, WJ, SJ)
• Main goals
- Most graph structure data G = (V, E, W) is the same for different concurrent graph processing jobs
- The storage of the same graph structure data (V, E, W), and the data access to it, can be shared by different concurrent graph processing jobs
GraphM Explores…
• The challenges of utilizing the similarities
- Concurrent jobs access the shared partitions in an
individual manner along different graph paths
- The processing time of each graph structure partition
varies across different jobs
• Our expectations
- Load the shared graph partitions along a common
order for concurrent jobs
- Take into account the temporal similarities when
loading the shared graph partitions
[Figure: within one iteration, Jobs 1–3 traverse the shared partitions P1–P4 in different orders along a timeline (e.g., Job 1: P2, P3; Job 2: P2, P1, P4; Job 3: P1, P2, P3, P4), at iterations n1, n2, and n3 respectively]
Overview of GraphM
We design a structure-aware graph labelling scheme
We propose a Sharing-Synchronize mechanism
We develop a graph partition loading strategy
We enable GraphM without user-application change
Partition Labeling
• Graph partitions are traversed once and logically labeled as chunks
[Figure: original graph with vertices v0–v5, represented in GridGraph's format and divided into partitions P1–P4, with edges (v0,v1), (v0,v2), (v3,v1), (v4,v2), (v0,v3), (v0,v5), (v2,v4), (v2,v5), (v5,v3), (v5,v4)]
Each chunk consists of two edges
Chunk tables: Chunk 1 → <v0, 2>; Chunk 2 → <v3, 1>, <v4, 1>; Chunk 3 → <v0, 2>; …
Each entry records the ID of a source vertex and the number of its outgoing edges in the chunk
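As an illustration, a chunk table like the one above can be built by scanning a partition's edge list and recording, per chunk, each source vertex together with its number of outgoing edges in that chunk. This is a minimal Python sketch, not GraphM's actual implementation; the chunk size of two edges follows the example.

```python
# Build per-chunk tables <source vertex, #outgoing edges in chunk>
# from an edge list. Illustrative sketch only.
from collections import OrderedDict

def build_chunk_tables(edges, chunk_size=2):
    tables = []
    for start in range(0, len(edges), chunk_size):
        chunk = edges[start:start + chunk_size]
        table = OrderedDict()          # keep first-seen source order
        for src, _dst in chunk:
            table[src] = table.get(src, 0) + 1
        tables.append(list(table.items()))
    return tables

# First six edges of the example graph, in GridGraph's storage order
edges = [("v0", "v1"), ("v0", "v2"), ("v3", "v1"),
         ("v4", "v2"), ("v0", "v3"), ("v0", "v5")]
print(build_chunk_tables(edges))
# [[('v0', 2)], [('v3', 1), ('v4', 1)], [('v0', 2)]]
```

This reproduces the three chunk tables shown on the slide: <v0, 2>, then <v3, 1> and <v4, 1>, then <v0, 2>.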
Share-Synchronize Mechanism
• Memory Sharing of Graph Structure (Sharing())
- Load an assigned active partition
- Resume or suspend the corresponding jobs
- Share the graph structure data
• Why fine-grained synchronization is needed
- The amount of edges that needs to be processed differs across jobs
- The computational complexity of the edge processing function differs across jobs
• Fine-grained Synchronization
- Profiling phase: obtain T(E), T(Fj)
- Syncing phase: acquire the workload of each chunk, then unevenly allocate CPU resources
T(E): the average data access time for each edge
T(Fj): the computational complexity of the edge processing function of job j
[Figure: graph G = {P1, P2, P3, P4} on disk; the loaded partition P3 is streamed chunk by chunk through the cache and shared in memory by the snapshots S1, S2, S3 of concurrent Jobs 1–3, which are resumed or suspended as needed]
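The Sharing() idea can be sketched as a loop that streams each partition once and lets every concurrent job with that partition active process the single shared copy, leaving the other jobs suspended for that round. A simplified, single-threaded Python sketch under that assumption (the real mechanism uses shared memory and per-job snapshots):

```python
# Simplified sketch of the Share-Synchronize idea: each partition is
# loaded once and shared by every concurrent job that has it active.
def share_synchronize(partitions, jobs):
    loads = 0
    for pid, partition in enumerate(partitions):
        # Jobs whose current iteration needs this partition are resumed;
        # the others stay suspended for this round.
        active = [j for j in jobs if pid in j["active_partitions"]]
        if not active:
            continue
        loads += 1                      # one load serves all active jobs
        for job in active:
            job["process"](partition)   # job-specific edge processing
    return loads

jobs = [
    {"active_partitions": {0, 1}, "process": lambda p: None},
    {"active_partitions": {1, 2}, "process": lambda p: None},
]
# 3 partitions, each loaded at most once instead of once per job
print(share_synchronize([["e"], ["e"], ["e"]], jobs))  # 3
```

Without sharing, the two jobs above would trigger four partition loads; consolidating the accesses brings it down to three.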
Ensuring of Consistent Snapshots
• The mutations (by some jobs) and updates (over time) of the shared graph structure
data are isolated among concurrent jobs to ensure the correctness of the processing
[Figure: the shared graph structure (Chunks 1–4) resides in shared physical memory and is mapped into each job's virtual address space; Chunk 2 (mutated by Job 2) and Chunk 3 (updated after Job 1 was submitted) are copied into the private graph structure of the affected job, while the unchanged chunks remain shared]
Job 1 is submitted before Job 2, Chunk 3 is updated after Job 1 is submitted, and Chunk 2 is modified by Job 2
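The isolation described above behaves like copy-on-write: a job reads every chunk through the shared copy until a chunk is mutated for that job, at which point only that job gets a private copy. A hypothetical Python sketch of the lookup rule (class and method names are illustrative, not GraphM's API):

```python
# Copy-on-write view of the shared graph structure: a job reads its
# private copy of a chunk if one exists, otherwise the shared chunk.
class SnapshotView:
    def __init__(self, shared_chunks):
        self.shared = shared_chunks   # one copy shared by all jobs
        self.private = {}             # chunk_id -> this job's copy

    def mutate(self, chunk_id, new_chunk):
        # Mutations by this job go into a private copy, leaving the
        # shared structure untouched for the other jobs.
        self.private[chunk_id] = new_chunk

    def read(self, chunk_id):
        return self.private.get(chunk_id, self.shared[chunk_id])

shared = {1: "c1", 2: "c2", 3: "c3", 4: "c4"}
job2 = SnapshotView(shared)
job2.mutate(2, "c2'")                  # Job 2 modifies Chunk 2
print(job2.read(2), job2.read(3))      # c2' c3
print(shared[2])                       # c2  (other jobs still see the original)
```

Each job therefore works on a consistent snapshot while paying the copy cost only for the chunks that actually diverge.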
Scheduling Strategy for Partition Loading
• The partitions are given a higher priority
- when they are handled by the jobs with fewer active partitions
- when they are processed by more jobs
Example: Partition 1 is activated by the other partitions of Job 1 in the x-th iteration, and can then be handled in the (x+1)-th iteration for Job 1
System Architecture
[Figure: user applications run unchanged on an existing graph processing framework (graph API + processing engine), which is layered on our graph storage system above the OS; the storage system consists of a graph preprocessor, a graph sharing controller, and a synchronization manager; the original graph data is converted into the framework-specific representation plus chunk tables, and graph partitions (Chunks 1–4) are streamed from disk through memory and the LLC to the CPU]
Integrated with Existing Framework
No burden on programmers + Minor framework change
An example to illustrate how to integrate GraphM into an existing graph processing framework
/*Edge streaming function in GridGraph integrated with GraphM*/
GraphM.Init() /*Initialization of GraphM*/
StreamEdges(){
/*Setup the active partitions*/
GraphM.GetActiveVertices()
for(each active partition){
partition ← GraphM.Sharing(G, load())
/*Notify GraphM to start synchronization*/
GraphM.Start()
for(each edge partition)
/*Process the streamed edges*/
/*Notify GraphM to end synchronization*/
GraphM.Barrier()
}
}
/*Edge streaming function in GridGraph*/
StreamEdges(){
/*Setup the active partitions*/
for(each active partition){
/* The original data load operation*/
partition ← load()
for(each edge partition)
/*Process the streamed edges*/
}
}
(a) Pseudocode of GridGraph (b) Pseudocode of GridGraph integrated with GraphM
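For illustration, the integrated streaming loop can be sketched as runnable Python; the GraphM-style calls (sharing, start, barrier) are stand-ins for the real API, not its actual signatures.

```python
# Runnable sketch of plugging a GraphM-like storage layer into an
# edge-streaming loop; class and method names are illustrative only.
class GraphMStub:
    def __init__(self, graph):
        self.graph = graph            # partition id -> edge list
        self.loads = 0
    def sharing(self, pid, load_fn):
        # Load the partition once and hand back the shared copy.
        self.loads += 1
        return load_fn(pid)
    def start(self):   pass           # begin fine-grained synchronization
    def barrier(self): pass           # end of this partition's processing

def stream_edges(gm, active_partitions, process_edge):
    processed = 0
    for pid in active_partitions:
        partition = gm.sharing(pid, lambda p: gm.graph[p])
        gm.start()
        for edge in partition:        # process the streamed edges
            process_edge(edge)
            processed += 1
        gm.barrier()
    return processed

gm = GraphMStub({0: [("v0", "v1"), ("v0", "v2")], 1: [("v3", "v1")]})
print(stream_edges(gm, [0, 1], lambda e: None))  # 3
```

The framework's own loop structure is untouched; only the load call and two notification hooks change, which is why no user-application change is needed.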
• Background and Challenges
• GraphM
• Experimental Results
• Conclusion
Outline
• Machine information
- CPU: 2-way 8-core Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz
- each CPU has 20 MB Last-Level Cache
- Main Memory: 32GB
• Typical graph processing algorithms
- PageRank, WCC, BFS, SSSP
• Datasets
- 5 real world datasets
• Evaluated graph processing systems
- GraphChi, GridGraph, PowerGraph, Chaos
Datasets Vertices Edges Data sizes
LiveJ 4.8 M 69 M 526 MB
Orkut 3.1 M 117.2 M 894 MB
Twitter 41.7 M 1.5 B 10.9 GB
UK-union 133.6 M 5.5 B 40.1 GB
Clueweb12 978.4 M 42.6 B 317 GB
Experiment Setup
Properties of data sets
< 32 GB
> 32 GB
Evaluated schemes: GridGraph-S, GridGraph-C, and GridGraph-M (GridGraph integrated with GraphM)
Evaluation: Overall Performance
[Figure: (left) normalized total execution time for the 16 jobs under GridGraph-S, GridGraph-C, and GridGraph-M on LiveJ, Orkut, Twitter, UK-union, and Clueweb; (right) execution time breakdown into graph processing time and data accessing time for each scheme and data set]
Total execution time for the 16 jobs with different schemes
Execution time breakdown of jobs with different schemes
• Shorter total execution time
• Much lower graph data accessing cost
• The data accessing time is reduced by 11.48 times and 13.06 times
Evaluation: Volume of Data Access
Total I/O overhead for 16 jobs with
different schemes
Volume of data swapped into the LLC for
16 jobs with different schemes
[Figure: normalized volume of data swapped into the LLC (left) and normalized I/O overhead (right) for the 16 jobs under GridGraph-S, GridGraph-C, and GridGraph-M on each data set]
• Smaller volume of data accesses
• Greatly reduced I/O overhead in the case of out-of-core processing
Evaluation: Scalability
Total execution time for different
number of jobs
Total execution time on different
number of CPU cores
[Figure: (left) total execution time (hours) for 1, 2, 4, 8, and 16 concurrent jobs; (right) total execution time (hours) on 1, 2, 4, 8, and 16 CPU cores, for GridGraph-S, GridGraph-C, and GridGraph-M]
• Better speedup is achieved when the number of jobs increases
• Simply adopting the original frameworks to support concurrent jobs is a poor choice
Evaluation: Integration with Other Frameworks
Execution time (in seconds) of 64 jobs for different frameworks, where "—"
means the run failed due to memory errors. PowerGraph and Chaos are run
on a cluster of 128 nodes connected via 1-Gigabit Ethernet.
LiveJ Orkut Twitter UK-union Clueweb12
GraphChi-S 2,348 2,248 43,032 149,352 > 1 week
GraphChi-C 776 696 10,580 38,760 > 1 week
GraphChi-M 344 468 6,128 12,436 248,840
PowerGraph-S 92 144 1,408 7,183 —
PowerGraph-C 83 111 1,153 6,653 —
PowerGraph-M 43 75 795 3,820 —
Chaos-S 224 159 4,668 29,538 487,272
Chaos-C 516 588 12,011 30,943 > 1 week
Chaos-M 121 106 2,261 10,614 156,881
• Pre-processing → format conversion, graph partition labelling
• Result → small extra overhead
- Can be amortized by reuse
Evaluation: Pre-processing Cost
LiveJ Orkut Twitter UK-union Clueweb12
Extra Size 70.6 MB 49.2 MB 2.09 GB 4.5 GB 19.9 GB
Extra Ratio 13.4% 5.5% 19.2% 11.2% 6.3%
LiveJ Orkut Twitter UK-union Clueweb12
GridGraph 20.89 35.07 439.59 2,312.11 19,267.28
GridGraph-M 21.86 35.76 463.65 2,681.04 22,401.90
Extra Ratio 4.6% 2.0% 5.5% 16.0% 16.3%
Preprocessing time (in seconds)
Extra storage cost
• Background and Challenges
• GraphM
• Experimental Results
• Conclusion
Outline
Conclusion
➢What GraphM brings in graph processing
• Analysis of spatial/temporal similarities between concurrent graph processing jobs
• A novel Share-Synchronize mechanism for concurrent graph processing
• A scheduling strategy for out-of-core graph processing
• Requires no application change and only minor change in framework
➢ Future work
• How to exploit new hardware (e.g., FPGA or even ASIC) to accelerate the data accesses of concurrent jobs for higher throughput
• How to further optimize GraphM for distributed platforms and for evolving graph processing
• How to further ensure security
THANK YOU!
Service Computing Technology and System Lab., MoE (SCTS)
Cluster and Grid Computing Lab., Hubei Province (CGCL)
BACKUP SLIDES
Fine-grained Synchronization Execution
• Profiling Phase
• Syncing Phase
Ci: The set of chunks in the partition Pi
Vk: The set of vertices in the kth chunk
Aj: The set of active vertices for the jth job
: The number of out-going edges of the vertex v in the kth chunk
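The original formulas on this slide did not survive extraction; the sketch below only illustrates the kind of computation involved, under an assumed cost model: estimate each job's workload on a chunk from the out-edges of its active vertices, weighted by the profiled costs T(E) and T(Fj), then split CPU time proportionally.

```python
# Illustrative workload estimate for fine-grained synchronization.
# The cost model (active edges * (T(E) + T(F_j))) is an assumption,
# not the paper's exact formula.
def chunk_workload(out_edges_in_chunk, active_vertices, t_e, t_f):
    active_edges = sum(deg for v, deg in out_edges_in_chunk.items()
                       if v in active_vertices)
    return active_edges * (t_e + t_f)

def cpu_shares(workloads):
    # Unevenly allocate CPU resources in proportion to workload.
    total = sum(workloads.values())
    return {j: w / total for j, w in workloads.items()}

chunk = {"v0": 2, "v3": 1, "v4": 1}   # vertex -> out-degree in chunk
w = {
    "job1": chunk_workload(chunk, {"v0"}, t_e=1.0, t_f=1.0),
    "job2": chunk_workload(chunk, {"v0", "v3", "v4"}, t_e=1.0, t_f=3.0),
}
print(cpu_shares(w))  # {'job1': 0.2, 'job2': 0.8}
```

A job with more active edges and a costlier edge function gets a proportionally larger CPU share, so both jobs finish the shared chunk at roughly the same time.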
Priorities of Graph Partitions
• Ji: The set of jobs to handle Pi in the next iteration
• Nj(P): The number of active partitions of the jth job
• N(Ji): The number of jobs in the set Ji
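Using the symbols above, one plausible ranking that favors partitions needed by more jobs (larger N(Ji)) and by jobs with fewer remaining active partitions (smaller Nj(P)) is the score N(Ji) / min Nj(P). This concrete formula is an assumption for illustration, not the paper's.

```python
# Hypothetical priority score for a partition Pi: higher when more
# jobs need it and when those jobs have few active partitions left.
# The exact formula is assumed, not taken from the paper.
def partition_priority(jobs_needing_pi, active_counts):
    n_ji = len(jobs_needing_pi)                    # N(J_i)
    if n_ji == 0:
        return 0.0
    min_active = min(active_counts[j] for j in jobs_needing_pi)  # min N_j(P)
    return n_ji / min_active

active_counts = {"job1": 1, "job2": 4, "job3": 4}
p1 = partition_priority({"job1", "job2"}, active_counts)  # 2 / 1 = 2.0
p2 = partition_priority({"job2", "job3"}, active_counts)  # 2 / 4 = 0.5
print(p1 > p2)  # True: P1 is loaded first
```

Here P1 is prioritized because job1 has only one active partition left, so loading P1 lets it finish its iteration sooner.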
Evaluation: Scheduling Strategy
Total execution time for the 16 jobs without/with our scheduling
[Figure: normalized total execution time for the 16 jobs under GridGraph-M-without (no scheduling strategy) and GridGraph-M on LiveJ, Orkut, Twitter, UK-union, and Clueweb]