SC 2019
GraphM: An Efficient Storage System for High Throughput of Concurrent Graph Processing
Jin Zhao1, Yu Zhang1, Xiaofei Liao1, Ligang He2, Bingsheng He3, Hai Jin1, Haikun Liu1, Yicheng Chen1
1 SCTS/CGCL, Huazhong University of Science and Technology, China
2 University of Warwick, UK  3 National University of Singapore, Singapore
• Background and Challenges
• GraphM
• Experimental Results
• Conclusion
Outline
Social network Internet Road network
• Graph processing → rapidly growing demand in the real world
…
Graph Processing
• Graphs → ubiquitously preferred data representation
[Figure: number of concurrent jobs (y-axis) over time in hours (x-axis)]
Number of jobs traced on a social network
More than 30 jobs are
concurrently executed on the
same platform at the peak time
Concurrent Graph Processing
• Many concurrent graph processing jobs are often handled on the same underlying
graph (or its different snapshots) to provide various information for different products
K-means SSSP
…
PageRank
Graph Processing Framework
Shared Graph Data
Concurrent iterative graph processing jobs
Chaos, PowerGraph, GraphChi, GridGraph, …
Existing Graph Processing Systems
• Higher sequential memory bandwidth
• Better data locality
• Smaller memory consumption
• Fewer redundant data accesses
…
Mainly designed to efficiently handle a single graph processing job
[Figure: Jobs 1–3 each keep a private graph copy and job-specific data in memory and cache, while the graph data resides in secondary storage]
Execution of concurrent graph processing jobs on existing systems
Reason 1: A lot of redundant consumption of
memory resources and data access channels
[Figure (a): total memory usage (GB) for 1, 2, 4, and 8 concurrent jobs of PageRank, WCC, BFS, and SSSP]
[Figure (b): total last-level cache (LLC) misses (billions) for 1, 2, 4, and 8 concurrent jobs of PageRank, WCC, BFS, and SSSP]
Challenges: Redundant Data Access Overhead
The memory usage and the total amount of graph data loaded into the LLC
increase as more concurrent jobs are executed on the same platform
The LLC misses per instruction (LPI) and the average execution time of each job
increase as more concurrent jobs are executed on the same platform
Challenges: Redundant Data Access Overhead
[Figure (c): average LLC misses per instruction (LPI) for 1, 2, 4, and 8 concurrent jobs of PageRank, WCC, BFS, and SSSP]
[Figure (d): average execution time (seconds) for 1, 2, 4, and 8 concurrent jobs of PageRank, WCC, BFS, and SSSP]
Reason 2: Serious contention for storage
resources and data access channels
Observations
[Figure (a): percentage of the graph shared by more than 1, 2, 4, and 8 jobs over six hours]
[Figure (b): average data access times of the graph data over six hours]
Spatial Similarity
Most of the same graph can be shared by multiple
concurrent jobs during their traversals
Temporal Similarity
The same graph data may be repeatedly accessed
by different concurrent jobs over a period of time
Information traced on the social network
Motivations
➢ How to utilize the spatial/temporal similarities?
Spatial Similarity
Maintain a single copy of the same graph structure data in the storage to
serve the concurrent jobs
Temporal Similarity
Consolidate the accesses to the same graph structure data for
concurrent jobs
Motivations
➢ How to utilize the spatial/temporal similarities in a practical way?
Option #1: Design a graph processing framework
• New programming model for graph algorithms
• Requires changes in user-level applications
Option #2: Develop a graph storage system
• Several APIs for existing graph processing frameworks
• Can be transparent to application programmers
• Background and Challenges
• GraphM
• Experimental Results
• Conclusion
Outline
GraphM Explores…
• Traditional graph storage approach: each job j maintains its own data
D1 = (V1, E1, W1, S1)
D2 = (V2, E2, W2, S2)
…
DJ = (VJ, EJ, WJ, SJ)
• Main goals
- Most graph structure data G = (V, E, W) is the same for different concurrent graph processing jobs
- The storage of the same graph structure data (V, E, W), and the data access to it, can be shared by different concurrent graph processing jobs
GraphM Explores…
• The challenges of utilizing the similarities
- Concurrent jobs access the shared partitions in an
individual manner along different graph paths
- The processing time of each graph structure partition
varies across different jobs
• Our expectations
- Load the shared graph partitions along a common
order for concurrent jobs
- Take into account the temporal similarities when
loading the shared graph partitions
[Figure: within one iteration, Jobs 1–3 traverse the shared partitions P1–P4 in different orders along a timeline (e.g., Job 1: P2, P3; Job 2: P2, P1, P4; Job 3: P1, P2, P3, P4), at iterations n1, n2, and n3 respectively]
Overview of GraphM
We design a structure-aware graph labelling scheme
We propose a Sharing-Synchronize mechanism
We develop a graph partition loading strategy
We enable GraphM without user-application change
Partition Labeling
• Graph partitions are traversed once and logically labeled as chunks
[Figure: original graph with vertices v0–v5, represented in GridGraph's format and divided into partitions P1–P4, with edges (v0,v1), (v0,v2), (v3,v1), (v4,v2), (v0,v3), (v0,v5), (v2,v4), (v2,v5), (v5,v3), (v5,v4)]
Each chunk consists of two edges
Chunk tables: Chunk 1 → <v0, 2>; Chunk 2 → <v3, 1>, <v4, 1>; Chunk 3 → <v0, 2>; …
Each entry records the ID of a source vertex and the number of its outgoing edges in the chunk
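As an illustration, a chunk table like the one above can be built by scanning a partition's edge list and recording, per chunk, each source vertex together with its number of outgoing edges in that chunk. This is a minimal Python sketch, not GraphM's actual implementation; the chunk size of two edges follows the example.

```python
# Build per-chunk tables <source vertex, #outgoing edges in chunk>
# from an edge list. Illustrative sketch only.
from collections import OrderedDict

def build_chunk_tables(edges, chunk_size=2):
    tables = []
    for start in range(0, len(edges), chunk_size):
        chunk = edges[start:start + chunk_size]
        table = OrderedDict()          # keep first-seen source order
        for src, _dst in chunk:
            table[src] = table.get(src, 0) + 1
        tables.append(list(table.items()))
    return tables

# First six edges of the example graph, in GridGraph's storage order
edges = [("v0", "v1"), ("v0", "v2"), ("v3", "v1"),
         ("v4", "v2"), ("v0", "v3"), ("v0", "v5")]
print(build_chunk_tables(edges))
# [[('v0', 2)], [('v3', 1), ('v4', 1)], [('v0', 2)]]
```

This reproduces the three chunk tables shown on the slide: <v0, 2>, then <v3, 1> and <v4, 1>, then <v0, 2>.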
Share-Synchronize Mechanism
• Memory Sharing of Graph Structure (Sharing())
- Load an assigned active partition
- Resume or suspend the corresponding jobs
- Share the graph structure data
• Why fine-grained synchronization is needed
- The amount of edges that needs to be processed differs across jobs
- The computational complexity of the edge processing function differs across jobs
• Fine-grained Synchronization
- Profiling phase: obtain T(E), T(Fj)
- Syncing phase: acquire the workload of each chunk, then unevenly allocate CPU resources
T(E): the average data access time for each edge
T(Fj): the computational complexity of the edge processing function of job j
[Figure: graph G = {P1, P2, P3, P4} on disk; the loaded partition P3 is streamed chunk by chunk through the cache and shared in memory by the snapshots S1, S2, S3 of concurrent Jobs 1–3, which are resumed or suspended as needed]
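The Sharing() idea can be sketched as a loop that streams each partition once and lets every concurrent job with that partition active process the single shared copy, leaving the other jobs suspended for that round. A simplified, single-threaded Python sketch under that assumption (the real mechanism uses shared memory and per-job snapshots):

```python
# Simplified sketch of the Share-Synchronize idea: each partition is
# loaded once and shared by every concurrent job that has it active.
def share_synchronize(partitions, jobs):
    loads = 0
    for pid, partition in enumerate(partitions):
        # Jobs whose current iteration needs this partition are resumed;
        # the others stay suspended for this round.
        active = [j for j in jobs if pid in j["active_partitions"]]
        if not active:
            continue
        loads += 1                      # one load serves all active jobs
        for job in active:
            job["process"](partition)   # job-specific edge processing
    return loads

jobs = [
    {"active_partitions": {0, 1}, "process": lambda p: None},
    {"active_partitions": {1, 2}, "process": lambda p: None},
]
# 3 partitions, each loaded at most once instead of once per job
print(share_synchronize([["e"], ["e"], ["e"]], jobs))  # 3
```

Without sharing, the two jobs above would trigger four partition loads; consolidating the accesses brings it down to three.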
Ensuring of Consistent Snapshots
• The mutations (by some jobs) and updates (over time) of the shared graph structure
data are isolated among concurrent jobs to ensure the correctness of the processing
[Figure: the shared graph structure (Chunks 1–4) resides in shared physical memory and is mapped into each job's virtual address space; Chunk 2 (mutated by Job 2) and Chunk 3 (updated after Job 1 was submitted) are copied into the private graph structure of the affected job, while the unchanged chunks remain shared]
Job 1 is submitted before Job 2, Chunk 3 is updated after Job 1 is submitted, and Chunk 2 is modified by Job 2
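The isolation described above behaves like copy-on-write: a job reads every chunk through the shared copy until a chunk is mutated for that job, at which point only that job gets a private copy. A hypothetical Python sketch of the lookup rule (class and method names are illustrative, not GraphM's API):

```python
# Copy-on-write view of the shared graph structure: a job reads its
# private copy of a chunk if one exists, otherwise the shared chunk.
class SnapshotView:
    def __init__(self, shared_chunks):
        self.shared = shared_chunks   # one copy shared by all jobs
        self.private = {}             # chunk_id -> this job's copy

    def mutate(self, chunk_id, new_chunk):
        # Mutations by this job go into a private copy, leaving the
        # shared structure untouched for the other jobs.
        self.private[chunk_id] = new_chunk

    def read(self, chunk_id):
        return self.private.get(chunk_id, self.shared[chunk_id])

shared = {1: "c1", 2: "c2", 3: "c3", 4: "c4"}
job2 = SnapshotView(shared)
job2.mutate(2, "c2'")                  # Job 2 modifies Chunk 2
print(job2.read(2), job2.read(3))      # c2' c3
print(shared[2])                       # c2  (other jobs still see the original)
```

Each job therefore works on a consistent snapshot while paying the copy cost only for the chunks that actually diverge.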
Scheduling Strategy for Partition Loading
• The partitions are given a higher priority
- when they are handled by the jobs with fewer active partitions
- when they are processed by more jobs
Example: Partition 1 is activated by the other partitions of Job 1 in the x-th iteration, and can then be handled in the (x+1)-th iteration for Job 1
System Architecture
[Figure: user applications run unchanged on an existing graph processing framework (graph API + processing engine), which is layered on our graph storage system above the OS; the storage system consists of a graph preprocessor, a graph sharing controller, and a synchronization manager; the original graph data is converted into the framework-specific representation plus chunk tables, and graph partitions (Chunks 1–4) are streamed from disk through memory and the LLC to the CPU]
Integrated with Existing Framework
No burden on programmers + Minor framework change
An example to illustrate how to integrate GraphM into an existing graph processing framework
/*Edge streaming function in GridGraph integrated with GraphM*/
GraphM.Init() /*Initialization of GraphM*/
StreamEdges(){
/*Setup the active partitions*/
GraphM.GetActiveVertices()
for(each active partition){
partition ← GraphM.Sharing(G, load())
/*Notify GraphM to start synchronization*/
GraphM.Start()
for(each edge partition)
/*Process the streamed edges*/
/*Notify GraphM to end synchronization*/
GraphM.Barrier()
}
}
/*Edge streaming function in GridGraph*/
StreamEdges(){
/*Setup the active partitions*/
for(each active partition){
/* The original data load operation*/
partition ← load()
for(each edge partition)
/*Process the streamed edges*/
}
}
(a) Pseudocode of GridGraph (b) Pseudocode of GridGraph integrated with GraphM
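For illustration, the integrated streaming loop can be sketched as runnable Python; the GraphM-style calls (sharing, start, barrier) are stand-ins for the real API, not its actual signatures.

```python
# Runnable sketch of plugging a GraphM-like storage layer into an
# edge-streaming loop; class and method names are illustrative only.
class GraphMStub:
    def __init__(self, graph):
        self.graph = graph            # partition id -> edge list
        self.loads = 0
    def sharing(self, pid, load_fn):
        # Load the partition once and hand back the shared copy.
        self.loads += 1
        return load_fn(pid)
    def start(self):   pass           # begin fine-grained synchronization
    def barrier(self): pass           # end of this partition's processing

def stream_edges(gm, active_partitions, process_edge):
    processed = 0
    for pid in active_partitions:
        partition = gm.sharing(pid, lambda p: gm.graph[p])
        gm.start()
        for edge in partition:        # process the streamed edges
            process_edge(edge)
            processed += 1
        gm.barrier()
    return processed

gm = GraphMStub({0: [("v0", "v1"), ("v0", "v2")], 1: [("v3", "v1")]})
print(stream_edges(gm, [0, 1], lambda e: None))  # 3
```

The framework's own loop structure is untouched; only the load call and two notification hooks change, which is why no user-application change is needed.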
• Background and Challenges
• GraphM
• Experimental Results
• Conclusion
Outline
• Machine information
- CPU: 2-way 8-core Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz
- each CPU has 20 MB Last-Level Cache
- Main Memory: 32GB
• Typical graph processing algorithms
- PageRank, WCC, BFS, SSSP
• Datasets
- 5 real world datasets
• Evaluated graph processing systems
- GraphChi, GridGraph, PowerGraph, Chaos
Datasets Vertices Edges Data sizes
LiveJ 4.8 M 69 M 526 MB
Orkut 3.1 M 117.2 M 894 MB
Twitter 41.7 M 1.5 B 10.9 GB
UK-union 133.6 M 5.5 B 40.1 GB
Clueweb12 978.4 M 42.6 B 317 GB
Experiment Setup
Properties of data sets
< 32 GB
> 32 GB
Evaluated schemes: GridGraph-S, GridGraph-C, and GridGraph-M (GridGraph integrated with GraphM)
Evaluation: Overall Performance
[Figure: (left) normalized total execution time for the 16 jobs under GridGraph-S, GridGraph-C, and GridGraph-M on LiveJ, Orkut, Twitter, UK-union, and Clueweb; (right) execution time breakdown into graph processing time and data accessing time for each scheme and data set]
Total execution time for the 16 jobs with different schemes
Execution time breakdown of jobs with different schemes
• Shorter total execution time
• Much lower graph data accessing cost
• The data accessing time is reduced by 11.48 times and 13.06 times
Evaluation: Volume of Data Access
Total I/O overhead for 16 jobs with
different schemes
Volume of data swapped into the LLC for
16 jobs with different schemes
[Figure: normalized volume of data swapped into the LLC (left) and normalized I/O overhead (right) for the 16 jobs under GridGraph-S, GridGraph-C, and GridGraph-M on each data set]
• Smaller volume of data accesses
• Greatly reduced I/O overhead in the case of out-of-core processing
Evaluation: Scalability
Total execution time for different
number of jobs
Total execution time on different
number of CPU cores
[Figure: (left) total execution time (hours) for 1, 2, 4, 8, and 16 concurrent jobs; (right) total execution time (hours) on 1, 2, 4, 8, and 16 CPU cores, for GridGraph-S, GridGraph-C, and GridGraph-M]
• Better speedup is achieved when the number of jobs increases
• Simply adopting the original frameworks to support concurrent jobs is a poor choice
Evaluation: Integration with Other Frameworks
Execution time (in seconds) of 64 jobs for different frameworks, where "—"
means the run failed due to memory errors. PowerGraph and Chaos are run
on a cluster of 128 nodes connected via 1-Gigabit Ethernet.
LiveJ Orkut Twitter UK-union Clueweb12
GraphChi-S 2,348 2,248 43,032 149,352 > 1 week
GraphChi-C 776 696 10,580 38,760 > 1 week
GraphChi-M 344 468 6,128 12,436 248,840
PowerGraph-S 92 144 1,408 7,183 —
PowerGraph-C 83 111 1,153 6,653 —
PowerGraph-M 43 75 795 3,820 —
Chaos-S 224 159 4,668 29,538 487,272
Chaos-C 516 588 12,011 30,943 > 1 week
Chaos-M 121 106 2,261 10,614 156,881
• Pre-processing → format conversion, graph partition labelling
• Result → small extra overhead
- Can be amortized by reuse
Evaluation: Pre-processing Cost
LiveJ Orkut Twitter UK-union Clueweb12
Extra Size 70.6 MB 49.2 MB 2.09 GB 4.5 GB 19.9 GB
Extra Ratio 13.4% 5.5% 19.2% 11.2% 6.3%
LiveJ Orkut Twitter UK-union Clueweb12
GridGraph 20.89 35.07 439.59 2,312.11 19,267.28
GridGraph-M 21.86 35.76 463.65 2,681.04 22,401.90
Extra Ratio 4.6% 2.0% 5.5% 16.0% 16.3%
Preprocessing time (in seconds)
Extra storage cost
• Background and Challenges
• GraphM
• Experimental Results
• Conclusion
Outline
Conclusion
➢What GraphM brings in graph processing
• Analysis of spatial/temporal similarities between concurrent graph processing jobs
• A novel Share-Synchronize mechanism for concurrent graph processing
• A scheduling strategy for out-of-core graph processing
• Requires no application change and only minor change in framework
➢ Future work
• How to exploit new hardware (e.g., FPGA or even ASIC) to accelerate the data accesses of concurrent jobs for higher throughput
• How to further optimize GraphM for distributed platforms and for evolving graph processing
• How to further ensure security
THANK YOU!
Service Computing Technology and System Lab., MoE (SCTS)
Cluster and Grid Computing Lab., Hubei Province (CGCL)
BACKUP SLIDES
Fine-grained Synchronization Execution
• Profiling Phase
• Syncing Phase
Ci: The set of chunks in the partition Pi
Vk: The set of vertices in the kth chunk
Aj: The set of active vertices for the jth job
: The number of out-going edges of the vertex v in the kth chunk
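The original formulas on this slide did not survive extraction; the sketch below only illustrates the kind of computation involved, under an assumed cost model: estimate each job's workload on a chunk from the out-edges of its active vertices, weighted by the profiled costs T(E) and T(Fj), then split CPU time proportionally.

```python
# Illustrative workload estimate for fine-grained synchronization.
# The cost model (active edges * (T(E) + T(F_j))) is an assumption,
# not the paper's exact formula.
def chunk_workload(out_edges_in_chunk, active_vertices, t_e, t_f):
    active_edges = sum(deg for v, deg in out_edges_in_chunk.items()
                       if v in active_vertices)
    return active_edges * (t_e + t_f)

def cpu_shares(workloads):
    # Unevenly allocate CPU resources in proportion to workload.
    total = sum(workloads.values())
    return {j: w / total for j, w in workloads.items()}

chunk = {"v0": 2, "v3": 1, "v4": 1}   # vertex -> out-degree in chunk
w = {
    "job1": chunk_workload(chunk, {"v0"}, t_e=1.0, t_f=1.0),
    "job2": chunk_workload(chunk, {"v0", "v3", "v4"}, t_e=1.0, t_f=3.0),
}
print(cpu_shares(w))  # {'job1': 0.2, 'job2': 0.8}
```

A job with more active edges and a costlier edge function gets a proportionally larger CPU share, so both jobs finish the shared chunk at roughly the same time.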
Priorities of Graph Partitions
• Ji: The set of jobs to handle Pi in the next iteration
• Nj(P): The number of active partitions of the jth job
• N(Ji): The number of jobs in the set Ji
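Using the symbols above, one plausible ranking that favors partitions needed by more jobs (larger N(Ji)) and by jobs with fewer remaining active partitions (smaller Nj(P)) is the score N(Ji) / min Nj(P). This concrete formula is an assumption for illustration, not the paper's.

```python
# Hypothetical priority score for a partition Pi: higher when more
# jobs need it and when those jobs have few active partitions left.
# The exact formula is assumed, not taken from the paper.
def partition_priority(jobs_needing_pi, active_counts):
    n_ji = len(jobs_needing_pi)                    # N(J_i)
    if n_ji == 0:
        return 0.0
    min_active = min(active_counts[j] for j in jobs_needing_pi)  # min N_j(P)
    return n_ji / min_active

active_counts = {"job1": 1, "job2": 4, "job3": 4}
p1 = partition_priority({"job1", "job2"}, active_counts)  # 2 / 1 = 2.0
p2 = partition_priority({"job2", "job3"}, active_counts)  # 2 / 4 = 0.5
print(p1 > p2)  # True: P1 is loaded first
```

Here P1 is prioritized because job1 has only one active partition left, so loading P1 lets it finish its iteration sooner.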
Evaluation: Scheduling Strategy
Total execution time for the 16 jobs without/with our scheduling
[Figure: normalized total execution time for the 16 jobs under GridGraph-M-without (no scheduling strategy) and GridGraph-M on LiveJ, Orkut, Twitter, UK-union, and Clueweb]