Posted on 22-Sep-2020
transcript
NumaGiC: A garbage collector for big data on big NUMA machines
Lokesh Gidra‡, Gaël Thomas†, Julien Sopena‡, Marc Shapiro‡, Nhan Nguyen♀
‡ LIP6/UPMC-INRIA  † Telecom SudParis  ♀ Chalmers University
Motivation
◼ Data-intensive applications need large machines with plenty of cores and memory
◼ But, for large heaps, GC is inefficient on such machines
[Figure: GC throughput (GB collected per second) vs. number of cores, baseline PS against ideal scalability. GC takes roughly 60% of the total time. PageRank computation of the 100-million-edge Friendster dataset with Spark on HotSpot/Parallel Scavenge, 40 GB heap, on a 48-core machine.]
Outline
◼ Why doesn’t GC scale?
◼ Our Solution: NumaGiC
◼ Evaluation
GCs don’t scale because machines are NUMA
The hardware hides the distributed memory → the application silently creates inter-node references.
But the memory distribution is also hidden from the GC threads when they traverse the object graph: a GC thread thus silently traverses remote references and continues its graph traversal on any node.
When all GC threads access any memory node, the interconnect potentially saturates → high memory access latency.
[Diagram: four NUMA nodes (0–3), each with its own memory; a GC thread follows references across nodes until the interconnect saturates.]
Outline
◼ Why doesn’t GC scale?
◼ Our Solution: NumaGiC
◼ Evaluation
How can we fix the memory locality issue?
Simply by preventing any remote memory access
Prevent remote access using messages
Enforce memory access locality by trading remote memory accesses for messages: when a GC thread finds a remote reference, it sends the reference to its home node and continues the graph traversal locally.
[Diagram: two nodes, each with its own memory and a GC thread; thread 0 forwards a remote reference to node 1 instead of dereferencing it.]
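This message-passing traversal can be sketched as follows. It is a minimal, single-threaded simulation (not HotSpot code): the `Obj` layout, the per-node queues, and the round-robin scheduling are all invented stand-ins for the real per-node GC threads.

```java
import java.util.*;

// Minimal sketch of trading remote memory accesses for messages: each node
// only marks objects homed on it; a reference to another node's object is
// forwarded to that node's queue instead of being dereferenced remotely.
public class MessagePassingTraceSketch {
    static final int NODES = 2;

    static final class Obj {
        final int home;                       // NUMA node holding this object
        final List<Obj> refs = new ArrayList<>();
        boolean marked;
        Obj(int home) { this.home = home; }
    }

    // Round-robin over per-node queues stands in for one GC thread per node.
    // Returns how many objects each node marked; every mark is node-local.
    static int[] trace(Obj root) {
        List<Deque<Obj>> queues = new ArrayList<>();
        for (int n = 0; n < NODES; n++) queues.add(new ArrayDeque<>());
        queues.get(root.home).add(root);
        int[] markedPerNode = new int[NODES];
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int n = 0; n < NODES; n++) {
                for (Obj o; (o = queues.get(n).poll()) != null; ) {
                    progress = true;
                    if (o.marked) continue;
                    o.marked = true;          // only ever touches a local object
                    markedPerNode[n]++;
                    for (Obj child : o.refs)  // remote reference? send it home
                        queues.get(child.home).add(child);
                }
            }
        }
        return markedPerNode;
    }

    public static void main(String[] args) {
        Obj a = new Obj(0), b = new Obj(1), c = new Obj(0);
        a.refs.add(b);
        b.refs.add(c);                        // cross-node chain: 0 -> 1 -> 0
        System.out.println(Arrays.toString(trace(a))); // [2, 1]
    }
}
```

Note that node 0 never dereferences `b`: it only enqueues the reference on node 1's queue, which is exactly the locality guarantee the slide describes.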
Using messages enforces local access…
…but opens up other performance challenges.
Problem 1: a message is costlier than a remote access
Too many messages → inter-node messages must be minimized.
[Diagram: nodes 0 and 1 exchanging too many messages.]
• Observation: application threads naturally create clusters of newly allocated objects; 99% of recently allocated objects are clustered.
• Approach: let objects allocated by a thread stay on its node.
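The "objects stay on the allocating thread's node" policy can be sketched as node-local bump-pointer allocation. Everything here is an invented stand-in: the fake per-node address ranges, the `currentNode` placement rule, and the bump pointers merely illustrate that objects allocated by one thread land in one node's memory, and that an address reveals its home node.

```java
// Sketch (not HotSpot code) of node-local allocation: each thread
// allocates from a region carved out of its own node's memory, so the
// clusters of objects it creates stay local. Address layout is invented.
public class NodeLocalAllocSketch {
    static final int NODES = 4;
    static final long[] regionTop = new long[NODES];
    static {
        // Fake per-node address ranges: node n owns addresses [n<<32, (n+1)<<32).
        for (int n = 0; n < NODES; n++) regionTop[n] = (long) n << 32;
    }

    // Hypothetical thread-to-node placement.
    static int currentNode(long threadId) { return (int) (threadId % NODES); }

    // Bump-pointer allocation from the calling thread's own node.
    static long allocate(long threadId, int size) {
        int node = currentNode(threadId);
        long addr = regionTop[node];
        regionTop[node] += size;
        return addr;
    }

    // The home node is recoverable from the address range alone.
    static int homeNode(long addr) { return (int) (addr >>> 32); }

    public static void main(String[] args) {
        long a = allocate(1, 64), b = allocate(1, 64);       // same thread -> same node
        System.out.println(homeNode(a) + " " + homeNode(b)); // 1 1
    }
}
```

The cheap address-to-node mapping is what lets a GC thread decide, on seeing a reference, whether to trace it locally or forward it as a message.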
Problem 2: limited parallelism
◼ Due to serialized traversal of object clusters across nodes: node 1 idles while node 0 collects its memory.
◼ Solution: an adaptive algorithm that trades off locality and parallelism:
1. Prevent remote accesses by using messages when not idling.
2. Steal and access remote objects otherwise.
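The adaptive mode switch above boils down to a simple decision rule, sketched here with invented per-node work queues: stay in the locality (message-passing) mode while local work remains, and fall back to stealing remote work only when the thread would otherwise idle.

```java
import java.util.*;

// Sketch of the locality/parallelism trade-off: a per-node GC thread drains
// its local queue first, and only when idle does it steal and trace remote
// references directly. The queue layout is an assumption for illustration.
public class AdaptiveStealSketch {
    static String nextAction(Deque<Long> local, List<Deque<Long>> others) {
        if (!local.isEmpty()) return "local";     // locality mode: messages only
        for (Deque<Long> q : others)
            if (!q.isEmpty()) return "steal";     // idle: trade locality for parallelism
        return "done";                            // no work anywhere: terminate
    }

    public static void main(String[] args) {
        Deque<Long> mine = new ArrayDeque<>(List.of(1L));
        Deque<Long> peer = new ArrayDeque<>(List.of(2L));
        System.out.println(nextAction(mine, List.of(peer))); // local
        mine.clear();
        System.out.println(nextAction(mine, List.of(peer))); // steal
        peer.clear();
        System.out.println(nextAction(mine, List.of(peer))); // done
    }
}
```

The point of the rule is that stealing reintroduces remote accesses, so it is only taken when the alternative is an idle node, which is worse for overall GC throughput.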
Outline
◼ Why doesn’t GC scale?
◼ Our Solution: NumaGiC
◼ Evaluation
Evaluation
◼ Comparison of NumaGiC with:
1. Parallel Scavenge (PS): the baseline stop-the-world GC of HotSpot
2. Improved PS: PS with lock-free data structures and an interleaved heap space
3. NAPS: Improved PS plus slightly better locality, but no messages
◼ Metrics:
• GC throughput: amount of live data collected per second (GB/s); higher is better
• Application performance: relative to Improved PS; higher is better
Experiments
Hardware settings:
1. Amd48: AMD Magny-Cours with 8 nodes, 48 threads, and 256 GB of RAM
2. Intel80: Intel Xeon E7-2860 with 4 nodes, 80 threads, and 512 GB of RAM
Benchmarks (name: description; heap size on Amd48 / Intel80):
• Spark: in-memory data analytics (PageRank computation on the 1-billion-edge Friendster dataset); 110 to 160 GB / 250 to 350 GB
• Neo4j: object graph database (single-source shortest path on the 1.8-billion-edge Friendster dataset); 110 to 160 GB / 250 to 350 GB
• SPECjbb2013: business-logic server; 24 to 40 GB / 24 to 40 GB
• SPECjbb2005: business-logic server; 4 to 8 GB / 8 to 12 GB
GC Throughput (GB collected per second)
[Figure: GC throughput vs. heap size for Spark, Neo4j, SPECjbb2013, and SPECjbb2005, comparing Improved PS, NAPS, and NumaGiC, on Amd48 (labeled speedups of 5.4X and 2.9X) and on Intel80 (up to 3.6X).]
NumaGiC multiplies GC performance up to 5.4X.
GC Throughput Scalability
[Figure: GC throughput vs. number of nodes for Spark on Amd48 with a smaller dataset of 40 GB, comparing Baseline PS, Improved PS, NAPS, NumaGiC, and ideal scalability.]
Application speedup
[Figure: speedup relative to Improved PS for Spark, Neo4j, SPECjbb2013, and SPECjbb2005 under NAPS and NumaGiC, on both machines; labeled speedups reach up to 94% on one machine and up to 37% on the other.]
Conclusion
◼ Performance of data-intensive apps relies on GC performance
◼ Memory access locality has a huge effect on GC performance
◼ Enforcing locality can be detrimental to parallelism in GCs
◼ Future work: NUMA-aware concurrent GCs
Thank You!
Large multicores provide this power
But scalability is hard to achieve, because the software stack was not designed for data analytics.
[Diagram: the software stack above the hardware (cores, memory banks, I/O controllers): application; middleware (Hadoop, Spark, Neo4j, Cassandra…); language runtime (JVM, CLI, Python, R…); operating system (Linux, Windows…); hypervisor (Xen, VMWare…).]
Hypervisors are not considered in this talk: the software stack is already complex and hard to analyze!