Posted on 22-Sep-2020
transcript
NumaGiC: A garbage collector for big data on big NUMA machines
Lokesh Gidra‡, Gaël Thomas†, Julien Sopena‡, Marc Shapiro‡, Nhan Nguyen♀
‡ LIP6/UPMC-INRIA  † Telecom SudParis  ♀ Chalmers University
Motivation
◼ Data-intensive applications need large machines with plenty of cores and memory
◼ But, for large heaps, GC is inefficient on such machines
[Figure: GC throughput (GB collected per second) vs. number of cores, baseline PS against ideal scalability. GC takes roughly 60% of the total time. PageRank computation of the 100-million-edge Friendster dataset with Spark on HotSpot/Parallel Scavenge, 40 GB heap, on a 48-core machine.]
Outline
◼ Why doesn’t GC scale?
◼ Our Solution: NumaGiC
◼ Evaluation
GCs don’t scale because machines are NUMA
The hardware hides the distributed memory → the application silently creates inter-node references.
But the memory distribution is also hidden from the GC threads when they traverse the object graph: a GC thread thus silently traverses remote references and continues its graph traversal on any node.
When all GC threads access any memory node, the interconnect potentially saturates → high memory access latency.
[Diagram: four NUMA nodes (0–3), each with its own memory; a GC thread follows references across nodes until the interconnect saturates.]
Outline
◼ Why doesn’t GC scale?
◼ Our Solution: NumaGiC
◼ Evaluation
How can we fix the memory locality issue?
Simply by preventing any remote memory access
Prevent remote access using messages
Enforce memory access locality by trading remote memory accesses for messages: when a GC thread finds a remote reference, it sends the reference to its home node and continues the graph traversal locally.
[Diagram: two nodes, each with its own memory and a GC thread; thread 0 forwards a remote reference to node 1 instead of dereferencing it.]
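This message-passing traversal can be sketched as follows. It is a minimal, single-threaded simulation (not HotSpot code): the `Obj` layout, the per-node queues, and the round-robin scheduling are all invented stand-ins for the real per-node GC threads.

```java
import java.util.*;

// Minimal sketch of trading remote memory accesses for messages: each node
// only marks objects homed on it; a reference to another node's object is
// forwarded to that node's queue instead of being dereferenced remotely.
public class MessagePassingTraceSketch {
    static final int NODES = 2;

    static final class Obj {
        final int home;                       // NUMA node holding this object
        final List<Obj> refs = new ArrayList<>();
        boolean marked;
        Obj(int home) { this.home = home; }
    }

    // Round-robin over per-node queues stands in for one GC thread per node.
    // Returns how many objects each node marked; every mark is node-local.
    static int[] trace(Obj root) {
        List<Deque<Obj>> queues = new ArrayList<>();
        for (int n = 0; n < NODES; n++) queues.add(new ArrayDeque<>());
        queues.get(root.home).add(root);
        int[] markedPerNode = new int[NODES];
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int n = 0; n < NODES; n++) {
                for (Obj o; (o = queues.get(n).poll()) != null; ) {
                    progress = true;
                    if (o.marked) continue;
                    o.marked = true;          // only ever touches a local object
                    markedPerNode[n]++;
                    for (Obj child : o.refs)  // remote reference? send it home
                        queues.get(child.home).add(child);
                }
            }
        }
        return markedPerNode;
    }

    public static void main(String[] args) {
        Obj a = new Obj(0), b = new Obj(1), c = new Obj(0);
        a.refs.add(b);
        b.refs.add(c);                        // cross-node chain: 0 -> 1 -> 0
        System.out.println(Arrays.toString(trace(a))); // [2, 1]
    }
}
```

Note that node 0 never dereferences `b`: it only enqueues the reference on node 1's queue, which is exactly the locality guarantee the slide describes.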
Using messages enforces local access…
…but opens up other performance challenges.
Problem 1: a message is costlier than a remote access
Too many messages → inter-node messages must be minimized.
[Diagram: nodes 0 and 1 exchanging too many messages.]
• Observation: application threads naturally create clusters of newly allocated objects; 99% of recently allocated objects are clustered.
• Approach: let objects allocated by a thread stay on its node.
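The "objects stay on the allocating thread's node" policy can be sketched as node-local bump-pointer allocation. Everything here is an invented stand-in: the fake per-node address ranges, the `currentNode` placement rule, and the bump pointers merely illustrate that objects allocated by one thread land in one node's memory, and that an address reveals its home node.

```java
// Sketch (not HotSpot code) of node-local allocation: each thread
// allocates from a region carved out of its own node's memory, so the
// clusters of objects it creates stay local. Address layout is invented.
public class NodeLocalAllocSketch {
    static final int NODES = 4;
    static final long[] regionTop = new long[NODES];
    static {
        // Fake per-node address ranges: node n owns addresses [n<<32, (n+1)<<32).
        for (int n = 0; n < NODES; n++) regionTop[n] = (long) n << 32;
    }

    // Hypothetical thread-to-node placement.
    static int currentNode(long threadId) { return (int) (threadId % NODES); }

    // Bump-pointer allocation from the calling thread's own node.
    static long allocate(long threadId, int size) {
        int node = currentNode(threadId);
        long addr = regionTop[node];
        regionTop[node] += size;
        return addr;
    }

    // The home node is recoverable from the address range alone.
    static int homeNode(long addr) { return (int) (addr >>> 32); }

    public static void main(String[] args) {
        long a = allocate(1, 64), b = allocate(1, 64);       // same thread -> same node
        System.out.println(homeNode(a) + " " + homeNode(b)); // 1 1
    }
}
```

The cheap address-to-node mapping is what lets a GC thread decide, on seeing a reference, whether to trace it locally or forward it as a message.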
Problem 2: limited parallelism
◼ Due to serialized traversal of object clusters across nodes: node 1 idles while node 0 collects its memory.
◼ Solution: an adaptive algorithm that trades off locality and parallelism:
1. Prevent remote accesses by using messages when not idling.
2. Steal and access remote objects otherwise.
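The adaptive mode switch above boils down to a simple decision rule, sketched here with invented per-node work queues: stay in the locality (message-passing) mode while local work remains, and fall back to stealing remote work only when the thread would otherwise idle.

```java
import java.util.*;

// Sketch of the locality/parallelism trade-off: a per-node GC thread drains
// its local queue first, and only when idle does it steal and trace remote
// references directly. The queue layout is an assumption for illustration.
public class AdaptiveStealSketch {
    static String nextAction(Deque<Long> local, List<Deque<Long>> others) {
        if (!local.isEmpty()) return "local";     // locality mode: messages only
        for (Deque<Long> q : others)
            if (!q.isEmpty()) return "steal";     // idle: trade locality for parallelism
        return "done";                            // no work anywhere: terminate
    }

    public static void main(String[] args) {
        Deque<Long> mine = new ArrayDeque<>(List.of(1L));
        Deque<Long> peer = new ArrayDeque<>(List.of(2L));
        System.out.println(nextAction(mine, List.of(peer))); // local
        mine.clear();
        System.out.println(nextAction(mine, List.of(peer))); // steal
        peer.clear();
        System.out.println(nextAction(mine, List.of(peer))); // done
    }
}
```

The point of the rule is that stealing reintroduces remote accesses, so it is only taken when the alternative is an idle node, which is worse for overall GC throughput.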
Outline
◼ Why doesn’t GC scale?
◼ Our Solution: NumaGiC
◼ Evaluation
Evaluation
◼ Comparison of NumaGiC with:
1. Parallel Scavenge (PS): the baseline stop-the-world GC of HotSpot
2. Improved PS: PS with lock-free data structures and an interleaved heap space
3. NAPS: Improved PS plus slightly better locality, but no messages
◼ Metrics:
• GC throughput: amount of live data collected per second (GB/s); higher is better
• Application performance: relative to Improved PS; higher is better
Experiments
Hardware settings:
1. Amd48: AMD Magny-Cours with 8 nodes, 48 threads, and 256 GB of RAM
2. Intel80: Intel Xeon E7-2860 with 4 nodes, 80 threads, and 512 GB of RAM
Benchmarks (name: description; heap size on Amd48 / Intel80):
• Spark: in-memory data analytics (PageRank computation on the 1-billion-edge Friendster dataset); 110 to 160 GB / 250 to 350 GB
• Neo4j: object graph database (single-source shortest path on the 1.8-billion-edge Friendster dataset); 110 to 160 GB / 250 to 350 GB
• SPECjbb2013: business-logic server; 24 to 40 GB / 24 to 40 GB
• SPECjbb2005: business-logic server; 4 to 8 GB / 8 to 12 GB
GC Throughput (GB collected per second)
[Figure: GC throughput vs. heap size for Spark, Neo4j, SPECjbb2013, and SPECjbb2005, comparing Improved PS, NAPS, and NumaGiC, on Amd48 (labeled speedups of 5.4X and 2.9X) and on Intel80 (up to 3.6X).]
NumaGiC multiplies GC performance up to 5.4X.
GC Throughput Scalability
[Figure: GC throughput vs. number of nodes for Spark on Amd48 with a smaller dataset of 40 GB, comparing Baseline PS, Improved PS, NAPS, NumaGiC, and ideal scalability.]
Application speedup
[Figure: speedup relative to Improved PS for Spark, Neo4j, SPECjbb2013, and SPECjbb2005 under NAPS and NumaGiC, on both machines; labeled speedups reach up to 94% on one machine and up to 37% on the other.]
Conclusion
◼ Performance of data-intensive apps relies on GC performance
◼ Memory access locality has a huge effect on GC performance
◼ Enforcing locality can be detrimental to parallelism in GCs
◼ Future work: NUMA-aware concurrent GCs
Thank You!
Large multicores provide this power
But scalability is hard to achieve, because the software stack was not designed for data analytics.
[Diagram: the software stack above the hardware (cores, memory banks, I/O controllers): application; middleware (Hadoop, Spark, Neo4j, Cassandra…); language runtime (JVM, CLI, Python, R…); operating system (Linux, Windows…); hypervisor (Xen, VMWare…).]
Hypervisors are not considered in this talk: the software stack is already complex and hard to analyze!