The C4 Collector - Azul Systems

Transcript
Page 1:

Gil Tene

Balaji Iyengar

Michael Wolf

The C4 Collector

Or: the Application memory wall will remain until compaction is “solved”…

Page 2:

High Level Agenda

1. The Application “Memory Wall”

2. Generational collection for modern servers

3. C4 algorithm basics

4. Special generational considerations

5. Additional contributions

6. Results

Page 3:

The Application “Memory Wall”

Page 4:

Memory.

How many of you use heap sizes of:

• more than ½ GB?

• more than 1 GB?

• more than 2 GB?

• more than 4 GB?

• more than 10 GB?

• more than 20 GB?

• more than 50 GB?

• more than 100 GB?

Page 5:

Reality check: memory in 2011

• Retail prices of common commodity servers (June, 2011)

• 24 vCore, 96GB server ~$6.5K

• 32 vCore, 256GB server ~$20K

• 64 vCore, 512GB server ~$35K

• 96 vCore, 1TB server ~$80K

• Cheap (<$2/GB/Month), and roughly linear to ~1TB
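(For scale, under our illustrative assumption of a three-year amortization: the $6.5K, 96GB configuration works out to $6,500 / 96 GB / 36 months ≈ $1.9/GB/month, consistent with the <$2/GB/Month figure.)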

Page 6:

How much memory do applications need?

• “640K ought to be enough for anybody”

WRONG! So what’s the right number?

• 6,400K? (6.4MB)?

• 64,000K? (64MB)?

• 640,000K? (640MB)?

• 6,400,000K? (6.4GB)?

• 64,000,000K? (64GB)?

• There is no right number.

• Target moves at ~50x-100x per decade.

Page 7:

“Tiny” application history

1980: 100KB apps on a ¼ to ½ MB server

1990: 10MB apps on a 32 – 64 MB server

2000: 1GB apps on a 2 – 4 GB server

2010: ??? GB apps on a 256 GB server

Moore's Law: if transistor counts grow at ~2x every 18 months, that is ~100x every 10 years.
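(The two rates agree: a decade is 120 months, 120 / 18 ≈ 6.7 doublings, and 2^6.7 ≈ 100.)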

Page 8:

Why is there an “application memory wall”?

• GC is a clear and dominant cause

• There seems to be a practical heap size limit for applications with responsiveness requirements

─ A 100GB heap won’t crash. It just periodically “pauses” for several minutes at a time.

• [Virtually] All current commercial JVMs will exhibit multi-second pauses on a normally utilized 2-4GB heap.

─ It’s a question of “When” and “how often”, not “If”.

─ GC Tuning only moves the “when” and the “how often” around

─ [Inevitable] compaction dominates pauses

• The C4 collector is focused on removing the application memory wall in enterprise server environments

─ Focus: compaction that no longer hurts responsiveness

Page 9:

Generational Collection for modern servers…

• A modern server: 100s of GB of memory, 10s of cores, each allocating at ¼ - ½ GB/sec.

─ A practical collector must work within this envelope without incurring “significant” program pauses.

• Generational collection is a practical necessity for keeping up with modern processor throughputs

• The young generation is typically collected in a stop-the-world, full copying/promotion pass

─ But at modern server capacities, even the “young” generation would commonly hold live sets that are several GB in size

• Young generation collection needs to become either concurrent or incremental
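For scale (illustrative numbers): 32 cores each allocating ¼ - ½ GB/sec produce 8-16 GB/sec of allocation, so a young generation holding several GB fills in roughly a second, and stop-the-world copying passes over several-GB live sets would recur at that cadence.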

Page 10:

A C4 invention:

A Concurrent Young Generation Collector

• Historically, ALL generational collectors have used stop-the-world, full cycle young gen collection

• C4 is the first known collector to use non-stop-the-world young generation collection (first shipped in 2006)

• There is currently only one incremental young gen collector we are aware of: “Generational Real-Time GC: A Three-Part Invention for Young Objects. ECOOP ’07 [11]”

─ Motivated by similar wish to keep up with throughput

• There are currently no other concurrent young gen collectors that we are aware of

Page 11:

The C4 Collector

• Concurrent, compacting new generation

• Concurrent, compacting old generation

• Concurrent guaranteed-single-pass markers

─ Oblivious to mutation, insensitive to mutation rate

• Concurrent Compactors

─ Objects moved without stopping mutator

─ References remapped without stopping mutator

─ Can relocate entire generation (New, Old) in every GC cycle

• No stop-the-world fallback

─ Always compacts, and always does so concurrently

Page 12:

C4 Algorithm basics

Page 13:

C4 algorithm highlights

• Same core mechanism used for both generations
─ Concurrent Mark-Compact

• A Loaded Value Barrier (LVB) is central to the algorithm (see the sketch after this list)
─ Every heap reference is verified as “sane” when loaded
─ “Non-sane” refs are caught and fixed in a self-healing barrier

• Refs that have not yet been “marked through” are caught
─ Guaranteed single-pass concurrent marker

• Refs that point to relocated objects are caught
─ Lazily (and concurrently) remap refs, no hurry
─ Relocation and remapping are both concurrent

• Uses “quick release” to recycle memory
─ Forwarding information is kept outside of object pages
─ Physical memory released immediately upon relocation
─ “Hand-over-hand” compaction without requiring empty memory
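To make the barrier concrete, here is a minimal C sketch of what an LVB check conceptually does on every reference load. It is illustrative only: the real barrier is emitted into JIT-compiled code, and every name here (NMT_BIT, expected_nmt, in_relocating_page, lvb_slow_path) is hypothetical.

    #include <stdint.h>

    typedef uintptr_t ref_t;

    #define NMT_BIT ((ref_t)1 << 62)   /* "not marked through" metadata bit */
    extern ref_t expected_nmt;         /* expected NMT value; flips each mark cycle */
    extern int   in_relocating_page(ref_t r);          /* protected "from" page? */
    extern ref_t lvb_slow_path(ref_t r, ref_t *addr);  /* mark-through or remap,
                                                          then self-heal *addr */

    static inline ref_t lvb_load(ref_t *addr) {
        ref_t r = *addr;                                 /* plain load */
        if (r != 0 &&
            (((r & NMT_BIT) != expected_nmt) || in_relocating_page(r)))
            r = lvb_slow_path(r, addr);   /* "non-sane" ref: fix it, and heal
                                             the location it was loaded from */
        return r;                         /* caller always sees a "sane" ref */
    }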

Page 14:

The C4 GC Cycle

[Cycle diagram: the Mark, Relocate, and Remap phases; “Compact” labels the relocate/remap portion of the cycle]

Page 15:

Mark Phase

• Mark phase finds all live objects in the Java heap

• Concurrent & predictable: always completes in a single pass

• Uses LVB to defeat concurrent marking races

─ Tracks which object references have been traversed using an “NMT” (not marked through) metadata bit in each object reference
─ Any access to a not-yet-traversed reference will trigger the LVB
─ Triggered references are queued on collector work lists, and reference NMT state is corrected
─ “Self healing” corrects the memory location that the reference was loaded from

• Marker tracks the total live memory in each memory page (accounting sketched below)
─ Compaction uses this to go after the sparse pages first (but each cycle will tend to compact the entire heap…)
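A minimal sketch of that per-page live accounting, with hypothetical names (live_bytes, object_size, heap_base). The real marker is parallel, so the increment would be atomic or kept in per-thread counters:

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SHIFT 21                  /* 2MB heap pages (illustrative) */
    extern size_t    live_bytes[];         /* one live-byte counter per heap page */
    extern size_t    object_size(uintptr_t obj);
    extern uintptr_t heap_base;

    /* Called once per object as the marker traverses it. */
    void mark_account(uintptr_t obj) {
        size_t page = (obj - heap_base) >> PAGE_SHIFT;
        live_bytes[page] += object_size(obj);   /* compaction later goes after
                                                   the sparsest pages first */
    }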

Page 16:

Relocate Phase

• Compacts to reclaim heap space occupied by dead objects in “from” pages without stopping mutator

• Protects “from” pages.

• Uses LVB to support concurrent relocation and lazy remapping by triggering on any access to references to “from” pages

• Relocates any live objects to newly allocated “to” pages

• Maintains forwarding pointers outside of “from” pages

• Virtual “from” space cannot be recycled until all references to relocated objects are remapped

• “Quick Release”: Physical memory can be immediately reclaimed, and used to feed further compaction or allocation
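A sketch of the relocate flow for one page, assuming hypothetical helpers (os_protect_page, os_unmap_physical, copy_to_to_space, side_table_for). The point it illustrates is that forwarding data lives off-page, so the physical page can be handed back the moment its live objects are copied out:

    #include <stddef.h>

    struct fwd_entry { void *old_addr, *new_addr; };

    extern void  os_protect_page(void *vpage);      /* LVB traps refs into it */
    extern void  os_unmap_physical(void *vpage);    /* "quick release" */
    extern void *copy_to_to_space(void *obj, size_t sz);
    extern struct fwd_entry *side_table_for(void *vpage);

    void relocate_page(void *vpage, void **live, size_t *sizes, size_t n) {
        os_protect_page(vpage);              /* any mutator access now traps */
        struct fwd_entry *fwd = side_table_for(vpage);  /* off-page forwarding */
        for (size_t i = 0; i < n; i++) {
            fwd[i].old_addr = live[i];
            fwd[i].new_addr = copy_to_to_space(live[i], sizes[i]);
        }
        os_unmap_physical(vpage);   /* physical memory recycled immediately;
                                       the virtual page stays reserved until
                                       every stale ref has been remapped */
    }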

Page 17:

Remap Phase

• Scans all live objects in the heap

• Looks for references to previously relocated objects, and updates (“remaps”) them to point to the new object locations

• Uses LVB to support lazy remapping

─ Any access to a not-yet-remapped reference will trigger the LVB
─ Triggered references are corrected to point to the object’s new location by consulting forwarding pointers
─ “Self healing” corrects the memory location the reference was loaded from

• Overlaps with the next mark phase’s live object scan

─ Mark & Remap are executed as a single pass
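The remap slow path might look roughly like this in C (forwarding_lookup is a hypothetical side-table lookup; the CAS tolerates several threads racing to heal the same slot):

    #include <stdatomic.h>

    extern void *forwarding_lookup(void *stale);   /* object's new location */

    void *remap_and_heal(_Atomic(void *) *addr, void *stale) {
        void *fresh = forwarding_lookup(stale);
        /* Self-heal the slot the ref was loaded from, so it never traps
           again. A failed CAS means another thread already healed it. */
        atomic_compare_exchange_strong(addr, &stale, fresh);
        return fresh;
    }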

Page 18:

The C4 GC Cycle

[Cycle diagram: the Mark, Relocate, and Remap phases; “Compact” labels the relocate/remap portion of the cycle]

Page 19:

Special Generational considerations

• Multiple young-gen collections must be able to complete within a single Old-gen collection

─ Otherwise, the generational filter will be “missing”

• C4 runs both young and old generation concurrently
─ young gen effects are not atomic as seen by old gen, and vice-versa

• Old gen roots include “moving targets” in young gen
─ Every old gen cycle kicks off a “starting” young gen cycle
─ The “starting” young gen cycle produces a young-to-old root set stream
─ The old gen marker concurrently consumes the young-to-old root set stream (illustrated below)

• Some additional synchronization needed
─ E.g. young gen may need an object size, located in an old-gen class object that is being relocated
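One way to picture the young-to-old root set stream is a single-producer, single-consumer ring buffer (purely illustrative; overflow handling and the real synchronization between the two collectors are omitted):

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING 4096
    static uintptr_t     ring[RING];
    static atomic_size_t head, tail;   /* one producer, one consumer */

    /* Producer: the "starting" young-gen cycle emits young-to-old roots. */
    void yg_emit_old_root(uintptr_t ref) {
        size_t t = atomic_load(&tail);
        ring[t % RING] = ref;          /* fill the slot first...           */
        atomic_store(&tail, t + 1);    /* ...then publish it to the marker */
    }

    /* Consumer: the old-gen marker drains the stream concurrently. */
    int og_consume_root(uintptr_t *out) {
        size_t h = atomic_load(&head);
        if (h == atomic_load(&tail)) return 0;   /* drained, for now */
        *out = ring[h % RING];
        atomic_store(&head, h + 1);
        return 1;
    }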

Page 20:

Additional Contributions

• Tiered Allocation Spaces (toy sketch below)
─ C4’s concurrent compaction requires objects to not span relocation page boundaries, leading to potential space waste
─ The presented tiered allocation spaces method can contain worst case space waste to an arbitrary level
─ Tiered allocation spaces also serve to contain the worst case latency a mutator will encounter when cooperatively relocating an object

• OS Kernel enhancements
─ C4’s page life cycle relies on OS virtual memory manipulation
─ Sustaining modern server throughput requires a higher manipulation rate than most modern OSs can support (e.g. a high page unmap rate to match quick-release behavior)
─ We present new OS Kernel APIs that provide much higher throughput virtual memory manipulation
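A toy sketch of tier selection. The thresholds and the cap at 1/8 of a tier’s page size are invented for illustration; the idea is that capping object size relative to the relocation page bounds both worst-case per-page waste and the time a mutator can spend cooperatively relocating one object:

    #include <stddef.h>

    enum tier { TIER_SMALL, TIER_MEDIUM, TIER_LARGE };

    enum tier pick_tier(size_t size) {
        const size_t small_page  = 2UL << 20;    /* 2MB relocation pages    */
        const size_t medium_page = 64UL << 20;   /* 64MB pages (made up)    */
        if (size <= small_page / 8)  return TIER_SMALL;   /* waste bounded  */
        if (size <= medium_page / 8) return TIER_MEDIUM;
        return TIER_LARGE;            /* very large objects, handled apart  */
    }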

Page 21:

Results

• Focus: maintaining consistent response times while using multi-GB heaps and live sets, and sustaining multi-GB/sec allocation rates

• Surprisingly hard to find standard benchmarks that would both scale to modern server capacities and include long-running enterprise application behavior

• Workload: modified 4 warehouse SPECjbb2005 workload, changed to include a modest, 2GB object cache that churns at a slow (20MB/sec) rate, and to measure transaction response times

• Compared the observed worst case response times under different collectors

• Oh, and we ran all setups long enough to see an Old gen compaction event occur….

Page 22:

Sample throughput

• SpecJBB + Slow churning 2GB LRU Cache

• Live set is ~2.5GB across all measurements

• Allocation rate is ~1.2GB/sec across all measurements

Page 23:

Sample responsiveness improvement

• SpecJBB + Slow churning 2GB LRU Cache

• Live set is ~2.5GB across all measurements

• Allocation rate is ~1.2GB/sec across all measurements

Page 24:

Design goal: be insensitive to…

• Heap Size

• Allocation rate

• Mutation rate

• Locality

• “non-generational” behavior

• …

Page 25:

Q & A

The C4 Collector

Page 26:

Sustainable Remap Rates….

• Per 2MB of allocation: map… remap/protect… unmap…

• Need to keep up with sustained allocation rate
─ A modern x86 core will happily generate ~0.5GB/sec of garbage

• (m)remapping pages is only a small part of the GC cycle
─ Healthy GC duty cycle is ~20%, and mremap is ~5% of the GC cycle
─ So we need to sustain 100s of GB/sec in mremap rate (see the worked example below)…

• Linux remaps sustain <1GB/sec

─ Dominated by unneeded semantics

─ TLB invalidates, 4KB mappings, global locking, …

• Enhanced kernel supports >6TB/sec sustained remap rates

─ Avoids in-process implicit TLB invalidates, uses 2MB mappings
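Worked example with illustrative numbers: 10 cores at ~0.5 GB/sec each is ~5 GB/sec of sustained allocation. With GC active ~20% of wall time and mremap at ~5% of the GC cycle, mremap gets roughly 1% of wall time, so keeping pace requires on the order of 5 GB/sec / 0.01 = 500 GB/sec of mremap throughput.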

Page 27:

Sustained Remap Rates

Active threads | Mainline Linux | w/Azul Memory Module | Speedup
1              | 3.04 GB/sec    | 6.50 TB/sec          | >2,000x
2              | 1.82 GB/sec    | 6.09 TB/sec          | >3,000x
4              | 1.19 GB/sec    | 6.08 TB/sec          | >5,000x
8              | 897.65 MB/sec  | 6.29 TB/sec          | >7,000x
12             | 736.65 MB/sec  | 6.39 TB/sec          | >8,000x

Page 28:

Remap Commit Rates….

• Remap/protection must be consistent across mutator threads

• Each “batch” of relocated pages needs synchronization
─ In practical terms, we bring mutators to a safepoint, and flip pages

• Using Linux mremap(), protecting 16GB would take ~20 sec.

• Enhanced kernel supports >800TB/sec remap commit rates
─ Uses a shadow table and batch remap/protect ops API (sketched below)
─ Accumulated batch operations are not visible until committed
─ Commits shadow table using ~1 pointer copy per GB
─ Protecting 16GB takes ~22 usec…
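A rough sketch of what a shadow-table commit could look like. The structures and helper names are entirely hypothetical; the “~1 pointer copy per GB” follows from the fact that on x86-64 a single upper-level page-table entry maps 1GB:

    typedef unsigned long pt_entry_t;

    extern void tlb_flush_batched(void);   /* one shootdown for the whole
                                              batch (hypothetical helper) */

    /* Staged remap/protect ops live in shadow[]; none are mutator-visible
     * until commit, which installs 1GB of mappings per pointer copy while
     * mutators sit briefly at a safepoint. */
    void commit_shadow_batch(pt_entry_t *live, const pt_entry_t *shadow,
                             int first, int count) {
        for (int i = first; i < first + count; i++)
            live[i] = shadow[i];           /* ~1 pointer copy per GB */
        tlb_flush_batched();
    }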

Page 29:

Remap Commit Rates

Active threads | Mainline Linux         | w/Azul Memory Module      | Speedup
0              | 43.58 GB/sec (360 ms)  | 4734.85 TB/sec (3 usec)   | >100,000x
1              | 3.04 GB/sec (5 sec)    | 1488.10 TB/sec (11 usec)  | >480,000x
2              | 1.82 GB/sec (8 sec)    | 1166.04 TB/sec (14 usec)  | >640,000x
4              | 1.19 GB/sec (13 sec)   | 913.74 TB/sec (18 usec)   | >750,000x
8              | 897.65 MB/sec (18 sec) | 801.28 TB/sec (20 usec)   | >890,000x
12             | 736.65 MB/sec (21 sec) | 740.52 TB/sec (22 usec)   | >1,000,000x

* Commit rate and (time it would take to commit 16GB)

Page 30:

New programming models?

• The coherent, shared memory SMP model has endured

─ That’s how people program. Still...

• Over the past 40 years, new programming models have been proposed

─ Whenever we run into a new “architectural limit”

─ Usually involve some sort of “loosely coupled memory”

─ New models are generally useful for “mega-scale” (moving target)

─ They don’t survive (for long) within a physical machine…

• 64KB not enough? (early 1980s)

─ 20 bit segmented memory for 16 bit processors (birth of x86)

• 640KB not enough? (early 1990s)

─ 32 bit operating systems, even in the “commodity/desktop” world

Page 31:

The “hard” things to do in GC

• Robust concurrent marking
─ Refs keep changing
─ Multi-pass marking is sensitive to mutation rate
─ Weak, Soft, and Final references are “hard” to deal with concurrently

• [Concurrent] Compaction…
─ It’s not the moving of the objects…
─ It’s the fixing of all those references that point to them
─ How do you deal with a mutator looking at a stale reference?
─ If you can’t, then remapping is a STW operation

• Without solving compaction, GC won’t be solved
─ All current commercial server JVMs and GCs perform compaction
─ Azul ships the only commercial JVMs that concurrently compact

Page 32:

Garbage Collection & Compaction

You can delay it, but you cannot get rid of it

• Compaction is inevitable
─ And compacting anything requires scanning/fixing all references
─ Usually the worst possible thing that can happen in GC

• You can delay compaction, but not get rid of it

• Delay tactics focus on getting “easy empty space” first
─ This is the focus for the vast majority of GC tuning

• Most objects die young
─ So collect young objects only, as much as possible
─ But eventually, some old dead objects must be reclaimed

• Most old dead space can be reclaimed without moving it
─ So track dead space in lists, and reuse it in place
─ But eventually, space gets fragmented, and needs to be moved

• Eventually, all collectors compact the heap

Page 33:

HotSpot™ CMS Collector mechanism classification

• Stop-the-world compacting new gen (ParNew)

• Mostly Concurrent, non-compacting old gen (CMS)

─ Mostly Concurrent marking

─ Mark concurrently while mutator is running

─ Track mutations in card marks

─ Revisit mutated cards (repeat as needed)

─ Stop-the-world to catch up on mutations, ref processing, etc.

─ Concurrent Sweeping

─ Does not Compact (maintains free list, does not move objects)

• Fallback to Full Collection (Stop the world).

─ Used for Compaction, etc.
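For reference alongside the G1 flags on the next page, this CMS combination is selected in HotSpot with:

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC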

Page 34:

HotSpot “Garbage First” (aka G1) Collector mechanism classification

• Experimental

-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC

• Stop-the-world compacting new gen

• Mostly Concurrent, old gen marker

─ Mostly Concurrent marking
─ Tracks inter-region relationships in remembered sets

• Stop-the-world incremental compacting old gen
─ Objective: “Avoid, as much as possible, having a Full GC…”
─ Compact sets of regions that can be scanned in limited time
─ Delay compaction of popular objects, popular regions

• Fallback to Full Collection (Stop the world)
─ Used for compacting popular objects, popular regions

