Post on 19-Mar-2020
transcript
Gil Tene
Balaji Iyengar
Michael Wolf
The C4 Collector
Or: the Application memory wall will remain until compaction is “solved”…
©2011 Azul Systems, Inc.
High Level Agenda
1. The Application “Memory Wall”
2. Generational collection for modern servers
3. C4 algorithm basics
4. Special generational considerations
5. Additional contributions
6. Results
The Application “Memory Wall”
Memory.
• How many of you use heap sizes of:
─ more than ½ GB?
─ more than 1 GB?
─ more than 2 GB?
─ more than 4 GB?
─ more than 10 GB?
─ more than 20 GB?
─ more than 50 GB?
─ more than 100 GB?
Reality check: memory in 2011
• Retail prices of common commodity servers (June, 2011)
• 24 vCore, 96GB server ~$6.5K
• 32 vCore, 256GB server ~$20K
• 64 vCore, 512GB server ~$35K
• 96 vCore, 1TB server ~$80K
• Cheap (<$2/GB/Month), and roughly linear to ~1TB
How much memory do applications need?
• “640K ought to be enough for anybody”
WRONG! So what’s the right number?
• 6,400K? (6.4MB)?
• 64,000K? (64MB)?
• 640,000K? (640MB)?
• 6,400,000K? (6.4GB)?
• 64,000,000K? (64GB)?
• There is no right number.
• Target moves at ~50x-100x per decade.
“Tiny” application history
1980: 100KB apps on a ¼ to ½ MB server
1990: 10MB apps on a 32 – 64 MB server
2000: 1GB apps on a 2 – 4 GB server
2010: ??? GB apps on 256 GB
Moore’s Law:
If transistor counts grow at
• ~2x every 18 months
• ~100x every 10 yrs
Why is there an “application memory wall”?
• GC is a clear and dominant cause
• There seems to be a practical heap size limit for applications with responsiveness requirements
─ A 100GB heap won’t crash. It just periodically “pauses” for several minutes at a time.
• [Virtually] all current commercial JVMs will exhibit multi-second pauses on a normally utilized 2-4GB heap.
─ It’s a question of “When” and “how often”, not “If”.
─ GC Tuning only moves the “when” and the “how often” around
─ [Inevitable] compaction dominates pauses
• The C4 collector is focused on removing the application memory wall in enterprise server environments
─ Focus: compaction that no longer hurts responsiveness
Generational Collection for modern servers…
• A modern server: 100s of GB of memory, 10s of cores, each allocating at ¼ - ½ GB/sec.
─ A practical collector must work within this envelope without incurring “significant” program pauses.
• Generational collection is a practical necessity for keeping up with modern processor throughputs
• The young generation is typically collected in a stop-the-world, full copying/promotion pass
─ But at modern server capacities, even the “young” generation would commonly hold live sets that are several GB in size
• Young generation collection needs to become either concurrent or incremental
A C4 invention:
A Concurrent Young Generation Collector
• Historically, ALL generational collectors have used stop-the-world, full cycle young gen collection
• C4 is the first known collector to use a non-stop-the-world young generation (first shipped in 2006)
• There is currently only one incremental young gen collector we are aware of: “Generational Real-Time GC: A Three-Part Invention for Young Objects. ECOOP ’07 [11]”
─ Motivated by similar wish to keep up with throughput
• There are currently no other concurrent young gen collectors that we are aware of
The C4 Collector
• Concurrent, compacting new generation
• Concurrent, compacting old generation
• Concurrent guaranteed-single-pass markers
─ Oblivious to mutation, insensitive to mutation rate
• Concurrent Compactors
─ Objects moved without stopping mutator
─ References remapped without stopping mutator
─ Can relocate entire generation (New, Old) in every GC cycle
• No stop-the-world fallback
─ Always compacts, and always does so concurrently
C4 Algorithm basics
C4 algorithm highlights
• Same core mechanism used for both generations
─ Concurrent Mark-Compact
• A Loaded Value Barrier (LVB) is central to the algorithm
─ Every heap reference is verified as “sane” when loaded
─ “Non-sane” refs are caught and fixed in a self-healing barrier
• Refs that have not yet been “marked through” are caught
─ Guaranteed single-pass concurrent marker
• Refs that point to relocated objects are caught
─ Lazily (and concurrently) remap refs, no hurry
─ Relocation and remapping are both concurrent
• Uses “quick release” to recycle memory
─ Forwarding information is kept outside of object pages
─ Physical memory released immediately upon relocation
─ “Hand-over-hand” compaction without requiring empty memory
The C4 GC Cycle
(cycle diagram)
Mark Phase
• Mark phase finds all live objects in the Java heap
• Concurrent & predictable: always completes in a single pass
• Uses LVB to defeat concurrent marking races
─ Tracks object references that have been traversed by using an “NMT” (not marked through) metadata bit in each object reference
─ Any access to a not-yet-traversed reference will trigger the LVB
─ Triggered references are queued on collector work lists, and reference NMT state is corrected
─ “Self healing” corrects the memory location that the reference was loaded from
• Marker tracks the total live memory in each memory page
─ Compaction uses this to go after the sparse pages first (but each cycle will tend to compact the entire heap…)
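As a concrete illustration, the NMT mechanism can be sketched as a toy, single-threaded simulation. All names here are hypothetical; the real collector packs the NMT bit into each 64-bit reference and runs concurrently with mutator threads:

```python
EXPECTED_NMT = True  # the "marked through" polarity for this GC cycle

class Ref:
    """A heap reference carrying an NMT (not-marked-through) metadata bit."""
    def __init__(self, target):
        self.target = target
        self.nmt = not EXPECTED_NMT  # starts out not yet marked through

class Obj:
    def __init__(self, name):
        self.name = name
        self.fields = []  # outgoing Refs

work_list = []  # collector work list
marked = set()

def lvb_load(ref):
    """Loaded Value Barrier: every reference load goes through this check."""
    if ref.nmt != EXPECTED_NMT:
        work_list.append(ref.target)  # queue the target for the marker
        ref.nmt = EXPECTED_NMT        # self-heal the loaded-from location
    return ref.target

def mark(roots):
    for r in roots:
        lvb_load(r)
    while work_list:
        obj = work_list.pop()
        if obj.name not in marked:
            marked.add(obj.name)
            for f in obj.fields:
                lvb_load(f)  # tracing children flips their NMT bits
```

Because the barrier flips the NMT bit the first time a reference is loaded, the same reference can never re-trigger, which is what makes the single marking pass guaranteed regardless of mutation rate.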
Relocate Phase
• Compacts to reclaim heap space occupied by dead objects in “from” pages without stopping mutator
• Protects “from” pages.
• Uses LVB to support concurrent relocation and lazy remapping by triggering on any access to references to “from” pages
• Relocates any live objects to newly allocated “to” pages
• Maintains forwarding pointers outside of “from” pages
• Virtual “from” space cannot be recycled until all references to relocated objects are remapped
• “Quick Release”: Physical memory can be immediately reclaimed, and used to feed further compaction or allocation
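A minimal sketch of the relocate phase with quick release, under simplified assumptions: dict-based "pages" and integer addresses are stand-ins, since real C4 pages are 2MB and forwarding data lives in off-page side arrays.

```python
forwarding = {}      # old address -> new address, kept OUTSIDE the from-page
free_physical = []   # physical pages recycled for allocation/compaction

class Page:
    def __init__(self, pid, objects=None):
        self.pid = pid
        self.objects = dict(objects or {})  # address -> payload
        self.protected = False

def relocate(from_page, to_page, next_addr):
    from_page.protected = True  # loads of refs into this page now trap (LVB)
    for old_addr, payload in from_page.objects.items():
        to_page.objects[next_addr] = payload
        forwarding[old_addr] = next_addr  # forwarding entry stored off-page
        next_addr += 1
    # Quick release: the physical page is reusable immediately, even though
    # stale refs to the old *virtual* addresses are still in the heap.
    from_page.objects.clear()
    free_physical.append(from_page.pid)
    return next_addr

src = Page(7, {100: 'x', 101: 'y'})
dst = Page(8)
relocate(src, dst, next_addr=2000)
```

Because forwarding entries live outside the from-page, nothing in the page is needed after the copy, so its physical memory can feed the next compaction step "hand-over-hand".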
Remap Phase
• Scans all live objects in the heap
• Looks for references to previously relocated objects, and updates (“remaps”) them to point to the new object locations
• Uses LVB to support lazy remapping
─ Any access to a not-yet-remapped reference will trigger the LVB
─ Triggered references are corrected to point to the object’s new location by consulting forwarding pointers
─ “Self healing” corrects the memory location the reference was loaded from
• Overlaps with the next mark phase’s live object scan
─ Mark & Remap are executed as a single pass
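The lazy, self-healing remap can be reduced to a few lines (hypothetical names; a real holder is an object field or stack slot, not a dict):

```python
def lvb_remap(holder, field, forwarding):
    """Remap-flavor LVB: fix a stale ref and heal where it was loaded from."""
    addr = holder[field]
    if addr in forwarding:       # ref points into a relocated "from" page
        addr = forwarding[addr]  # consult the off-page forwarding entry
        holder[field] = addr     # self-heal the loaded-from location
    return addr

obj = {'f': 100}                 # an object holding one stale reference
fwd = {100: 2000}                # old address 100 was relocated to 2000
first = lvb_remap(obj, 'f', fwd)
second = lvb_remap(obj, 'f', fwd)  # fast path: location already healed
```

The slow path runs at most once per memory location, which is why remapping can proceed lazily and concurrently with no hurry.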
Special Generational considerations
• Multiple young-gen collections must be able to complete within a single old-gen collection
─ Otherwise, the generational filter will be “missing”
• C4 runs both young and old generation concurrently
─ Young gen effects are not atomic as seen by old gen, and vice versa
• Old gen roots include “moving targets” in young gen
─ Every old gen cycle kicks off a “starting” young gen cycle
─ The “starting” young gen cycle produces a young-to-old root set stream
─ Old gen marker concurrently consumes the young-to-old root set stream
• Some additional synchronization needed
─ E.g. young gen may need an object size, located in an old-gen class object that is being relocated
Additional Contributions
• Tiered Allocation Spaces
─ C4’s concurrent compaction requires objects to not span relocation page boundaries, leading to potential space waste
─ The presented tiered allocation spaces method can contain worst-case space waste to an arbitrary level
─ Tiered allocation spaces also serve to contain the worst-case latency a mutator will encounter when cooperatively relocating an object
• OS Kernel enhancements
─ C4’s page life cycle relies on OS virtual memory manipulation
─ Sustaining modern server throughput requires a higher manipulation rate than most modern OSs can support (e.g. a high page unmap rate to match quick-release behavior)
─ We present new OS kernel APIs that provide much higher throughput virtual memory manipulation
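One way to see why tiers bound the waste (my own back-of-envelope, not a construction from the paper): if a tier only admits objects of at most S bytes into pages of P = k·S bytes, a page stops accepting objects only when fewer than S bytes remain, so per-page waste is under S = P/k, a fraction below 1/k.

```python
def worst_case_waste_fraction(page_size, max_obj_size):
    # A page is "closed" only when the leftover space can no longer fit
    # the largest allowed object; that leftover is the wasted space.
    worst_leftover = max_obj_size - 1
    return worst_leftover / page_size

MB = 1024 ** 2
# Single tier: up-to-1MB objects in 2MB pages -> nearly 50% worst-case waste.
single_tier = worst_case_waste_fraction(2 * MB, 1 * MB)
# A dedicated tier with 16MB pages for those objects bounds waste below 1/16.
tiered = worst_case_waste_fraction(16 * MB, 1 * MB)
```

Adding tiers with larger page-size-to-object-size ratios drives the bound down to any desired level, which is the "arbitrary level" claim above.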
Results
• Focus: maintaining consistent response times while using multi-GB heaps and live sets, and sustaining multi-GB/sec allocation rates
• Surprisingly hard to find standard benchmarks that would both scale to modern server capacities and include long-running enterprise application behavior
• Workload: modified 4 warehouse SPECjbb2005 workload, changed to include a modest, 2GB object cache that churns at a slow (20MB/sec) rate, and to measure transaction response times
• Compared the observed worst case response times under different collectors
• Oh, and we ran all setups long enough to see an Old gen compaction event occur….
Sample throughput
• SpecJBB + Slow churning 2GB LRU Cache
• Live set is ~2.5GB across all measurements
• Allocation rate is ~1.2GB/sec across all measurements
Sample responsiveness improvement
• SpecJBB + Slow churning 2GB LRU Cache
• Live set is ~2.5GB across all measurements
• Allocation rate is ~1.2GB/sec across all measurements
Design goal: be insensitive…
• Heap Size
• Allocation rate
• Mutation rate
• Locality
• “non-generational” behavior
• …
Q & A
The C4 Collector
Sustainable Remap Rates….
• Per 2MB of allocation: map… remap/protect… unmap…
• Need to keep up with sustained allocation rate
─ A modern x86 core will happily generate ~0.5GB/sec of garbage
• (m)remapping pages is only a small part of the GC cycle
─ Healthy GC duty cycle at ~20%, mremap is ~5% of GC cycle
─ So need to sustain 100s of GB/sec in mremap rate…
• Linux remaps sustain <1GB/sec
─ Dominated by unneeded semantics
─ TLB invalidates, 4KB mappings, global locking, …
• Enhanced kernel supports >6TB/sec sustained remap rates
─ Avoids in-process implicit TLB invalidates, uses 2MB mappings
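The "100s of GB/sec" figure follows from the slide's own numbers, assuming an example core count and reading "~5% of GC cycle" as 5% of the cycle's wall-clock time (both assumptions are mine):

```python
cores = 48                    # assumed example server, not a slide figure
garbage_per_core = 0.5        # GB/sec of garbage per core (slide's figure)
alloc_rate = cores * garbage_per_core   # GB/sec the collector must keep up with

mremap_share = 0.05           # fraction of the cycle available for mremap work
required_mremap_rate = alloc_rate / mremap_share  # sustained GB/sec of mremap
```

At ~480 GB/sec required versus the <1 GB/sec mainline Linux sustains, the gap of several hundred times is the motivation for the kernel enhancements.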
Sustained Remap Rates
Active threads | Mainline Linux | w/Azul Memory Module | Speedup
1              | 3.04 GB/sec    | 6.50 TB/sec          | >2,000x
2              | 1.82 GB/sec    | 6.09 TB/sec          | >3,000x
4              | 1.19 GB/sec    | 6.08 TB/sec          | >5,000x
8              | 897.65 MB/sec  | 6.29 TB/sec          | >7,000x
12             | 736.65 MB/sec  | 6.39 TB/sec          | >8,000x
Remap Commit Rates….
• Remap/protection must be consistent across mutator threads
• Each “batch” of relocated pages needs synchronization
─ In practical terms, we bring mutators to a safepoint, and flip pages
• Using Linux mremap(), protecting 16GB would take ~20 sec.
• Enhanced kernel supports >800TB/sec remap commit rates
─ Uses shadow table and batch remap/protect ops API
─ Accumulated batch operations are not visible until committed
─ Commits shadow table using ~1 pointer copy per GB
─ Protecting 16GB takes ~22 usec…
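The quoted figures are mutually consistent; a quick sanity check, treating 1 TB as 1000 GB and using the 12-thread rows from the table that follows:

```python
gb_to_commit = 16
mainline_rate_gb = 0.73665   # GB/sec, mainline Linux at 12 threads
shadow_commit_time = 22e-6   # sec to commit 16GB via the shadow table

mainline_time = gb_to_commit / mainline_rate_gb            # ~21.7 sec
shadow_rate_tb = gb_to_commit / shadow_commit_time / 1000  # ~727 TB/sec
```

The ~21.7 sec matches the table's "(21 sec)", and ~727 TB/sec is close to the table's 740.52 TB/sec (the 22 usec figure is rounded).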
Remap Commit Rates
Active threads | Mainline Linux         | w/Azul Memory Module      | Speedup
0              | 43.58 GB/sec (360 ms)  | 4734.85 TB/sec (3 usec)   | >100,000x
1              | 3.04 GB/sec (5 sec)    | 1488.10 TB/sec (11 usec)  | >480,000x
2              | 1.82 GB/sec (8 sec)    | 1166.04 TB/sec (14 usec)  | >640,000x
4              | 1.19 GB/sec (13 sec)   | 913.74 TB/sec (18 usec)   | >750,000x
8              | 897.65 MB/sec (18 sec) | 801.28 TB/sec (20 usec)   | >890,000x
12             | 736.65 MB/sec (21 sec) | 740.52 TB/sec (22 usec)   | >1,000,000x

* Commit rate and (time it would take to commit 16GB)
New programming models?
• The coherent, shared memory SMP model has endured
─ That’s how people program. Still...
• In the past 40 years, new programming models have been proposed
─ Whenever we run into a new “architectural limit”
─ Usually involve some sort of “loosely coupled memory”
─ New models are generally useful for “mega-scale” (moving target)
─ They don’t survive (for long) within a physical machine…
• 64KB not enough? (early 1980s)
─ 20 bit segmented memory for 16 bit processors (birth of x86)
• 640KB not enough? (early 1990s)
─ 32 bit operating systems, even in the “commodity/desktop” world
The “hard” things to do in GC
• Robust concurrent marking
─ Refs keep changing
─ Multi-pass marking is sensitive to mutation rate
─ Weak, Soft, and Final references are “hard” to deal with concurrently
• [Concurrent] Compaction…
─ It’s not the moving of the objects…
─ It’s the fixing of all those references that point to them
─ How do you deal with a mutator looking at a stale reference?
─ If you can’t, then remapping is a STW operation
• Without solving compaction, GC won’t be solved
─ All current commercial server JVMs and GCs perform compaction
─ Azul ships the only commercial JVMs that concurrently compact
Garbage Collection & Compaction
You can delay it, but you cannot get rid of it

• Compaction is inevitable
─ And compacting anything requires scanning/fixing all references
─ Usually the worst possible thing that can happen in GC
• You can delay compaction, but not get rid of it
• Delay tactics focus on getting “easy empty space” first
─ This is the focus for the vast majority of GC tuning
• Most objects die young
─ So collect young objects only, as much as possible
─ But eventually, some old dead objects must be reclaimed
• Most old dead space can be reclaimed without moving it
─ So track dead space in lists, and reuse it in place
─ But eventually, space gets fragmented, and needs to be moved
• Eventually, all collectors compact the heap
HotSpot™ CMS Collector mechanism classification
• Stop-the-world compacting new gen (ParNew)
• Mostly Concurrent, non-compacting old gen (CMS)
─ Mostly Concurrent marking
─ Mark concurrently while mutator is running
─ Track mutations in card marks
─ Revisit mutated cards (repeat as needed)
─ Stop-the-world to catch up on mutations, ref processing, etc.
─ Concurrent Sweeping
─ Does not Compact (maintains free list, does not move objects)
• Fallback to Full Collection (Stop the world).
─ Used for Compaction, etc.
HotSpot “Garbage First” (aka G1) Collector mechanism classification
• Experimental
-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC
• Stop-the-world compacting new gen
• Mostly Concurrent, old gen marker
─ Mostly Concurrent marking
─ Tracks inter-region relationships in remembered sets
• Stop-the-world incremental compacting old gen
─ Objective: “Avoid, as much as possible, having a Full GC…”
─ Compact sets of regions that can be scanned in limited time
─ Delay compaction of popular objects, popular regions
• Fallback to Full Collection (Stop the world)
─ Used for compacting popular objects, popular regions