Post on 19-Mar-2020
transcript
Gil Tene
Balaji Iyengar
Michael Wolf
The C4 Collector
Or: the Application memory wall will remain until compaction is “solved”…
©2011 Azul Systems, Inc.
High Level Agenda
1. The Application “Memory Wall”
2. Generational collection for modern servers
3. C4 algorithm basics
4. Special generational considerations
5. Additional contributions
6. Results
The Application “Memory Wall”
Memory.
• How many of you use heap sizes of:
─ more than ½ GB?
─ more than 1 GB?
─ more than 2 GB?
─ more than 4 GB?
─ more than 10 GB?
─ more than 20 GB?
─ more than 50 GB?
─ more than 100 GB?
Reality check: memory in 2011
• Retail prices of common commodity servers (June, 2011)
• 24 vCore, 96GB server ~$6.5K
• 32 vCore, 256GB server ~$20K
• 64 vCore, 512GB server ~$35K
• 96 vCore, 1TB server ~$80K
• Cheap (<$2/GB/Month), and roughly linear to ~1TB
How much memory do applications need?
• “640K ought to be enough for anybody”
WRONG! So what’s the right number?
• 6,400K? (6.4MB)?
• 64,000K? (64MB)?
• 640,000K? (640MB)?
• 6,400,000K? (6.4GB)?
• 64,000,000K? (64GB)?
• There is no right number.
• Target moves at ~50x-100x per decade.
“Tiny” application history
1980: 100KB apps on a ¼ to ½ MB server
1990: 10MB apps on a 32 – 64 MB server
2000: 1GB apps on a 2 – 4 GB server
2010: ??? GB apps on 256 GB
Moore’s Law:
If transistor counts grow at
• ~2x every 18 months
• ~100x every 10 yrs
Why is there an “application memory wall”?
• GC is a clear and dominant cause
• There seems to be a practical heap size limit for applications with responsiveness requirements
─ A 100GB heap won’t crash. It just periodically “pauses” for several minutes at a time.
• [Virtually] all current commercial JVMs will exhibit multi-second pauses on a normally utilized 2-4GB heap.
─ It’s a question of “When” and “how often”, not “If”.
─ GC Tuning only moves the “when” and the “how often” around
─ [Inevitable] compaction dominates pauses
• The C4 collector is focused on removing the application memory wall in enterprise server environments
─ Focus: compaction that no longer hurts responsiveness
Generational Collection for modern servers…
• A modern server: 100s of GB of memory, 10s of cores, each allocating at ¼ - ½ GB/sec.
─ A practical collector must work within this envelope without incurring “significant” program pauses.
• Generational collection is a practical necessity for keeping up with modern processor throughputs
• The young generation is typically collected in a stop-the-world, full copying/promotion pass
─ But at modern server capacities, even the “young” generation would commonly hold live sets that are several GB in size
• Young generation collection needs to become either concurrent or incremental
A C4 invention:
A Concurrent Young Generation Collector
• Historically, ALL generational collectors have used stop-the-world, full cycle young gen collection
• C4 is the first known collector to use a non-stop-the-world young generation (first shipped in 2006)
• There is currently only one incremental young gen collector we are aware of: “Generational Real-Time GC: A Three-Part Invention for Young Objects. ECOOP ’07 [11]”
─ Motivated by similar wish to keep up with throughput
• There are currently no other concurrent young gen collectors that we are aware of
The C4 Collector
• Concurrent, compacting new generation
• Concurrent, compacting old generation
• Concurrent guaranteed-single-pass markers
─ Oblivious to mutation, insensitive to mutation rate
• Concurrent Compactors
─ Objects moved without stopping mutator
─ References remapped without stopping mutator
─ Can relocate entire generation (New, Old) in every GC cycle
• No stop-the-world fallback
─ Always compacts, and always does so concurrently
C4 Algorithm basics
C4 algorithm highlights
• Same core mechanism used for both generations
─ Concurrent Mark-Compact
• A Loaded Value Barrier (LVB) is central to the algorithm
─ Every heap reference is verified as “sane” when loaded
─ “Non-sane” refs are caught and fixed in a self-healing barrier
• Refs that have not yet been “marked through” are caught
─ Guaranteed single-pass concurrent marker
• Refs that point to relocated objects are caught
─ Lazily (and concurrently) remap refs, no hurry
─ Relocation and remapping are both concurrent
• Uses “quick release” to recycle memory
─ Forwarding information is kept outside of object pages
─ Physical memory released immediately upon relocation
─ “Hand-over-hand” compaction without requiring empty memory
The C4 GC Cycle
(cycle diagram)
Mark Phase
• Mark phase finds all live objects in the Java heap
• Concurrent & predictable: always completes in a single pass
• Uses LVB to defeat concurrent marking races
─ Tracks object references that have been traversed by using an “NMT” (not marked through) metadata bit in each object reference
─ Any access to a not-yet-traversed reference will trigger the LVB
─ Triggered references are queued on collector work lists, and reference NMT state is corrected
─ “Self healing” corrects the memory location that the reference was loaded from
• Marker tracks the total live memory in each memory page
─ Compaction uses this to go after the sparse pages first (but each cycle will tend to compact the entire heap…)
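As a concrete illustration, the NMT mechanism can be sketched as a toy, single-threaded simulation. All names here are hypothetical; the real collector packs the NMT bit into each 64-bit reference and runs concurrently with mutator threads:

```python
EXPECTED_NMT = True  # the "marked through" polarity for this GC cycle

class Ref:
    """A heap reference carrying an NMT (not-marked-through) metadata bit."""
    def __init__(self, target):
        self.target = target
        self.nmt = not EXPECTED_NMT  # starts out not yet marked through

class Obj:
    def __init__(self, name):
        self.name = name
        self.fields = []  # outgoing Refs

work_list = []  # collector work list
marked = set()

def lvb_load(ref):
    """Loaded Value Barrier: every reference load goes through this check."""
    if ref.nmt != EXPECTED_NMT:
        work_list.append(ref.target)  # queue the target for the marker
        ref.nmt = EXPECTED_NMT        # self-heal the loaded-from location
    return ref.target

def mark(roots):
    for r in roots:
        lvb_load(r)
    while work_list:
        obj = work_list.pop()
        if obj.name not in marked:
            marked.add(obj.name)
            for f in obj.fields:
                lvb_load(f)  # tracing children flips their NMT bits
```

Because the barrier flips the NMT bit the first time a reference is loaded, the same reference can never re-trigger, which is what makes the single marking pass guaranteed regardless of mutation rate.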
Relocate Phase
• Compacts to reclaim heap space occupied by dead objects in “from” pages without stopping mutator
• Protects “from” pages.
• Uses LVB to support concurrent relocation and lazy remapping by triggering on any access to references to “from” pages
• Relocates any live objects to newly allocated “to” pages
• Maintains forwarding pointers outside of “from” pages
• Virtual “from” space cannot be recycled until all references to relocated objects are remapped
• “Quick Release”: Physical memory can be immediately reclaimed, and used to feed further compaction or allocation
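A minimal sketch of the relocate phase with quick release, under simplified assumptions: dict-based "pages" and integer addresses are stand-ins, since real C4 pages are 2MB and forwarding data lives in off-page side arrays.

```python
forwarding = {}      # old address -> new address, kept OUTSIDE the from-page
free_physical = []   # physical pages recycled for allocation/compaction

class Page:
    def __init__(self, pid, objects=None):
        self.pid = pid
        self.objects = dict(objects or {})  # address -> payload
        self.protected = False

def relocate(from_page, to_page, next_addr):
    from_page.protected = True  # loads of refs into this page now trap (LVB)
    for old_addr, payload in from_page.objects.items():
        to_page.objects[next_addr] = payload
        forwarding[old_addr] = next_addr  # forwarding entry stored off-page
        next_addr += 1
    # Quick release: the physical page is reusable immediately, even though
    # stale refs to the old *virtual* addresses are still in the heap.
    from_page.objects.clear()
    free_physical.append(from_page.pid)
    return next_addr

src = Page(7, {100: 'x', 101: 'y'})
dst = Page(8)
relocate(src, dst, next_addr=2000)
```

Because forwarding entries live outside the from-page, nothing in the page is needed after the copy, so its physical memory can feed the next compaction step "hand-over-hand".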
Remap Phase
• Scans all live objects in the heap
• Looks for references to previously relocated objects, and updates (“remaps”) them to point to the new object locations
• Uses LVB to support lazy remapping
─ Any access to a not-yet-remapped reference will trigger the LVB
─ Triggered references are corrected to point to the object’s new location by consulting forwarding pointers
─ “Self healing” corrects the memory location the reference was loaded from
• Overlaps with the next mark phase’s live object scan
─ Mark & Remap are executed as a single pass
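The lazy, self-healing remap can be reduced to a few lines (hypothetical names; a real holder is an object field or stack slot, not a dict):

```python
def lvb_remap(holder, field, forwarding):
    """Remap-flavor LVB: fix a stale ref and heal where it was loaded from."""
    addr = holder[field]
    if addr in forwarding:       # ref points into a relocated "from" page
        addr = forwarding[addr]  # consult the off-page forwarding entry
        holder[field] = addr     # self-heal the loaded-from location
    return addr

obj = {'f': 100}                 # an object holding one stale reference
fwd = {100: 2000}                # old address 100 was relocated to 2000
first = lvb_remap(obj, 'f', fwd)
second = lvb_remap(obj, 'f', fwd)  # fast path: location already healed
```

The slow path runs at most once per memory location, which is why remapping can proceed lazily and concurrently with no hurry.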
Special Generational considerations
• Multiple young-gen collections must be able to complete within a single old-gen collection
─ Otherwise, the generational filter will be “missing”
• C4 runs both young and old generation concurrently
─ Young gen effects are not atomic as seen by old gen, and vice versa
• Old gen roots include “moving targets” in young gen
─ Every old gen cycle kicks off a “starting” young gen cycle
─ The “starting” young gen cycle produces a young-to-old root set stream
─ Old gen marker concurrently consumes the young-to-old root set stream
• Some additional synchronization needed
─ E.g. young gen may need an object size, located in an old-gen class object that is being relocated
Additional Contributions
• Tiered Allocation Spaces
─ C4’s concurrent compaction requires objects to not span relocation page boundaries, leading to potential space waste
─ The presented tiered allocation spaces method can contain worst-case space waste to an arbitrary level
─ Tiered allocation spaces also serve to contain the worst-case latency a mutator will encounter when cooperatively relocating an object
• OS Kernel enhancements
─ C4’s page life cycle relies on OS virtual memory manipulation
─ Sustaining modern server throughput requires a higher manipulation rate than most modern OSs can support (e.g. a high page unmap rate to match quick-release behavior)
─ We present new OS kernel APIs that provide much higher throughput virtual memory manipulation
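One way to see why tiers bound the waste (my own back-of-envelope, not a construction from the paper): if a tier only admits objects of at most S bytes into pages of P = k·S bytes, a page stops accepting objects only when fewer than S bytes remain, so per-page waste is under S = P/k, a fraction below 1/k.

```python
def worst_case_waste_fraction(page_size, max_obj_size):
    # A page is "closed" only when the leftover space can no longer fit
    # the largest allowed object; that leftover is the wasted space.
    worst_leftover = max_obj_size - 1
    return worst_leftover / page_size

MB = 1024 ** 2
# Single tier: up-to-1MB objects in 2MB pages -> nearly 50% worst-case waste.
single_tier = worst_case_waste_fraction(2 * MB, 1 * MB)
# A dedicated tier with 16MB pages for those objects bounds waste below 1/16.
tiered = worst_case_waste_fraction(16 * MB, 1 * MB)
```

Adding tiers with larger page-size-to-object-size ratios drives the bound down to any desired level, which is the "arbitrary level" claim above.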
Results
• Focus: maintaining consistent response times while using multi-GB heaps and live sets, and sustaining multi-GB/sec allocation rates
• Surprisingly hard to find standard benchmarks that would both scale to modern server capacities and include long-running enterprise application behavior
• Workload: modified 4 warehouse SPECjbb2005 workload, changed to include a modest, 2GB object cache that churns at a slow (20MB/sec) rate, and to measure transaction response times
• Compared the observed worst case response times under different collectors
• Oh, and we ran all setups long enough to see an Old gen compaction event occur….
Sample throughput
• SpecJBB + Slow churning 2GB LRU Cache
• Live set is ~2.5GB across all measurements
• Allocation rate is ~1.2GB/sec across all measurements
Sample responsiveness improvement
• SpecJBB + Slow churning 2GB LRU Cache
• Live set is ~2.5GB across all measurements
• Allocation rate is ~1.2GB/sec across all measurements
Design goal: be insensitive…
• Heap Size
• Allocation rate
• Mutation rate
• Locality
• “non-generational” behavior
• …
Q & A
The C4 Collector
Sustainable Remap Rates….
• Per 2MB of allocation: map… remap/protect… unmap…
• Need to keep up with sustained allocation rate
─ A modern x86 core will happily generate ~0.5GB/sec of garbage
• (m)remapping pages is only a small part of the GC cycle
─ Healthy GC duty cycle at ~20%, mremap is ~5% of GC cycle
─ So need to sustain 100s of GB/sec in mremap rate…
• Linux remaps sustain <1GB/sec
─ Dominated by unneeded semantics
─ TLB invalidates, 4KB mappings, global locking, …
• Enhanced kernel supports >6TB/sec sustained remap rates
─ Avoids in-process implicit TLB invalidates, uses 2MB mappings
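The "100s of GB/sec" figure follows from the slide's own numbers, assuming an example core count and reading "~5% of GC cycle" as 5% of the cycle's wall-clock time (both assumptions are mine):

```python
cores = 48                    # assumed example server, not a slide figure
garbage_per_core = 0.5        # GB/sec of garbage per core (slide's figure)
alloc_rate = cores * garbage_per_core   # GB/sec the collector must keep up with

mremap_share = 0.05           # fraction of the cycle available for mremap work
required_mremap_rate = alloc_rate / mremap_share  # sustained GB/sec of mremap
```

At ~480 GB/sec required versus the <1 GB/sec mainline Linux sustains, the gap of several hundred times is the motivation for the kernel enhancements.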
Sustained Remap Rates
Active threads | Mainline Linux | w/Azul Memory Module | Speedup
1              | 3.04 GB/sec    | 6.50 TB/sec          | >2,000x
2              | 1.82 GB/sec    | 6.09 TB/sec          | >3,000x
4              | 1.19 GB/sec    | 6.08 TB/sec          | >5,000x
8              | 897.65 MB/sec  | 6.29 TB/sec          | >7,000x
12             | 736.65 MB/sec  | 6.39 TB/sec          | >8,000x
Remap Commit Rates….
• Remap/protection must be consistent across mutator threads
• Each “batch” of relocated pages needs synchronization
─ In practical terms, we bring mutators to a safepoint, and flip pages
• Using Linux mremap(), protecting 16GB would take ~20 sec.
• Enhanced kernel supports >800TB/sec remap commit rates
─ Uses shadow table and batch remap/protect ops API
─ Accumulated batch operations are not visible until committed
─ Commits shadow table using ~1 pointer copy per GB
─ Protecting 16GB takes ~22 usec…
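The quoted figures are mutually consistent; a quick sanity check, treating 1 TB as 1000 GB and using the 12-thread rows from the table that follows:

```python
gb_to_commit = 16
mainline_rate_gb = 0.73665   # GB/sec, mainline Linux at 12 threads
shadow_commit_time = 22e-6   # sec to commit 16GB via the shadow table

mainline_time = gb_to_commit / mainline_rate_gb            # ~21.7 sec
shadow_rate_tb = gb_to_commit / shadow_commit_time / 1000  # ~727 TB/sec
```

The ~21.7 sec matches the table's "(21 sec)", and ~727 TB/sec is close to the table's 740.52 TB/sec (the 22 usec figure is rounded).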
Remap Commit Rates
Active threads | Mainline Linux         | w/Azul Memory Module      | Speedup
0              | 43.58 GB/sec (360 ms)  | 4734.85 TB/sec (3 usec)   | >100,000x
1              | 3.04 GB/sec (5 sec)    | 1488.10 TB/sec (11 usec)  | >480,000x
2              | 1.82 GB/sec (8 sec)    | 1166.04 TB/sec (14 usec)  | >640,000x
4              | 1.19 GB/sec (13 sec)   | 913.74 TB/sec (18 usec)   | >750,000x
8              | 897.65 MB/sec (18 sec) | 801.28 TB/sec (20 usec)   | >890,000x
12             | 736.65 MB/sec (21 sec) | 740.52 TB/sec (22 usec)   | >1,000,000x

* Commit rate and (time it would take to commit 16GB)
New programming models?
• The coherent, shared memory SMP model has endured
─ That’s how people program. Still...
• In the past 40 years, new programming models have been proposed
─ Whenever we run into a new “architectural limit”
─ Usually involve some sort of “loosely coupled memory”
─ New models are generally useful for “mega-scale” (moving target)
─ They don’t survive (for long) within a physical machine…
• 64KB not enough? (early 1980s)
─ 20 bit segmented memory for 16 bit processors (birth of x86)
• 640KB not enough? (early 1990s)
─ 32 bit operating systems, even in the “commodity/desktop” world
The “hard” things to do in GC
• Robust concurrent marking
─ Refs keep changing
─ Multi-pass marking is sensitive to mutation rate
─ Weak, Soft, and Final references are “hard” to deal with concurrently
• [Concurrent] Compaction…
─ It’s not the moving of the objects…
─ It’s the fixing of all those references that point to them
─ How do you deal with a mutator looking at a stale reference?
─ If you can’t, then remapping is a STW operation
• Without solving compaction, GC won’t be solved
─ All current commercial server JVMs and GCs perform compaction
─ Azul ships the only commercial JVMs that concurrently compact
Garbage Collection & Compaction
You can delay it, but you cannot get rid of it

• Compaction is inevitable
─ And compacting anything requires scanning/fixing all references
─ Usually the worst possible thing that can happen in GC
• You can delay compaction, but not get rid of it
• Delay tactics focus on getting “easy empty space” first
─ This is the focus for the vast majority of GC tuning
• Most objects die young
─ So collect young objects only, as much as possible
─ But eventually, some old dead objects must be reclaimed
• Most old dead space can be reclaimed without moving it
─ So track dead space in lists, and reuse it in place
─ But eventually, space gets fragmented, and needs to be moved
• Eventually, all collectors compact the heap
HotSpot™ CMS Collector mechanism classification
• Stop-the-world compacting new gen (ParNew)
• Mostly Concurrent, non-compacting old gen (CMS)
─ Mostly Concurrent marking
─ Mark concurrently while mutator is running
─ Track mutations in card marks
─ Revisit mutated cards (repeat as needed)
─ Stop-the-world to catch up on mutations, ref processing, etc.
─ Concurrent Sweeping
─ Does not Compact (maintains free list, does not move objects)
• Fallback to Full Collection (Stop the world).
─ Used for Compaction, etc.
HotSpot “Garbage First” (aka G1) Collector mechanism classification
• Experimental
-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC
• Stop-the-world compacting new gen
• Mostly Concurrent, old gen marker
─ Mostly Concurrent marking
─ Tracks inter-region relationships in remembered sets
• Stop-the-world incremental compacting old gen
─ Objective: “Avoid, as much as possible, having a Full GC…”
─ Compact sets of regions that can be scanned in limited time
─ Delay compaction of popular objects, popular regions
• Fallback to Full Collection (Stop the world)
─ Used for compacting popular objects, popular regions