©2012 Azul Systems, Inc.
Understanding Java
Garbage Collection
and what you can do about it
Graham Thomas, EMEA Technical Manager, Azul Systems
A presentation at Orange11
July 5, 2012
This Talk’s Purpose / Goals
This talk is focused on GC education
This is not a “how to use flags to tune a collector” talk
This is a talk about how the “GC machine” works
Purpose: Once you understand how it works, you can
use your own brain...
You’ll learn just enough to be dangerous...
The “Azul makes the world’s greatest GC” stuff will only
come at the end, I promise...
White Papers to Accompany Slides
Listed in order of complexity:
1_ Understanding Java Garbage Collection v1.pdf
2_ Azul Pauseless Garbage Collection - wp_pgc_zing_v2.pdf
3_ C4-The Continuously Concurrent Compacting Collector - c4_paper_acm.pdf
4_ AzulVmemMetricsMRI.pdf (www.managedruntime.org/files/downloads/AzulVmemMetricsMRI.pdf)
About Azul Systems
We deal with Java performance issues on a daily
basis
Our solutions focus on consistent response time under load
We enable practical, full use of hardware resources
As a result, we often help characterize problems
In many/most cases, it’s not the database, app, or
network - it’s the JVM, or the system under it…
GC Pauses, OS or Virtualization “hiccups”, swapping, etc.
We use and provide simple tools to help discover
what’s going on in a JVM and the underlying platform
Focus on measuring JVM/Platform behavior with your app
Non-intrusive, no code changes, easy to add
About Azul
Supporting mission-critical deployments around the globe
About Azul – 2002 to Now
We make scalable Virtual Machines
Have built “whatever it takes to get the job done” since 2002
3 generations of custom SMP multi-core HW (Vega)
Now pure software for commodity x86 (Zing)
“Industry firsts” in garbage collection, elastic memory, Java virtualization, memory scale
Vega:
• 54 cores per chip
• Up to 16 chips (864 cores)
• 640 GB heaps
C4
Azul Systems Vega processor
Each Vega chip contains 54 fully independent processor cores and an integrated quad-channel memory controller.
Each appliance contains up to 16 Vega chips, providing up to 864 total processor cores (96, 192, 384, 768 core models are also available)
Each processor core is a 64-bit RISC processor with optimizations for multi-threaded VM execution
Three banks of four ECC memory modules are attached to each Vega chip, for a total of 192 memory modules in a 16-chip configuration
Heaps up to 640 GB
Cache coherent, uniform memory access through a passive, non-blocking interconnect mesh
205 GBps aggregate memory bandwidth
544 GBps aggregate interconnect bandwidth
Instruction-level support for concurrent, pauseless VM garbage collection
Dual network processors for system control and I/O communications
Zing 5.2 Requirements
Processors (dual socket preferred):
Intel Nehalem Xeon 5500, 56xx, 6500, 7500, E7-2xxx, E7-4xxx, E7-8xxx or E5-26xx (with Intel VT virtualization enabled)
AMD Opteron 2300, 2400, 4100, 6100, 8300 or 8400 (with AMD-V virtualization enabled)
If virtualised, it is critical that Zing LX is run with reserved cores and memory
Memory and CPU cores:
32GB or greater
6 or more cores
OS – 64-bit only (Linux distro specific: rpm / deb packages):
Red Hat Enterprise Linux / CentOS: RHEL 5.2 and later / RHEL 6.x | CentOS 5.2 and later / CentOS 6
SUSE Linux Enterprise Server: SLES 11 SP1 / SLES 11 SP2
Ubuntu Linux (server / desktop): LTS 10.04 / LTS 12.04
Java SE 6 and most J2SE 5.0 apps*
* J2SE 5.0 applications that use functionality not explicitly removed in Java SE 6
Java update level from Sun Microsystems: J2SE v1.6.0 update 31 (“headless mode” only – no GUI)
High level agenda
GC fundamentals and key mechanisms
Some GC terminology & metrics
Classifying currently available collectors
The “Application Memory Wall” problem
The C4 collector: What an actual solution looks like...
Memory use
• How many of you use heap sizes of:
• more than ½ GB?
• more than 1 GB?
• more than 2 GB?
• more than 4 GB?
• more than 10 GB?
• more than 20 GB?
• more than 50 GB?
• more than 100 GB?
Why should you care about GC?
The story of the good little architect
A good architect must, first and foremost, be able to
impose their architectural choices on the project...
Early in Azul’s concurrent collector days, we encountered
an application exhibiting 18 second pauses
Upon investigation, we found the collector was performing 10s of millions of object finalizations per GC cycle
Every single class written in the project had a finalizer
The only work the finalizers did was nulling every reference field
The right discipline for a C++ ref-counting environment
The wrong discipline for a precise garbage collected environment
* We have since made reference processing fully concurrent...
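The anti-pattern in that story looks roughly like this. A minimal reconstruction for illustration only: the class and field names (`Order`, `Customer`, `Item`) are invented, not from the actual project.

```java
import java.util.List;

// Anti-pattern: a finalizer whose only job is to null reference fields.
// Sensible under C++ ref-counting; pure overhead under a tracing GC,
// which finds unreachable references without any help. It also forces
// the JVM to register and later run a finalizer for every instance.
class Customer {}
class Item {}

class Order {
    private Customer customer;
    private List<Item> items;

    Order(Customer customer, List<Item> items) {
        this.customer = customer;
        this.items = items;
    }

    boolean hasReferences() {
        return customer != null && items != null;
    }

    @SuppressWarnings({"deprecation", "removal"})
    @Override
    protected void finalize() {
        customer = null;   // the collector would find this dead anyway
        items = null;
    }
}
```

With millions of such objects per cycle, the finalization queue itself becomes the bottleneck, not the marking or reclamation.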
Trying to solve GC problems in application
architecture is like throwing knives
You probably shouldn’t do it blindfolded
It takes practice and understanding to get it right
You can get very good at it, but do you really want to?
Will all the code you leverage be as good as yours?
Examples:
Object pooling
Off heap storage
Distributed heaps
...
(In most cases, you end up building your own garbage collector)
Most of what people seem to “know”
about Garbage Collection is wrong
In many cases, it’s much better than you may think
GC is extremely efficient. Much more so than malloc()
Allocating a new object in HotSpot takes ~10 machine instructions
malloc() in C averages between 60 and 100 instructions per call
(See http://www.ibm.com/developerworks/java/library/j-jtp09275/index.html)
Dead objects cost nothing to collect
GC will find all the dead objects (including cyclic graphs)
In many cases, it’s much worse than you may think
Yes, it really does stop for ~1 sec per live GB.
No, GC does not mean you can’t have memory leaks
No, those pauses you eliminated from your 20 minute test are not gone
Some GC Terminology
A Basic Terminology example:
What is a concurrent collector?
A Concurrent Collector performs garbage collection work
concurrently with the application’s own execution
Generally uses multiple collector threads; mutator threads are not stopped
A Parallel Collector uses multiple CPUs to perform
garbage collection
But stops all Mutator threads during collection (aka stop-the-world)
Classifying a collector’s operation
A Concurrent Collector performs garbage collection work concurrently with the application’s own execution
A Parallel Collector uses multiple CPUs to perform garbage collection
An Incremental collector performs a garbage collection operation or phase as a series of smaller discrete operations with (potentially long) gaps in between
A Stop-the-World collector performs garbage collection while the application is completely stopped
“Mostly” means sometimes it isn’t (usually means a different fallback mechanism exists)
Precise vs. Conservative Collection
A Collector is Conservative if it is unaware of some object
references at collection time, or is unsure about whether a
field is a reference or not (e.g., is an integer a pointer?)
A Collector is Precise if it can fully identify and process all
object references at the time of collection
A collector MUST be precise in order to move objects
The COMPILERS need to produce a lot of information (oopmaps)
All commercial server JVMs use precise collectors
All commercial server JVMs use some form of a moving collector
Safepoints
A GC Safepoint is a point or range in a thread’s execution
where the collector can identify all the references in that
thread’s execution stack
“Safepoint” and “GC Safepoint” are often used interchangeably
But there are other types of safepoints, including ones that require more
information than a GC safepoint does (e.g. deoptimization)
“Bringing a thread to a safepoint” is the act of getting a
thread to reach a safepoint and not execute past it
Close to, but not exactly the same as “stop at a safepoint”
e.g. JNI: you can keep running in, but not past the safepoint
Safepoint opportunities are (or should be) frequent
In a Global Safepoint all threads are at a Safepoint
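The cooperative polling idea behind "bringing a thread to a safepoint" can be sketched deterministically. This is a single-threaded model only: real JVMs emit a poll (typically a read of a protected page) at JIT-chosen sites such as loop back-edges, and all names below are invented.

```java
// Models a mutator that checks a "safepoint requested" flag at every
// loop back-edge (a safepoint opportunity), so it can never execute
// past a requested safepoint.
class SafepointDemo {
    private volatile boolean safepointRequested = false;
    private int stoppedAtStep = -1;

    void requestSafepoint() { safepointRequested = true; }

    // Runs up to `steps` units of work; a request raised at step
    // `requestAt` (standing in for the VM) is honored at that poll.
    void run(int steps, int requestAt) {
        for (int step = 0; step < steps; step++) {
            if (step == requestAt) requestSafepoint();
            if (safepointRequested) {      // the poll
                stoppedAtStep = step;
                return;
            }
            // ... one unit of mutator work between polls ...
        }
    }

    int stoppedAtStep() { return stoppedAtStep; }
}
```

The gap between polls is why safepoint opportunities must be frequent: a long poll-free stretch (e.g. a counted loop with no back-edge poll) delays every global safepoint.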
What’s common to all
precise GC mechanisms?
Identify the live objects in the memory heap
Reclaim resources held by dead objects
Periodically relocate live objects
Examples:
Mark/Sweep/Compact (common for Old Generations)
Copying collector (common for Young Generations)
Mark (aka “Trace”)
Start from “roots” (thread stacks, statics, etc.)
“Paint” anything you can reach as “live”
At the end of a mark pass:
all reachable objects will be marked “live”
all non-reachable objects will be marked “dead” (aka “non-live”)
Note: work is generally linear to “live set”
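The mark pass above can be sketched in a few lines. A toy model, with an invented `Node` graph standing in for heap objects:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy tracing marker: start from the roots and "paint" every reachable
// object live. Already-marked nodes are skipped, so cyclic graphs
// terminate, and total work is linear to the live set.
class MarkDemo {
    static final class Node {
        final List<Node> refs = new ArrayList<>();
        boolean live = false;
    }

    static void mark(List<Node> roots) {
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (n.live) continue;     // seen before: cycle or shared ref
            n.live = true;
            pending.addAll(n.refs);
        }
    }
}
```

Anything never reached from a root simply stays unmarked, which is how cyclic garbage is collected with no extra machinery.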
Sweep
Scan through the heap, identify “dead” objects, and track them somehow (usually in some form of free list)
Note: work is generally linear to heap size
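A sweep over a toy heap makes the cost difference from marking visible. Here the "heap" is just an invented array of mark bits, one per slot:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sweep: walk EVERY slot of the heap and put dead ones on a free
// list -- work is linear to heap size, not to the live set.
class SweepDemo {
    static List<Integer> sweep(boolean[] markBits) {
        List<Integer> freeList = new ArrayList<>();
        for (int slot = 0; slot < markBits.length; slot++) {
            if (!markBits[slot]) freeList.add(slot);  // dead: reusable
            else markBits[slot] = false;              // live: clear mark for next cycle
        }
        return freeList;
    }
}
```

Note the loop visits dead and live slots alike; that is why sweep time grows with the heap even when the live set is tiny.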
Compact
Over time, heap will get “swiss cheesed”: contiguous dead
space between objects may not be large enough to fit new
objects (aka “fragmentation”)
Compaction moves live objects together to reclaim contiguous
empty space (aka “relocate”)
Compaction has to correct all object references to point to new
object locations (aka “remap”)
Remap scan must cover all references that could possibly
point to relocated objects
Note: work is generally linear to “live set”
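The relocate + remap pair above can be modeled with slot indices standing in for addresses. A sketch only: the single-reference-per-object heap layout is invented for brevity:

```java
// Toy compaction: slide live slots to the front ("relocate") and fix
// every reference through a forwarding table ("remap").
class CompactDemo {
    // live[i]: is slot i live; refTo[i]: slot that slot i points at (-1 = none).
    // Returns the compacted reference array, remapped to new slot numbers.
    static int[] compact(boolean[] live, int[] refTo) {
        int[] forwarding = new int[live.length];
        int next = 0;
        for (int i = 0; i < live.length; i++)         // relocate pass
            forwarding[i] = live[i] ? next++ : -1;
        int[] compacted = new int[next];
        next = 0;
        for (int i = 0; i < live.length; i++)         // remap pass
            if (live[i])
                compacted[next++] = (refTo[i] >= 0) ? forwarding[refTo[i]] : -1;
        return compacted;
    }
}
```

The expensive part in a real collector is exactly the remap pass: it must visit every reference that could point at a moved object, which is why compaction is hard to do without stopping the world.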
Copy
A copying collector moves all live objects from a “from” space
to a “to” space & reclaims the “from” space
At start of copy, all objects are in “from” space and all
references point to “from” space.
Start from “root” references, copy any reachable object to “to”
space, correcting references as we go
At end of copy, all objects are in “to” space, and all references
point to “to” space
Note: work generally linear to “live set”
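A Cheney-style scan is the classic way to implement the copy pass above; this sketch uses an invented one-reference `Obj` shape to keep it short:

```java
import java.util.ArrayList;
import java.util.List;

// Cheney-style copy: evacuate the roots, then scan to-space, copying
// each reachable object exactly once and correcting refs as we go.
class CopyDemo {
    static final class Obj {
        Obj ref;          // single outgoing reference, for brevity
        Obj forwarded;    // set once this object has a to-space copy
    }

    static List<Obj> collect(Obj[] roots) {
        List<Obj> toSpace = new ArrayList<>();
        for (int i = 0; i < roots.length; i++)
            roots[i] = evacuate(roots[i], toSpace);
        // Scan pointer: fix references inside already-copied objects;
        // anything still in from-space is evacuated on first sight.
        for (int scan = 0; scan < toSpace.size(); scan++) {
            Obj o = toSpace.get(scan);
            o.ref = evacuate(o.ref, toSpace);
        }
        return toSpace;
    }

    private static Obj evacuate(Obj o, List<Obj> toSpace) {
        if (o == null) return null;
        if (o.forwarded == null) {     // first visit: copy to to-space
            Obj copy = new Obj();
            copy.ref = o.ref;          // still points into from-space
            o.forwarded = copy;
            toSpace.add(copy);
        }
        return o.forwarded;
    }
}
```

Dead objects are never touched at all, which is why copying cost tracks the live set rather than the heap size.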
Mark/Sweep/Compact, Copy, Mark/Compact
Copy requires 2x the max. live set to be reliable
Mark/Compact [typically] requires 2x the max. live set in order to
fully recover garbage in each cycle
Mark/Sweep/Compact only requires 1x (plus some)
Copy and Mark/Compact are linear only to live set
Mark/Sweep/Compact linear (in sweep) to heap size
Mark/Sweep/(Compact) may be able to avoid some moving work
Copying is [typically] “monolithic”
Generational Collection
“Weak Generational Hypothesis”: most objects die young
Focus collection efforts on young generation:
Use a moving collector: work is linear to the live set
The live set in the young generation is a small % of the space
Promote objects that live long enough to older generations
Only collect older generations as they fill up
“Generational filter” reduces rate of allocation into older generations
Tends to be (order of magnitude) more efficient
Great way to keep up with high allocation rate
Practical necessity for keeping up with processor throughput
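The "generational filter" claim above is simple arithmetic: only survivors reach the old generation. The 5% survival fraction in the test below is an illustrative assumption, not a measured figure:

```java
// Back-of-envelope model: old-generation allocation pressure is the
// young-gen allocation rate scaled by the fraction of objects that
// survive long enough to be promoted.
class GenerationalFilter {
    static double promotionRateGBps(double allocRateGBps, double survivalFraction) {
        return allocRateGBps * survivalFraction;  // only survivors are promoted
    }
}
```

A weak generational hypothesis holding at, say, 95% young death means the old-generation collector faces an order of magnitude less allocation than the mutator actually performs.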
Generational Collection /2
Requires a “Remembered set”: a way to track all references into
the young generation from the outside
Remembered set is also part of “roots” for young generation
collection
No need for 2x the live set: Can “spill over” to old gen
Usually want to keep surviving objects in young generation for a
while before promoting them to the old generation
Immediate promotion can dramatically reduce gen. filter efficiency
Waiting too long to promote can dramatically increase copying work
How does the remembered set work?
Generational collectors require a “Remembered set”: a way to
track all references into the young generation from the outside
Each store of a NewGen reference into an OldGen object
needs to be intercepted and tracked
Common technique: “Card Marking”
A bit (or byte) indicating a word (or region) in OldGen is “suspect”
Write barrier used to track references
Common technique (e.g. HotSpot): blind stores on reference write
Variants: precise vs. imprecise card marking,
conditional vs. non-conditional
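The card-marking mechanism above can be sketched directly. The 512-byte card size is an illustrative choice (HotSpot also uses 512-byte cards, but the class and method names here are invented), and this shows the unconditional "blind store" variant:

```java
// Card table sketch: a write barrier marks the card covering the
// stored-into OldGen field on every reference store, so young-gen
// collection only scans dirty cards for old-to-young references.
class CardTable {
    private static final int CARD_SHIFT = 9;      // 512-byte cards
    private final byte[] cards;

    CardTable(int oldGenBytes) {
        cards = new byte[(oldGenBytes >> CARD_SHIFT) + 1];
    }

    // Invoked (conceptually) on every "oldObj.field = youngObj" store.
    void writeBarrier(int fieldAddress) {
        cards[fieldAddress >> CARD_SHIFT] = 1;    // mark card "suspect"
    }

    boolean isDirty(int address) {
        return cards[address >> CARD_SHIFT] != 0;
    }
}
```

The conditional variant would first test whether the stored value is actually a young-gen reference; the blind variant trades extra dirty cards for a cheaper barrier on every store.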
The typical combos
in commercial server JVMs
Young generation usually uses a copying collector
Young generation is usually Monolithic stop-the-world
Old generation usually uses Mark/Sweep/Compact
Old generation may be STW, or Concurrent,
or mostly-Concurrent, or Incremental-STW,
or mostly-Incremental-STW
Useful terms for discussing garbage collection
Mutator: Your program…
Parallel: Can use multiple CPUs
Concurrent: Runs concurrently with program
Pause: A time duration in which the mutator is not running any code
Stop-The-World (STW): Something that is done in a pause
Monolithic Stop-The-World: Something that must be done in its entirety in a single pause
Generational: Collects young objects and long-lived objects separately
Promotion: Allocation into old generation
Marking: Finding all live objects
Sweeping: Locating the dead objects
Compaction: Defragments the heap; moves objects in memory, remaps all affected references, frees contiguous memory regions
Useful metrics for discussing garbage collection
Cycle time: How long it takes the collector to free up memory
Marking time: How long it takes the collector to find all live objects
Sweep time: How long it takes to locate dead objects (relevant for Mark/Sweep)
Compaction time: How long it takes to free up memory by relocating objects (relevant for Mark/Compact)
Heap population (aka live set): How much of your heap is alive
Allocation rate: How fast you allocate
Mutation rate: How fast your program updates references in memory
Heap shape: The shape of the live object graph (hard to quantify as a metric...)
Object lifetime: How long objects live
Empty memory and CPU/throughput
Two Intuitive limits
If we had infinite empty memory, we would never have to
collect, and GC would take 0% of the CPU time
If we had exactly 1 byte of empty memory at all times, the
collector would have to work “very hard”, and GC would take
100% of the CPU time
GC CPU % will follow a rough 1/x curve between these two limit
points, dropping as the amount of memory increases.
Empty memory needs (empty memory == CPU power)
The amount of empty memory in the heap is the dominant
factor controlling the amount of GC work
For both Copy and Mark/Compact collectors, the amount of
work per cycle is linear to live set
The amount of memory recovered per cycle is equal to the
amount of unused memory: (heap size) - (live set)
The collector has to perform a GC cycle when the empty
memory runs out
A Copy or Mark/Compact collector’s efficiency doubles with
every doubling of the empty memory.
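The efficiency claim above follows from the two facts on this slide. A back-of-envelope model (units are GB; the specific numbers in the test are arbitrary illustrations):

```java
// Per-cycle collector work is linear to the live set, and each cycle
// recovers (heap - live) of memory. So collector work per GB allocated
// is live / (heap - live) -- and doubling the empty memory halves it,
// i.e. doubles the collector's efficiency.
class EmptyMemoryModel {
    static double gcWorkPerAllocatedGB(double liveGB, double heapGB) {
        double emptyGB = heapGB - liveGB;   // recovered per cycle
        return liveGB / emptyGB;            // cycle cost amortized over recovery
    }
}
```

This is the sense in which empty memory literally is CPU power: the same allocation rate costs half the GC work when the empty portion of the heap is doubled.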
What empty memory controls
Empty memory controls efficiency (amount of collector work
needed per amount of application work performed)
Empty memory controls the frequency of pauses (if the
collector performs any Stop-the-world operations)
Empty memory DOES NOT control pause times (only their
frequency)
In Mark/Sweep/Compact collectors that pause for sweeping,
more empty memory means less frequent but LARGER
pauses
Some non monolithic-STW stuff
Concurrent Marking
Mark all reachable objects as “live”, but object graph is
“mutating” under us.
Classic concurrent marking race: mutator may move reference
that has not yet been seen by the marker, into an object that has
already been visited
If not intercepted or prevented in some way, will corrupt the heap
Example technique: track mutations, multi-pass marking
Track reference mutations during mark (e.g. in card table)
Re-visit all mutated references (and track new mutations)
When set is “small enough”, do a STW catch up (mostly concurrent)
Note: work grows with mutation rate, may fail to finish
Incremental Compaction
Track cross-region remembered sets (which region points to
which)
To compact a single region, only need to scan regions that point
into it to remap all potential references
Identify region sets that fit in limited time
Each such set of regions is a Stop-the-World increment
Safe to run application between (but not within) increments
Note: work can grow with the square of the heap size
The number of regions pointing into a single region is generally linear
to the heap size (the number of regions in the heap)
Classifying common collectors
The typical combos
in commercial server JVMs
Young generation usually uses a copying collector
Young generation is usually Monolithic stop-the-world
Old generation usually uses a Mark/Sweep/Compact collector
Old generation may be STW, or Concurrent, or mostly-Concurrent, or
Incremental-STW, or mostly-Incremental-STW
HotSpot™ ParallelGC Collector mechanism classification
Monolithic Stop-the-world copying NewGen
Monolithic Stop-the-world Mark/Sweep/Compact OldGen
HotSpot™ CMS Collector mechanism classification
ConcMarkSweepGC (CMS)
Monolithic Stop-the-world copying NewGen (ParNew)
Mostly Concurrent, non-compacting OldGen (CMS)
Mostly Concurrent marking
Mark concurrently while mutator is running
Track mutations in card marks
Revisit mutated cards (repeat as needed)
Stop-the-world to catch up on mutations, ref processing, etc.
Concurrent Sweeping
Does not Compact (maintains free list, does not move objects)
Fallback to Full Collection (Monolithic Stop the world).
Used for Compaction, etc.
HotSpot™ G1GC Collector mechanism classification
Monolithic Stop-the-world copying NewGen
Mostly Concurrent, OldGen marker
Mostly Concurrent marking
Stop-the-world to catch up on mutations, ref processing, etc.
Tracks inter-region relationships in remembered sets
Stop-the-world mostly incremental compacting old gen
Objective: “Avoid, as much as possible, having a Full GC…”
Compact sets of regions that can be scanned in limited time
Delay compaction of popular objects, popular regions
Fallback to Full Collection (Monolithic Stop the world).
Used for compacting popular objects, popular regions, etc.
Some Collectors

Collector | Young Generation | Old Generation
Oracle HotSpot ParallelGC | Monolithic stop-the-world, copying | Monolithic stop-the-world, Mark/Sweep/Compact
Oracle HotSpot CMS (Concurrent Mark/Sweep) | Monolithic stop-the-world, copying | Mostly concurrent, non-compacting, fall back to monolithic stop-the-world
Oracle HotSpot G1 (Garbage First) | Monolithic stop-the-world, copying | Mostly concurrent marker, mostly incremental compaction, fall back to monolithic stop-the-world
Oracle JRockit Dynamic Garbage Collector * | Monolithic stop-the-world, copying | Mark/Sweep (can choose mostly concurrent or parallel), incremental compaction, fall back to monolithic stop-the-world
IBM J9 Balanced * | Monolithic stop-the-world, copying | Mostly concurrent marker, mostly incremental compaction, fall back to monolithic stop-the-world
IBM J9 optthruput * | Monolithic stop-the-world, copying | Parallel Mark/Sweep, stop-the-world compaction
Zing C4 (Continuously Concurrent Compacting Collector) | Concurrent and always compacting | Concurrent and always compacting

* Can choose a single or 2-generation collector
The “Application Memory Wall”
Reality check: servers in 2012
16 vCore, 96GB server ≈ $5K
16 vCore, 256GB server ≈ $9K
24 vCore, 384GB server ≈ $14K
32 vCore, 1TB server ≈ $35K
Retail prices, major web server store (US $, May 2012)
Cheap (< $1/GB/Month), and roughly linear to ~1TB
10s to 100s of GB/sec of memory bandwidth
The Application Memory Wall
A simple observation:
Application instances appear to be unable to
make effective use of modern server memory
capacities
The size of application instances as a % of a
server’s capacity is rapidly dropping
Maybe 1+ to 4+ GB is simply enough?
We hope not (or we’ll all have to look for new jobs soon)
Plenty of evidence of pent up demand for more heap:
Common use of lateral scale across machines
Common use of “lateral scale” within machines
Use of “external” memory with growing data sets
Databases certainly keep growing
External data caches (memcache, JCache, Data grids)
Continuous work on the never ending distribution problem
More and more reinvention of NUMA
Bring data to compute, bring compute to data
How much memory do applications need?
“640KB ought to be enough for anybody” WRONG!
“I've said some stupid things and some wrong things, but not that. No one involved in computers would ever say that a certain amount of memory is enough for all time …” - Bill Gates, 1996
So what’s the right number?
6,400K?
64,000K?
640,000K?
6,400,000K?
64,000,000K?
There is no right number
Target moves at 50x-100x per decade
“Tiny” application history
1980: 100KB apps on a ¼ to ½ MB server
1990: 10MB apps on a 32 – 64 MB server
2000: 1GB apps on a 2 – 4 GB server
2010: ??? GB apps on a 256 GB server
Assuming Moore’s Law means “transistor counts grow at ≈2x every ≈18 months”, it also means memory size grows ≈100x every 10 years
“Tiny”: would be “silly” to distribute
[Chart label: Application Memory Wall]
What is causing the
Application Memory Wall?
Garbage Collection is a clear and dominant cause
There seem to be practical heap size limits for
applications with responsiveness requirements
[Virtually] All current commercial JVMs will exhibit a multi-second pause on a normally utilized 2-4GB heap.
It’s a question of “When” and “How often”, not “If”.
GC tuning only moves the “when” and the “how often” around
Root cause: The link between scale and responsiveness
What quality of GC is responsible
for the Application Memory Wall?
It is NOT about overhead or efficiency:
CPU utilization, bottlenecks, memory consumption and utilization
It is NOT about speed
Average speeds, 90%, 99% speeds, are all perfectly fine
It is NOT about minor GC events (right now)
GC events in the 10s of msec are usually tolerable for most apps
It is NOT about the frequency of very large pauses
It is ALL about the worst observable pause behavior
People avoid building/deploying visibly broken systems
Application Characterization
Mantra:
Throughput without response time is meaningless
Sustainable throughput is all that matters
Sustainable Throughput: The throughput achieved while safely maintaining service levels
[Chart labels: “Unsustainable Throughput”, “GC Problems”]
Framing the discussion:
Garbage Collection at modern server scales
Modern Servers have 100s of GB of memory
Each modern x86 core (when actually used) produces
garbage at a rate of ¼ - ½ GB/sec +
That’s many GB/sec of allocation in a server
Monolithic stop-the-world operations are the cause of the
current Application Memory Wall
How to ignore Monolithic-STW GC events: Delaying the inevitable
Delay tactics focus on getting “easy empty space” first
This is the focus for the vast majority of GC tuning
Most objects die young [Generational]
So collect young objects only, as much as possible
But eventually, some old dead objects must be reclaimed
Most old dead space can be reclaimed without moving it
[e.g. CMS] track dead space in lists, and reuse it in place
But eventually, space gets fragmented, and needs to be moved
Much of the heap is not “popular” [e.g. G1, “Balanced”]
A non-popular region will only be pointed to from a small % of the heap
So compact non-popular regions in short stop-the-world pauses
But eventually, popular objects and regions need to be compacted
Young generation pauses are only small because heaps are tiny
A 200GB heap will regularly have several GB of live young objects
How can we break through the Application Memory Wall?
We need to solve the right problems
Focus on the causes of the Application Memory Wall
Root cause: Scale is artificially limited by responsiveness
Responsiveness must be unlinked from scale
Heap size, Live Set size, Allocation rate, Mutation rate
Responsiveness must be continually sustainable
Can’t ignore “rare” events
Eliminate all Stop-The-World Fallbacks
At modern server scales, any STW fall back is a failure
Problems that need solving (areas where the state of the art needs improvement)
Robust Concurrent Marking
In the presence of high mutation and allocation rates
Cover modern runtime semantics (e.g. weak refs)
Compaction that is not monolithic-stop-the-world
Stay responsive while compacting many-GB heaps
Must be robust: not just a tactic to delay STW compaction
[current “incremental STW” attempts fall short on robustness]
Non-monolithic-stop-the-world Generational collection
Stay responsive while promoting multi-GB data spikes
Concurrent or “incremental STW” may both be ok
[ Surprisingly little work done in this specific area]
The things that seem “hard” to do in GC
Robust concurrent marking
References keep changing
Multi-pass marking is sensitive to mutation rate
Weak, Soft, Final references “hard” to deal with concurrently
[Concurrent] Compaction…
It’s not the moving of the objects…
It’s the fixing of all those references that point to them
How do you deal with a Mutator looking at a stale reference?
If you can’t, then remapping is a [monolithic] STW operation
Young Generation collection at scale
Young Generation collection is generally Monolithic stop-the-world
Young generation pauses are only small because heaps are tiny
A 100GB heap will regularly see multi-GB of live young stuff…
Azul’s “C4” Collector: Continuously Concurrent Compacting Collector
Concurrent, compacting new generation
Concurrent, compacting old generation
Concurrent guaranteed-single-pass marker
Oblivious to mutation rate
Concurrent ref (weak, soft, final) processing
Concurrent Compactor
Objects moved without stopping mutator
References remapped without stopping mutator
Can relocate entire generation (New, Old) in every GC cycle
No stop-the-world fallback
Always compacts, and always does so concurrently
Sample responsiveness improvement
• SpecJBB + slow-churning 2GB LRU cache
• Live set is ~2.5GB across all measurements
• Allocation rate is ~1.2GB/sec across all measurements
Instance capacity test: “Fat Portal”
CMS: peaks at ~3GB / 45 concurrent users
* LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
Instance capacity test: “Fat Portal”
C4: still smooth @ 800 concurrent users
Some fun with jHiccup
[jHiccup charts: idle app on a quiet system, on a busy system, on a dedicated system]
A good use for jHiccup
[jHiccup charts: Oracle HotSpot CMS, 1GB in an 8GB heap; Oracle HotSpot CMS, 4GB in an 18GB heap; Oracle HotSpot G1, 1GB in an 8GB heap; Oracle HotSpot ParallelGC, 1GB in an 8GB heap; Oracle HotSpot CMS vs. Zing 5, each with 1GB in an 8GB heap]
Java GC tuning is “hard”…
Examples of actual command line GC tuning parameters:

java -Xmx12g -XX:MaxPermSize=64M -XX:PermSize=32M -XX:MaxNewSize=2g
  -XX:NewSize=1g -XX:SurvivorRatio=128 -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=0
  -XX:CMSInitiatingOccupancyFraction=60 -XX:+CMSParallelRemarkEnabled
  -XX:+UseCMSInitiatingOccupancyOnly -XX:ParallelGCThreads=12
  -XX:LargePageSizeInBytes=256m …

java -Xms8g -Xmx8g -Xmn2g -XX:PermSize=64M -XX:MaxPermSize=256M
  -XX:-OmitStackTraceInFastThrow -XX:SurvivorRatio=2 -XX:-UseAdaptiveSizePolicy
  -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled
  -XX:+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled
  -XX:CMSMaxAbortablePrecleanTime=10000 -XX:+UseCMSInitiatingOccupancyOnly
  -XX:CMSInitiatingOccupancyFraction=63 -XX:+UseParNewGC -Xnoclassgc …
The complete guide to Zing GC tuning:
java -Xmx40g
C4 Algorithm fundamentals
C4 algorithm highlights:
Same core mechanism used for both generations
Concurrent Mark-Compact
A Loaded Value Barrier (LVB) is central to the algorithm
Every heap reference is verified as “sane” when loaded
“Non-sane” refs are caught and fixed in a self-healing barrier
Refs that have not yet been “marked through” are caught
Guaranteed single-pass concurrent marker
Refs that point to relocated objects are caught
Lazily (and concurrently) remap refs, no hurry
Relocation and remapping are both concurrent
Uses “quick release” to recycle memory
Forwarding information is kept outside of object pages
Physical memory released immediately upon relocation
“Hand-over-hand” compaction without requiring empty memory
The C4 GC Cycle
Mark Phase
Mark phase finds all live objects in the Java heap
Concurrent, predictable: always completes in a single pass
Uses LVB to defeat concurrent marking races
Tracks which object references have been traversed using an “NMT” (not marked through) metadata state in each object reference
Any access to a not-yet-traversed reference will trigger the LVB
Triggered references are queued on collector work lists, and the reference’s NMT state is corrected
“Self healing” corrects the memory location that the reference was loaded from
Marker tracks total live memory in each memory page
Compaction uses this to go after the sparse pages first
(But each cycle will tend to compact the entire heap…)
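The NMT half of the barrier can be modeled in a few lines. This is a conceptual sketch only: the real LVB is a few JIT-emitted instructions operating on metadata bits inside the reference itself, and every name below is invented.

```java
// Models the LVB's marking role: a loaded reference whose NMT state says
// "not yet marked through" is queued for the marker, its NMT state is
// corrected, and the corrected value is stored back to the slot it was
// loaded from ("self healing"), so that slot never triggers again.
class LvbDemo {
    static final class Ref {
        boolean markedThrough;        // the "NMT" metadata state
    }

    private int queuedForMarking = 0; // stands in for collector work lists

    Ref loadBarrier(Ref[] slots, int i) {
        Ref ref = slots[i];
        if (ref != null && !ref.markedThrough) {
            queuedForMarking++;       // hand the object to the marker
            ref.markedThrough = true; // correct the NMT state
            slots[i] = ref;           // heal the loaded-from location
        }
        return ref;
    }

    int queuedForMarking() { return queuedForMarking; }
}
```

The self-healing store-back is what bounds the work: each reference slot can trigger at most once per cycle, regardless of how often the mutator reads it, which is why the marker is oblivious to mutation rate.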
Relocate Phase
Compacts to reclaim heap space occupied by dead objects in “from” pages without stopping the mutator
Protects “from” pages
Uses LVB to support concurrent relocation and lazy remapping by triggering on any access to references to “from” pages
Relocates any live objects to newly allocated “to” pages
Maintains forwarding pointers outside of “from” pages
Virtual “from” space cannot be recycled until all references to relocated objects are remapped
“Quick Release”: Physical memory can be immediately reclaimed, and used to feed further compaction or allocation
Remap Phase
Scans all live objects in the heap
Looks for references to previously relocated objects, and updates (“remaps”) them to point to the new object locations
Uses LVB to support lazy remapping
Any access to a not-yet-remapped reference will trigger the LVB
Triggered references are corrected to point to the object’s new location by consulting forwarding pointers
“Self healing” corrects the memory location the reference was loaded from
Overlaps with the next mark phase’s live object scan
Mark & Remap are executed as a single pass
Sustainable Remap Rates….
Per 2MB of allocation: map… remap/protect… unmap…
Need to keep up with sustained allocation rate
A modern x86 core will happily generate ~0.5GB/sec of garbage
(m)remapping pages is only a small part of the GC cycle
Healthy GC duty cycle at ~20%, mremap is ~5% of GC cycle
So need to sustain 100s of GB/sec in mremap rate…
Linux remaps sustain <1GB/sec
Dominated by unneeded semantics
TLB invalidates, 4KB mappings, global locking, …
Zing remaps sustain >6TB/sec (through ZST memory service)
Avoids in-process implicit TLB invalidates, uses 2MB mappings
Summary
The Application Memory Wall is HERE, NOW
Driven by the detrimental link between scale and responsiveness
Solving a handful of problems can lead to a breakthrough:
Robust concurrent marking
[Concurrent] compaction
Non-monolithic-STW young generation collection
All at modern server scales
Solving it will [hopefully] allow applications to resume their natural rate of consuming computer capacity
Implications of breaking past the Application Memory Wall
Improve quality of current systems:
Better & consistent response times, stability & availability
Reduce complexity, time to market, and cost
Scale Better: Large or variable number of concurrent users
High or variable transaction rates
Large data sets
Change how things are done: Aggressive Caching, in-memory data processing
Multi-tenant, SaaS, PaaS
Cloud deployments
Build applications that were not possible before…
©2012 Azul Systems, Inc.
How can we break through the Application Memory Wall?
Simple: Deploy Zing on Linux
©2012 Azul Systems, Inc.
How is Azul’s Java Platform Different?
Same JVM standard
Licensed Sun Java-source based JVM
Enhanced HotSpot, fully Java compatible
Passes the Java Compatibility Kit (JCK) server-level compatibility suite (~53,000 tests)
A different approach
Garbage is good!
Designed with insight that worst case must eventually happen
Unique values
Highly scalable … 100s of GB with consistent low pause times – other JVMs will have longer
“stop-the-world” pauses in proportion to the size of the JVM heap and memory allocation rate
Elastic memory … insurance for JVMs to handle dynamic load – unlike other JVMs which are
rigidly tuned
Collects New Garbage and Old Garbage concurrently with running application threads …
there is no “stop-the-world” for GC purposes (you will only see extremely short pause times to
reach safepoints) – unlike other JVMs which will eventually stop-the-world.
Compacts Memory concurrently with your application threads running … Zing will move
objects without “stop-the-world” or single-threading – which is a major issue with other JVMs
Measuring pause times from FIRST thread stopped (unlike other JVMs)
Rich non-intrusive production visibility with ZVision and ZVRobot
WYTIWYG (What You Test Is What You Get)
©2012 Azul Systems, Inc.
Azul’s Proposition
Performance
Pauseless Garbage Collection
Heap sizes of 100s of GB
Concurrent Applications or Single Large Resource Pool
Assurance
Every application afforded ability to scale when needed
When spikes occur, additional resources granted in ms
Ensures performance & eliminates application crashes
Visibility
Visibility down to thread level
Quickly eliminate constraints
Reduce Development Cycles
“What You Test Is What You Get”
“WYTIWYG”
Consolidation
Lower TCO
Reduction – Power and Cooling
Reduction of Complexity
Reduction of Operations overhead
Cost Avoidance for Additional Servers
©2012 Azul Systems, Inc.
Use Cases: Zing Applications
Better & consistent response times
Greater stability & availability
Reduce complexity, time to market, and cost
Large or variable number of concurrent users
High or variable transaction rates
Large data sets
Caching, In-memory data processing
ESBs, SOA, Messaging
Multi-tenant, Platform-as-a-Service (PaaS)
Virtualized & Cloud deployments
©2012 Azul Systems, Inc.
Q & A
How can we break through the Application Memory Wall?
http://www.azulsystems.com
Simple: Deploy Zing on Linux
©2012 Azul Systems, Inc.
G. Tene, B. Iyengar and M. Wolf.
C4: The Continuously Concurrent Compacting Collector.
In Proceedings of the International Symposium on Memory
Management (ISMM ’11), ACM, pages 79–88.
Jones, Richard; Hosking, Antony; Moss, Eliot (25 July 2011).
The Garbage Collection Handbook:
The Art of Automatic Memory Management.
CRC Press. ISBN 1420082795.
©2012 Azul Systems, Inc.
ZING VISION
©2012 Azul Systems, Inc.
Problem: JVMs are Black Boxes
Java has a good ecosystem of Dev/Test profiling tools
Deep, sophisticated instrumentation
Always comes at a cost (sometimes 5%, sometimes 10x)
Production applications run into problems
This is the real world…
Some problems make it through QA
Some real world loads were never seen in the lab
Production-time visibility is poor
When problems are escalated, problem solvers lack tools for
diagnosing and resolving cause
Can’t turn on lab visibility tools
The application is already having a problem
Adding any instrumentation load will make it worse
©2012 Azul Systems, Inc.
Solution: Zing Vision
Non-intrusive, Zero-overhead Visibility
Zing Runtime ALWAYS collects instrumentation data
Side-effects of work the runtime has to do anyway
e.g. JIT Compilers need to track hot code anyway
e.g. GC needs to scan all objects anyway
May as well keep information for production-time viewing
Information used without fear of hurting application
Publishes data in accessible XML structures (can be saved to disk by ZVRobot)
No expensive polling or impacting requests, no JNI calls
Zing Vision provides deep, drill-down detail
Hot code, Hot Threads
Lock contention & deadlocks
Memory behavior, mix, live objects, allocation rates, GC stats.
Etc...
Production problems get diagnosed in 1/10th the time
Launch quickly, then optimize as-you-go
©2012 Azul Systems, Inc.
Rich, Detailed Performance Data
Lock contention & deadlocks
Hotspots & method optimization
Down to the byte code level!
Memory profiling & GC statistics
Memory leak detection
Allocation rates
Thread profiling
Ticks data
Monitors
©2012 Azul Systems, Inc.
Zing Vision Development Deployment
[Diagram: MyApp1 started with the Zing VM on Application Host 1; ZVision runs on a separate machine or a developer’s PC/laptop, connects to the ARTAPort on the Zing JVM, and polls using HTTP GET]
©2012 Azul Systems, Inc.
Zing Vision Production Deployment
[Diagram: MyApp1 and MyApp2 started with the Zing VM on Application Host 1, and MyApp3 on Application Host 2; ZVision runs on a separate machine]
©2012 Azul Systems, Inc.
Zing Vision Robot Production Deployment
[Diagram: MyApp1 and MyApp2 started with the Zing VM on Application Host 1, and MyApp3 on Application Host 2; ZVRobot runs on a separate machine]
ZVRobot connects to the ARTAPort on each Zing JVM and polls
using HTTP GET at a configured time interval
– data collected is saved as XML files.
ZVRobot can be started & stopped as required to monitor an application
©2012 Azul Systems, Inc.
Zing Vision : Always-On Visibility in
Production Systems
[Always] run with the Zing ARTAPort enabled
if you want to use Zing Vision / ZVRobot to:
Diagnose unplanned, unanticipated situations
Quickly resolve production issues as they arise
Increase application reliability and performance
Optimize performance of systems at full load
Deploy applications faster, with less tuning
Track resources in real-time at the instance level
What You Test Is What You Get
©2012 Azul Systems, Inc.
AZUL C4
More about how the Azul JVM and C4 works
Copyright ©2010-2012 Azul Systems, Inc. | Azul Company Confidential
©2012 Azul Systems, Inc.
Tiered Compilation and other optimizations
“Pauseless GC” a.k.a. C4
JVM Innovations
©2012 Azul Systems, Inc.
Tiered compilation by default
Dynamic self-sizing thread pools for
compilers
High-performance implementation of Locking
…
JVM Optimizations
©2012 Azul Systems, Inc.
Interpreter: “dictionary approach”, non-optimized
C1: 5x-10x better performance than interpretation
Quick, good for application start-up times
C2: 30%-50% better performance than C1
More profiling time/footprint, but much better performance for server applications
Tiered compilation – the best of both worlds
Using both C1 and C2
Re-use of C1-profiling, while also inserting additional performance counters
Fast code immediately, optimal code over time
Tiered Compilation
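The interpreter → C1 → C2 progression above can be seen with a trivial, self-contained probe (a sketch for illustration, not a benchmark; timings vary by machine and JVM):

```java
// Trivial warm-up probe: the same method typically gets faster over the
// first rounds as the JIT promotes it from interpreter to C1 to C2.
// Absolute numbers vary widely; only the trend is interesting.
final class WarmupProbe {
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += (long) i * i;
        return sum;
    }

    static long timeOnceNs(int n) {
        long t0 = System.nanoTime();
        work(n);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        for (int round = 1; round <= 5; round++) {
            long best = Long.MAX_VALUE;
            for (int i = 0; i < 2_000; i++) {
                best = Math.min(best, timeOnceNs(10_000));
            }
            System.out.printf("round %d: best %d ns%n", round, best);
        }
    }
}
```

Running with the standard HotSpot flag `-XX:+PrintCompilation` shows the C1/C2 tier transitions as they happen.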
©2012 Azul Systems, Inc.
Remember?
Garbage is Good!
Fragmentation is Bad!
There are ways to delay the worst case, but it will
eventually happen
Compaction (moving objects together) is the only way to
deal with fragmentation
For other JVMs, moving objects is expensive, as the world
needs to stop to update references…
©2012 Azul Systems, Inc.
How Zing’s Continuous Concurrent
Compacting Collector Works…
Object
Ref A
Ref B
Virtual Address Space
Virtual Address Space
Fragmented Memory Page (From)
Empty Memory Page (To)
• Memory is Fragmented
©2012 Azul Systems, Inc.
How Zing’s Continuous Concurrent
Compacting Collector Works…
Object
Ref A
Ref B
Virtual Address Space
Virtual Address Space
Fragmented Memory Page (From)
Compacted Memory Page (To)
• Memory is compacted by
moving page contents
• Fragmented Memory Page
is now considered ‘empty’
and returned to C4 for reuse
©2012 Azul Systems, Inc.
How Zing’s Continuous Concurrent
Compacting Collector Works…
Object
Ref A
Ref B
Virtual Address Space
Virtual Address Space
Fragmented Memory Page (From)
Compacted Memory Page (To)
Check
• Ref A is accessed
• Memory reference is checked by Zing
©2012 Azul Systems, Inc.
How Zing’s Continuous Concurrent
Compacting Collector Works…
Object
Ref A
Ref B
Virtual Address Space
Virtual Address Space
Fragmented Memory Page (From)
Compacted Memory Page (To)
• This instance of Ref A is
updated or ‘self-healed’
• Other live references to Ref A
will be updated during
Mark/Relocate Phase
©2012 Azul Systems, Inc.
Mark phase finds all live objects in the Java heap
Concurrent & predictable: always completes in a single pass
Uses LVB to defeat concurrent marking races
Tracks object references that have been traversed by using an
“NMT” (not marked through) metadata bit in each object reference
Any access to a not-yet-traversed reference will trigger the LVB
Triggered references are queued on collector work lists, and reference NMT state is
corrected
“Self healing” corrects the memory location that the reference was loaded from
Marker tracks the total live memory in each memory page
Compaction uses this to go after the sparse pages first
(But each cycle will tend to compact the entire heap…)
Mark Phase
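The “go after the sparse pages first” policy at the end of this slide can be modeled in a few lines (a toy sketch; `PagePicker` and its per-page live-byte array are invented for illustration and are not Zing’s data structures):

```java
import java.util.Arrays;
import java.util.Comparator;

// Toy model of compaction candidate selection: pages with the least live
// data are relocated first, since each one frees the most address space
// per byte of live data copied.
final class PagePicker {
    static final int PAGE_BYTES = 2 * 1024 * 1024;   // 2MB pages, as in Zing

    // liveBytesPerPage is what the marker tracked during the mark phase.
    // Returns page indices in compaction order (sparsest first).
    static int[] compactionOrder(int[] liveBytesPerPage) {
        Integer[] idx = new Integer[liveBytesPerPage.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingInt(i -> liveBytesPerPage[i]));
        int[] out = new int[idx.length];
        for (int i = 0; i < idx.length; i++) out[i] = idx[i];
        return out;
    }
}
```

As the slide notes, each cycle still tends to compact the entire heap; the ordering only determines which pages pay off first.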
©2012 Azul Systems, Inc.
The C4 GC Cycle
©2012 Azul Systems, Inc.
Compacts to reclaim heap space occupied
by dead objects in “from” pages without
stopping mutator
Protects “from” pages (virtual address
space)
Uses LVB to support concurrent relocation
and lazy remapping by triggering on any
access to references to “from” pages
Relocates any live objects to newly
allocated “to” pages
Maintains forwarding pointers outside of
“from” pages
Virtual “from” space cannot be recycled
until all references to relocated objects are
remapped
“Quick Release”: Physical memory can be
immediately reclaimed, and used to feed
further compaction or allocation
Relocate Phase
©2012 Azul Systems, Inc.
Scans all live objects in the heap
Looks for references to previously relocated objects, and updates (“remaps”)
them to point to the new object locations
Uses LVB to support lazy remapping
Any access to a not-yet-remapped reference will trigger the LVB
Triggered references are corrected to point to the object’s new location by consulting
forwarding pointers
“Self healing” corrects the memory location the reference was loaded from
Overlaps with the next Mark phase’s live object scan
Mark & Remap are executed as a single pass
Remap Phase
©2012 Azul Systems, Inc.
The C4 GC Cycle
©2012 Azul Systems, Inc.
High Performance Memory Functionality
High-performance functionality requires physical memory management and
control beyond that provided by standard Linux virtual memory functionality.
in-process recycling of memory and in-process memory free lists can dramatically dampen
TLB invalidate requirements on allocation or deallocation edges.
in-process physical memory free lists are necessary to sustain a high rate of new mappings
(e.g. 20GB/sec of sustained random, disjoint map/remap/unmap operations)
Support for mappings with multiple and mixed page sizes
Including transitioning of mapped addresses from large to small page mappings, or small to
large.
©2012 Azul Systems, Inc.
Virtual and Physical modules for Linux
Support for very high sustained mapping modification rates:
Allowing concurrent modifications within the same address space
Allowing user to [safely] indicate lazy TLB invalidation and thereby dramatically reduce per
change costs
Supporting fast, safe application of very large "batch" sets of mapping modifications
(remaps and mprotects), such that all changes become visible within the same, extremely
short period of time.
Support for large number of disjoint mappings with arbitrary manipulations at
high rates
See http://www.managedruntime.org/downloads
©2012 Azul Systems, Inc.
Per 2MB of allocation: map… remap/protect… unmap…
Need to keep up with sustained allocation rate
A modern x86 core will happily generate ~0.5GB/sec of garbage
(m)remapping pages is only a small part of the GC cycle
Healthy GC duty cycle at ~20%, mremap is ~5% of GC cycle
So need to sustain 100s of GB/sec in mremap rate…
Linux remaps sustain <1GB/sec
Dominated by unneeded semantics
TLB invalidates, 4KB mappings, global locking, …
Zing’s Enhanced kernel
supports >6TB/sec sustained remap rates
Avoids in-process implicit TLB invalidates, uses 2MB mappings
Sustainable Remap Rates….
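The “100s of GB/sec” figure follows directly from the arithmetic on this slide; a minimal sketch, assuming the slide’s numbers (cores generating ~0.5 GB/sec of garbage each, a ~20% GC duty cycle, and remapping taking ~5% of the cycle):

```java
// Back-of-envelope arithmetic behind the slide: if remapping may consume
// only ~5% of a GC cycle, and GC runs at a ~20% duty cycle, then remap
// bandwidth must cover the full allocation rate in ~1% of wall-clock time.
final class RemapMath {
    static double requiredRemapGBps(int cores, double garbagePerCoreGBps,
                                    double gcDutyCycle, double remapShare) {
        double allocGBps = cores * garbagePerCoreGBps;        // garbage generated
        double remapTimeFraction = gcDutyCycle * remapShare;  // wall time for remap
        return allocGBps / remapTimeFraction;
    }

    public static void main(String[] args) {
        // 8 cores x 0.5 GB/sec, 20% duty cycle, remap is 5% of the cycle:
        System.out.printf("%.0f GB/sec required%n",
                requiredRemapGBps(8, 0.5, 0.20, 0.05));
    }
}
```

With these assumptions an 8-core machine already needs ~400 GB/sec of sustained remap bandwidth, which is why Linux’s <1 GB/sec falls short while Zing’s >6 TB/sec has headroom.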
©2012 Azul Systems, Inc.
Sustained Remap Rates

Active threads   Mainline Linux   w/Azul Memory Module   Speedup
      1          3.04 GB/sec      6.50 TB/sec            >2,000x
      2          1.82 GB/sec      6.09 TB/sec            >3,000x
      4          1.19 GB/sec      6.08 TB/sec            >5,000x
      8          897.65 MB/sec    6.29 TB/sec            >7,000x
     12          736.65 MB/sec    6.39 TB/sec            >8,000x
©2012 Azul Systems, Inc.
Remap Commit Rates….
Remap/protection must be consistent across mutator threads
Each “batch” of relocated pages needs synchronization
In practical terms, we bring mutators to safe point, and flip pages
Using Linux mremap(), protecting 16GB would take ~20 sec.
Zing’s Enhanced kernel
supports >800TB/sec remap commit rates
Uses a shadow table and a batch remap/protect ops API
Accumulated batch operations are not visible until committed
Commits shadow table using ~1 pointer copy per GB
Protecting 16GB takes about ~22 usec…
©2012 Azul Systems, Inc.
Remap Commit Rates

Active threads   Mainline Linux           w/Azul Memory Module       Speedup
      0          43.58 GB/sec (360 ms)    4734.85 TB/sec ( 3 usec)   >100,000x
      1           3.04 GB/sec (5 sec)     1488.10 TB/sec (11 usec)   >480,000x
      2           1.82 GB/sec (8 sec)     1166.04 TB/sec (14 usec)   >640,000x
      4           1.19 GB/sec (13 sec)     913.74 TB/sec (18 usec)   >750,000x
      8         897.65 MB/sec (18 sec)     801.28 TB/sec (20 usec)   >890,000x
     12         736.65 MB/sec (21 sec)     740.52 TB/sec (22 usec)   >1,000,000x

* Commit rate and (time it would take to commit 16GB)
©2012 Azul Systems, Inc.
Same approach used for both generations
Concurrent Mark-Compact
A Loaded Value Barrier (LVB) is central to the algorithm
Every heap reference is verified as “sane” when loaded
“Non-sane” refs are caught and fixed in a self-healing barrier
Refs that have not yet been “marked through” are caught
Guaranteed single pass concurrent marker
Refs that point to relocated objects are caught
Lazily (and concurrently) remap refs, no hurry
Relocation and remapping are both concurrent
Uses “quick release” to recycle memory
Forwarding information is kept outside of object pages
Physical memory released immediately upon relocation
“Hand-over-hand” compaction without requiring empty memory
Pauseless GC, a.k.a. “C4”
A taste of the secret sauce
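The barrier behavior summarized above can be sketched in runnable form. All names, the bit layout, and the side tables here are illustrative assumptions, not Zing’s implementation; the actual LVB is a few JIT-emitted machine instructions on every reference load:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the Loaded Value Barrier (LVB): every loaded
// reference is verified as "sane"; a non-sane reference is fixed exactly
// once, at the memory slot it was loaded from ("self healing").
final class LvbSketch {
    static final long NMT_BIT = 1L << 62;   // "not marked through" metadata bit
    static long expectedNmt = NMT_BIT;      // expected NMT value this mark cycle
    static int markQueueDepth = 0;          // stand-in for collector work lists

    // Hypothetical forwarding table for relocated objects: from -> to address.
    static final Map<Long, Long> forwarding = new HashMap<>();

    static long loadRef(long[] heapSlots, int slot) {
        long ref = heapSlots[slot];
        if ((ref & NMT_BIT) != expectedNmt) {          // marking race caught
            markQueueDepth++;                           // queue for the marker
            ref = (ref & ~NMT_BIT) | expectedNmt;       // correct NMT state
        }
        Long to = forwarding.get(ref & ~NMT_BIT);
        if (to != null) {                               // points into a "from" page
            ref = to | (ref & NMT_BIT);                 // lazy remap via fwd pointer
        }
        heapSlots[slot] = ref;                          // self-heal the source slot
        return ref;
    }
}
```

Because the healed slot is rewritten, a second load of the same slot passes both checks without triggering the barrier again, which is what makes the single-pass marker and lazy remapping cheap in steady state.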
©2012 Azul Systems, Inc.
Use Cases: Zing Applications
Better & consistent response times
Greater stability & availability
Reduce complexity, time to market, and cost
Large or variable number of concurrent users
High or variable transaction rates
Large data sets
Caching, In-memory data processing
ESBs, SOA, Messaging
Multi-tenant, Platform-as-a-Service (PaaS)
Virtualized & Cloud deployments
©2012 Azul Systems, Inc.
For more information on…
JDK internals: http://openjdk.java.net/ (JVM source code)
Memory management:
http://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf
(a bit old, but very comprehensive)
Tuning:
http://download.oracle.com/docs/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/tune
_stable_perf.html (watch out for increased rigidity and re-tuning pain)
Generational Pauseless Garbage Collection:
http://www.azulsystems.com/webinar/pauseless-gc (webinar by Gil Tene, 2011)
Compiler internals and optimizations:
http://www.azulsystems.com/blogs/cliff (Dr Cliff Click’s blog)
Additional Resources
©2012 Azul Systems, Inc.
Additional Material
©2012 Azul Systems, Inc.
Multimodal pauses in Financial Trading?
Mean = 10 milliseconds
Std. dev. = 1.0
you are in the market 90% of the time…
[Chart: multimodal latency distribution – x-axis Latency (milliseconds), 0–140; Mode 1 falls within tolerable latency, Mode 2 and Mode 3 lie further out]
Not in the market – can’t make money… but loss unlikely
Out of the market for too long – may have a loss or
a risk position (can’t sell) = compliance issue
©2012 Azul Systems, Inc.
Multimodal pauses in eCommerce?
Mean = 10 deciseconds
Std. dev. = 1.0
you are in the market 90% of the time…
[Chart: multimodal latency distribution – x-axis Latency (deciseconds), 0–140; Mode 1 is labeled “Normal Trading”, Mode 2 and Mode 3 lie further out]
Customer dissatisfaction … where did my transaction go?
Abandoned shopping cart, went to another
web site. Lost customer – perhaps for good
©2012 Azul Systems, Inc.
Example of naïve %’ile measurement
[Chart: response time vs. elapsed time – the system easily handles 100 requests/sec, responding to each in 1 msec, then stalls for 100 sec]
How would you characterize this system?
Naïve results: 10,000 @ 1 msec, 1 @ 100 seconds
Naïve characterization: 99.99% below 1 sec !!!
©2012 Azul Systems, Inc.
Proper measurement
[Chart: response time vs. elapsed time – same system: 100 requests/sec at 1 msec each, with a 100 sec stall; corrected results include 10,000 samples varying linearly from 100 sec down to 10 msec, plus 10,000 samples @ 1 msec each]
Proper characterization: 50% below 1 second
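The naive vs. proper numbers on these two slides can be reproduced with simple arithmetic (a sketch of the coordinated-omission correction, using the slides’ workload figures):

```java
import java.util.ArrayList;
import java.util.List;

// Numeric version of the two percentile slides: 100 requests/sec, 1 msec
// normal response time, and one 100-second stall.
final class PercentileDemo {
    static double percentBelow(List<Double> samplesMs, double thresholdMs) {
        long below = samplesMs.stream().filter(s -> s <= thresholdMs).count();
        return 100.0 * below / samplesMs.size();
    }

    static List<Double> naive() {
        List<Double> s = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) s.add(1.0);   // 100 sec of 1 msec results
        s.add(100_000.0);                               // the single stalled request
        return s;
    }

    static List<Double> corrected() {
        List<Double> s = naive();
        // The 10,000 requests that should have been issued during the stall
        // would have seen latencies falling linearly from 100 sec to 10 msec.
        for (int i = 1; i < 10_000; i++) s.add(100_000.0 - i * 10.0);
        return s;
    }

    public static void main(String[] args) {
        System.out.printf("naive:     %.2f%% below 1 sec%n",
                percentBelow(naive(), 1_000.0));
        System.out.printf("corrected: %.2f%% below 1 sec%n",
                percentBelow(corrected(), 1_000.0));
    }
}
```

The naive data set reports over 99.99% of requests below one second; once the missing requests are accounted for, only about half are.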
©2012 Azul Systems, Inc.
©2012 Azul Systems, Inc.
How Zing ZST Memory Service provisions memory
©2012 Azul Systems, Inc.
Zing Resource Controller (ZRC)
©2012 Azul Systems, Inc.
OTHER FREE TOOLS
jHiccup | Fragger | Azul Inspector
©2012 Azul Systems, Inc.
So we built some new tools
jHiccup
Fragger
Both open sourced and available on Azul Website
©2012 Azul Systems, Inc.
jHiccup and Fragger: what they do
jHiccup – characterizes response time
Measures system/platform lag, under your application load
Reports accumulated counts of delay occurrences
Can measure latency between a client and Java application node or
within the node
Fragger – heap fragmentation inducer
Generates large sets of objects of a given size
Prunes each set to a smaller, remaining live set
Increases object size between passes until compaction is inevitable
Ages data sets to avoid artificial early compaction by young generation
collectors
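jHiccup’s core measurement idea fits in a few lines (a minimal sketch in the spirit of the tool, not its actual source): sleep for a fixed interval and record how much longer the wake-up actually took; the excess is the platform “hiccup” your application would also have experienced.

```java
// Minimal hiccup-meter sketch: any GC pause, OS stall, or virtualization
// "hiccup" shows up as extra time beyond the intended sleep interval.
final class HiccupMeter {
    static long hiccupMs(long intendedSleepMs) throws InterruptedException {
        long t0 = System.nanoTime();
        Thread.sleep(intendedSleepMs);
        long actualMs = (System.nanoTime() - t0) / 1_000_000;
        return Math.max(0, actualMs - intendedSleepMs);   // excess = hiccup
    }

    public static void main(String[] args) throws InterruptedException {
        long worst = 0;
        for (int i = 0; i < 100; i++) {
            worst = Math.max(worst, hiccupMs(10));
        }
        System.out.println("worst observed hiccup: " + worst + " ms");
    }
}
```

Because the measuring thread does no work of its own, anything it observes is platform-induced delay, which is exactly the quantity jHiccup accumulates into its percentile reports.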
©2012 Azul Systems, Inc.
Summary
Common pitfalls, but easy to overcome
Very simple tools can help a lot
Platforms that run smoothly are good
©2012 Azul Systems, Inc.
How to get Fragger and jHiccup
Fragger:
http://www.azulsystems.com/resources/tools/fragger
jHiccup:
http://www.azulsystems.com/dev_resources/jhiccup
Also useful: Azul Inspector Environment Checker
Azul Inspector is a Java program designed to collect information about a
target Java application and its server environment. Developers, IT and
performance engineers can use Azul Inspector to quickly determine the
JDK version in use, maximum heap size setting and the values of a
variety of other setup variables.
©2012 Azul Systems, Inc.
Azul Inspector
©2012 Azul Systems, Inc.
Azul application
characterization mantra
Throughput without response time is meaningless
Sustainable throughput is all that matters
©2012 Azul Systems, Inc.
Q & A
©2012 Azul Systems, Inc.
For More Information: Web: www.azulsystems.com
jHiccup: www.azulsystems.com/jhiccup
Fragger: www.azulsystems.com/resources/tools/fragger
Azul Inspector:
www.azulsystems.com/dev_resources/azul_inspector
Java Developer Webinars:
http://www.azulsystems.com/resources/tools#webinars
Zing Free Trial: www.azulsystems.com/trial
©2012 Azul Systems, Inc.
The Zing Platform: Application Benefits
If you want to:
Improve response times
Increase transaction rates
Increase concurrent users
Forget about GC pauses
Eliminate daily restarts
Elastically grow during peaks
Elastically shrink when idle
Gain production visibility
The Zing™ Platform
On commodity H/W
©2012 Azul Systems, Inc.
32-bit versus 64-bit JVMs