Trust but verify
A year with Cassandra and the hunt for native memory JVM leaks
Chris Burroughs
Clearspring
2011-08-22
Chris Burroughs (Clearspring) Trust but verify 2011-08-22 1 / 34
Table of Contents
1 Introduction
2 Cassandra at Clearspring
3 Some Definitions
4 Time-line of the Hunt
5 Conclusions
Hello!
Chris Burroughs [email protected]
Active in the Apache Cassandra and (incubating) Kafka communities
A few mostly minor tickets: 1966, 2082, 2551
http://www.meetup.com/Cassandra-DC-Meetup/
We are hiring
http://www.clearspring.com/about/careers
What is this talk about?
Some of what we learned after using Cassandra for a year.
Particularly as we struggled with unbounded RES growth. Most of this is applicable to any JVM program.
I’ve tried to explain things in the order that makes sense, not in the chronological order in which we figured them out. (But feel free to ask questions.)
Disclaimers
I have come out of this with a generally positive view of Cassandra, even though getting there sucked.
This is mostly about what I learned; to the extent that there were “discoveries”, they were made by others.
2 Cassandra at Clearspring
Sharecounter
Capacity planning conundrum: The counter will account for between 0 and 100% of views within ? days?/weeks?/months?
Primary considerations: Proven, incremental, horizontal scalability. Tolerance to individual node failures.
Tangent: Counters
We did not use CASSANDRA-1072 counters.
(We probably will in the future, depending on the results of SSTable compression tests.)
3 Some Definitions
JVM: Cocoon
JVM, on heap: “Normal” place for allocation. You can set a max size of n bytes.
  - Max heap size seems to get a reasonable amount of respect from the JVM.
  - But the heap can fragment and take up more than n bytes. This is difficult to detect.
JVM, off heap: Give me some bytes! You can either use DirectByteBuffers yourself, or it’s likely that you use a library that does (NIO).
JVM, permgen: Classes and stuff like that.
Other: Hotspot is a C++ program. It can use memory for whatever it needs to do.
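A minimal sketch of how the JVM-visible regions above can be inspected from inside a process. It assumes Java 7 or later (for BufferPoolMXBean); the class name JvmMemoryRegions is made up for illustration. Memory that Hotspot itself allocates natively does not show up in any of these numbers.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reports used bytes for the regions the JVM knows about.
public class JvmMemoryRegions {

    /** Used bytes per region: heap, non-heap (permgen and friends), and NIO buffer pools. */
    public static Map<String, Long> snapshot() {
        Map<String, Long> used = new LinkedHashMap<>();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        used.put("heap", memory.getHeapMemoryUsage().getUsed());
        used.put("non-heap", memory.getNonHeapMemoryUsage().getUsed());
        // Buffer pools cover direct and mapped byte buffers (off heap).
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            used.put("buffer-pool:" + pool.getName(), pool.getMemoryUsed());
        }
        return used;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> e : snapshot().entrySet()) {
            System.out.printf("%-24s %,d bytes%n", e.getKey(), e.getValue());
        }
    }
}
```

If the sum of these numbers stays flat while the process keeps growing, the growth is probably somewhere the JVM does not account for.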
Linux: Harsh Reality
Resident set size: The least bad measure of how much memory a process is using.
mmap(2): mmap-ed files are counted as part of your PID’s RSS. This reduces visibility (have fun with pmap and friends), but may be faster.
Linux does not care about your nice heap abstractions; it’s just another process.
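A rough sketch of comparing the kernel’s view (RSS) with the JVM’s view (committed heap) for the current process. It assumes a Linux /proc filesystem; the class and method names are invented for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: reads VmRSS from /proc/self/status and prints it next to the heap size.
public class RssVsHeap {

    /** Returns the VmRSS value in kB from /proc/<pid>/status lines, or -1 if it is absent. */
    public static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like "VmRSS:    123456 kB".
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines =
                Files.readAllLines(Paths.get("/proc/self/status"), StandardCharsets.UTF_8);
        long rssKb = parseVmRssKb(lines);
        long heapKb = Runtime.getRuntime().totalMemory() / 1024; // committed heap
        System.out.println("RSS (kB):            " + rssKb);
        System.out.println("Committed heap (kB): " + heapKb);
    }
}
```

Graphing both values over time makes it visible when RSS keeps climbing while the heap does not.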
Wizard is Out Of Mana
JVM: OutOfMemoryError → Nice log messages with a clue to what happened.
Linux: The kernel needs more memory → it kills processes until it’s satisfied. Check dmesg.
4 Time-line of the Hunt
First test stack failure
A node in the test stack dies on 2010-10-10 at 3:15pm.
Around this time there was a large and unexplained increase in CPU utilization.
$ dmesg | grep -i oom
syslogd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
java invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
That was weird. We decreased the max heap size and forgot about it.
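For anyone who wants to alert on this instead of grepping by hand, here is a small sketch that pulls process names out of oom-killer lines. The regular expression is an assumption based on the two dmesg lines on this slide (plus an optional timestamp prefix seen on newer kernels); other kernel versions may format the message differently. Note that the process that invoked the oom-killer is the one whose allocation triggered it, not necessarily the one that got killed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds which processes triggered the OOM killer in dmesg output.
public class OomKillerScan {
    // Optional "[timestamp] " prefix, then the process name, then the fixed phrase.
    private static final Pattern INVOKED =
            Pattern.compile("^(?:\\[[^\\]]*\\]\\s*)?(\\S+) invoked oom-killer");

    /** Returns the process names from every line that reports an oom-killer invocation. */
    public static List<String> invokers(List<String> dmesgLines) {
        List<String> names = new ArrayList<>();
        for (String line : dmesgLines) {
            Matcher m = INVOKED.matcher(line);
            if (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "syslogd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0",
                "java invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0");
        System.out.println(invokers(sample)); // prints [syslogd, java]
    }
}
```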
About a month later...
All production servers die within an hour or so of each other.
On failures
We often model as if failures are uncorrelated.
This isn’t really true for hardware (e.g. same model disks), but it definitely is not true for software.
Monitoring!
We get a graph like this:
More Monitoring!
Start rolling restarts every few weeks.
January: reduced cached mem; resident set size growth
Armed with graphs, we started posting on cassandra-users.
1 Hotspot version? (we upgraded, no difference)
2 permgen? (nope, checked that)
3 mmap? (nope, disabled that a long time ago)
4 swap? (not currently swapping)
5 Heap fragmentation? (well, that’s interesting; have fun with jemalloc)
This smelled like a JVM/glibc/kernel bug, but we are faced with the fact that it only occurs when we are running Cassandra.
Tangent: Rolling restarts and caches
Refresher on caches:
key cache: Caches location of keys
row cache: Caches entire rows
Also, the OS page cache
Cassandra can persist the entire key cache, and can persist the row keys for the row cache, but not the rows themselves.
Tangent: Rolling restarts and caches, shoot ourselves in the foot
Before cache saving:
Size the row cache to get the best hit rate vs. heap size trade-off.
Restart node.
Node can’t handle reads and drops messages for a while. Not safe to restart another one until it stops.
After:
Size the row cache to get the best hit rate vs. heap size trade-off.
Persist row cache keys.
Restart node.
Wait half an hour for all rows to be read; the node now has a pile of hinted handoffs to deal with.
Tangent: Rolling restarts and caches, right answer
Options:
1 Something hacky to save row values along with row keys and be inconsistent.
2 Something hacky to save a random set of row keys and hope that helps.
3 Modify CLHM (ConcurrentLinkedHashMap) to allow traversal in hotness order.
4 Recognize that this is a sign you need more capacity.
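To make option 3 concrete, here is a sketch of “traversal in hotness order” using a plain access-ordered java.util.LinkedHashMap as a single-threaded stand-in. It is not CLHM and not Cassandra’s code; “hotness” is assumed to mean recency of access, and the HotKeys class is invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;

// Hypothetical helper: picks the n most recently used keys so only those are saved on shutdown.
public class HotKeys {

    /** Returns up to n keys from an access-ordered map, most recently used first. */
    public static <K, V> List<K> hottest(LinkedHashMap<K, V> accessOrdered, int n) {
        // An access-ordered LinkedHashMap iterates from least to most recently used.
        List<K> keys = new ArrayList<>(accessOrdered.keySet());
        Collections.reverse(keys); // now hottest first
        return new ArrayList<>(keys.subList(0, Math.min(n, keys.size())));
    }

    public static void main(String[] args) {
        // The third constructor argument (true) selects access order instead of insertion order.
        LinkedHashMap<String, String> cache = new LinkedHashMap<>(16, 0.75f, true);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.put("c", "3");
        cache.get("a"); // touching "a" makes it the hottest entry
        System.out.println(hottest(cache, 2)); // prints [a, c]
    }
}
```

Saving only the hottest keys bounds the warm-up time after a restart, at the cost of a colder cache for everything else.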
Tangent: Rolling restarts and caches, CASSANDRA-1966
Ben Manes Google Alert:
This example it would be a fair usage and justification of ordered iteration. Its a trivial change, but its an enhancement I’ve avoided eagerly performing until a project considers it a worthwhile feature.
Cassandra 1.0 will have a “row cache keys to save” option.
CASSANDRA-2654
Ticket title: Work around native heap leak in sun.nio.ch.Util affecting IncomingTcpConnection
Java bug #6210541
Deep in the bowels of Java NIO is a weak-reference cache of direct byte buffers.
That’s a painfully broken design.
CASSANDRA-2654 works around it. But this isn’t really a “leak”, since eventually a full GC should clean them up.
More attempts
Tried to audit the use of DirectByteBuffers
  - -XX:MaxDirectMemorySize
Opened a ticket with Oracle.
  - Has not gone anywhere yet.
Survey on the user list
  - No pattern among kernel, OS, hotspot, or other software versions.
Hark, a Tweet!
http://twitter.com/#!/kimchy/status/90861039930970113
Java Bug 7066129
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class TestMemoryLeak {
    public static void main(String[] args) throws Exception {
        while (true) {
            List<GarbageCollectorMXBean> gcMxBeans = ManagementFactory.getGarbageCollectorMXBeans();
            for (GarbageCollectorMXBean gcMxBean : gcMxBeans) {
                ((com.sun.management.GarbageCollectorMXBean) gcMxBean).getLastGcInfo();
            }
        }
    }
}
CASSANDRA-2868
Several people verified that disabling the GCInspector (which calls GarbageCollectorMXBean#getLastGcInfo) keeps RSS from increasing.
There is a patch that tries to get similar data through another set of methods.
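One plausible way to get similar data without getLastGcInfo is to poll the standard getCollectionCount() and getCollectionTime() methods and report the difference between samples. This is only a sketch of that idea and is not necessarily what the CASSANDRA-2868 patch does; the GcDeltaSampler class is made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: reports GC activity between polls using only cumulative counters.
public class GcDeltaSampler {
    // Last seen {collection count, collection time in ms} per collector name.
    private final Map<String, long[]> previous = new HashMap<>();

    /** Returns {collections, millis} accumulated since the previous call for this collector. */
    public long[] delta(GarbageCollectorMXBean gc) {
        long[] now = { gc.getCollectionCount(), gc.getCollectionTime() };
        long[] before = previous.put(gc.getName(), now);
        if (before == null) {
            before = new long[] { 0L, 0L }; // first sample: report totals since JVM start
        }
        return new long[] { now[0] - before[0], now[1] - before[1] };
    }

    public static void main(String[] args) throws InterruptedException {
        GcDeltaSampler sampler = new GcDeltaSampler();
        for (int i = 0; i < 3; i++) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long[] d = sampler.delta(gc);
                System.out.println(gc.getName() + ": " + d[0] + " collections, " + d[1] + " ms");
            }
            Thread.sleep(1000);
        }
    }
}
```

The trade-off is detail: the cumulative counters give counts and pause time per interval, but not the per-collection before/after memory usage that getLastGcInfo exposes.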
Conclusions
Happy to be spending less time with Cassandra for a while.
There are bugs in Hotspot, your file system, RHEL5, and everything else you think is infallible.
I think page cache management is the open question right now.
Thoughts on Upcoming Cassandra changes
Very excited about the alternative SSTable format in CASSANDRA-674 and friends (type-specific data compression, compressed index, row cache as row+filter, etc.)
Once burned, twice shy: Terrified of off heap data structures, but it looks like we didn’t go down that path after all. (CASSANDRA-2252)
Questions?