As a performance architect I get called in to help with various production performance issues.
One of our recent production issues happened with a TIBCO BusinessEvents® (BE)
service constantly violating our Service Level Agreements (SLAs) after running 10-15
hours since the last restart. If we kept the services running longer, we would see
them crash with an “out of memory” exception. This is a typical sign of a classic
memory leak!
In this blog I’ll walk through the steps taken to analyze this issue in our production
environment, the tools we used, some background information on Java Garbage
Collection, and how this problem was resolved. I hope you find this useful!
TIBCO memory leak in our production environment
Performance engineering is the science of discovering problem areas in applications
under varying but realistic load conditions. It is not always easy to simulate real
traffic and find all problems before going live. It is therefore advisable to determine
how to analyze performance problems not only in test but also in a real production
environment. Having the right tools installed in production allows us to analyze issues
and find root causes that are hard to simulate in testing.
The following diagram visualizes our service layout. TIBCO BE is used as an inventory
availability rule engine, TIBCO BusinessWorks (BW) is used as a data access service,
and TIBCO ActiveSpaces is used as an in-memory data grid.
In order to avoid any SLA breaches, the operations team implemented a very typical,
yet not very elegant, damage-prevention measure: restart the services every 10-11 hours,
before they became too slow and eventually crashed with an out-of-memory
exception.
My team was tasked with finding the root cause of the increasing memory usage, which
was believed to also slow down the application before the actual crash happened. Our goal
was to fix the memory issue once and for all.
About the Author
Aftab Alam is a performance engineering
SME who has been working with various clients
of Infosys Limited since 2003. His focus areas are
optimizing performance test life cycles,
helping customers monitor and manage
software performance at various stages
of SDLC, and mentoring and enabling
resources in performance engineering and
APM solutions like Dynatrace.
AN EXCERPT FROM ABOUT:PERFORMANCE BLOG
TIBCO BusinessEvents® memory leak analysis in live production
Subscribe to the About:Performance blog here: http://apmblog.dynatrace.com/
Memory monitoring with Dynatrace AppMon
Our tool of choice was Dynatrace AppMon, which we already used
for live performance monitoring of all our services in production.
Let me walk you through the steps we took in Dynatrace to identify
the memory leak and its root cause. In case you want to try it
on your own, I suggest you:
1) Obtain your own Dynatrace Personal License
2) Watch the Dynatrace YouTube Tutorials
3) Read up on Java Memory Management
Step 1: Basic memory metrics monitoring
In the Dynatrace AppMon Client you start with Monitoring >
Infrastructure and then select your host server. We narrowed down the
problem by analyzing the key Java process and memory metrics that
are automatically monitored by Dynatrace AppMon, which pulls
these metrics natively through the JVMTI interface. Here is what we
found:
• No free memory available in the Old Generation heap space
• As a result, major Garbage Collection (GC) constantly kicked in to
try to make memory available for process execution
• High GC cycle times resulted in runtime suspensions
• High suspension times slowed down request processing
• The slowdown resulted in missed SLAs and errors thrown by
the API gateway (no response from BE)
Easy to identify cause and impact by looking at the Dynatrace Process Health Dashboard and all key health indicators: Memory, Garbage Collection, CPU, Threads and Passing Transactions.
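The same key metrics are also exposed by the JVM itself through the standard java.lang.management MXBeans, in case you want to cross-check what your monitoring tool shows. Here is a minimal sketch; note that the Old Generation pool name depends on the collector, e.g. “PS Old Gen” for the Parallel GC:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Prints Old Generation utilization and cumulative GC activity every 10 seconds. */
public class MemoryWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // Pool name depends on the collector, e.g. "PS Old Gen" (Parallel GC)
                // or "G1 Old Gen" (G1).
                if (pool.getName().contains("Old Gen")) {
                    MemoryUsage u = pool.getUsage();
                    System.out.printf("%s: %d / %d MB used%n",
                            pool.getName(), u.getUsed() >> 20, u.getMax() >> 20);
                }
            }
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // Collection count and time are cumulative since JVM start; steadily
                // climbing GC time with little freed memory is the pattern we saw here.
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(10_000);
        }
    }
}
```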
Step 2: Seeing the ripple effect into the API Gateway
Having Dynatrace AppMon installed on all our services allows us
to fully leverage Dynatrace PurePath Technology. The following
screenshot of a PurePath shows how the API Gateway service is
trying to reach TIBCO BE. TIBCO BE, however, is struggling with Garbage
Collection suspension and therefore triggers the internal SLA, which
causes the API Gateway to throw an error back to the caller as well as
log some error messages. Having this end-to-end view available
makes it very easy to understand service-to-service interactions, how
SLAs and timeouts are handled, and what the ripple effect of a bad
service is on the initial caller!
The Dynatrace PurePath makes it easy to understand ripple effects of bad service-to-service call chains.
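To illustrate the mechanism (this is not the actual gateway code), the pattern boils down to a downstream call with an internal SLA budget that is converted into an error for the caller when TIBCO BE cannot answer in time. All class names, endpoint URLs and timeout values below are made up:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

/** Hypothetical gateway-side call guarded by an internal SLA timeout. */
public class InventoryGatewayCall {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    // Internal SLA budget for the BE rule engine call (illustrative value).
    private static final Duration SLA = Duration.ofSeconds(2);

    static String checkAvailability(String sku) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://tibco-be.internal/availability/" + sku)) // hypothetical endpoint
                .timeout(SLA)
                .build();
        try {
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        } catch (HttpTimeoutException e) {
            // BE is suspended in GC and cannot answer within the SLA: the gateway
            // logs the error and fails fast towards its own caller.
            throw new IllegalStateException("BE availability SLA exceeded for " + sku, e);
        } catch (Exception e) {
            throw new IllegalStateException("BE availability call failed for " + sku, e);
        }
    }
}
```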
Step 3: Observing the “precautionary” restarts
The automatic restarts that were put in place to prevent a service crash
can easily be observed by looking at the same Dynatrace Process
Health Dashboard over a longer time period. We can see the restarts
happening every 10-11 hours:
Observing the impact of the JVM restarts: memory goes down, but we also see that the JVM was obviously not available for a short period (alerts in GC Suspension).
Memory leak detection
After we had looked at all the memory metrics to observe the current
memory behavior and the impact of high memory usage, garbage
collection and the automated restarts, it was time to find the root cause of
the growing Old Generation heap space.
Memory snapshots
Dynatrace AppMon automatically captures so-called Total Memory Dumps
upon out-of-memory exceptions, so we were able to see all the objects that
were in memory when one of these crashes happened in the past:
Dynatrace AppMon can automatically take Memory Dumps upon Out of Memory Exceptions. The dump tells us how many objects per class are on the heap and how much memory they consume at the time the dump is taken.
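If you are not running AppMon, the JVM can produce a comparable dump on its own: start the process with -XX:+HeapDumpOnOutOfMemoryError (plus -XX:HeapDumpPath to pick the location), or trigger one on demand through the HotSpot diagnostic MXBean, as in this sketch:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Triggers an on-demand heap dump, comparable in content to a Total Memory Dump. */
public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // true = dump only live objects (forces a full GC before dumping).
        bean.dumpHeap("/tmp/be-heap.hprof", true);
    }
}
```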
A good practice is to focus on custom or framework classes, e.g.
be.gen.xx, instead of Java runtime classes such as String.
Why? Because these custom classes most likely hold references to lots of these
Strings and are therefore to blame for the high memory consumption.
Dynatrace AppMon trending memory snapshots
Besides the Total Memory Dumps it is recommended to take so-called
Trending Memory Dumps over a longer time period. These Trending
Dumps give insight into which objects are really growing over time and
are the main reason for a growing heap. In our case most of the objects
in the be.gen package were growing:
Comparing Trending Memory Snapshots in Dynatrace lets us identify which classes are constantly growing over time.
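The same trending idea can be approximated with plain JDK tooling: capture two class histograms some time apart with jmap -histo <pid> and diff the instance counts to spot classes that only ever grow. A rough sketch of that comparison (the growth threshold is an arbitrary illustrative value):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Diffs two "jmap -histo <pid>" outputs taken some time apart. */
public class HistogramDiff {
    public static void main(String[] args) throws Exception {
        Map<String, Long> before = parse(Path.of(args[0]));
        Map<String, Long> after = parse(Path.of(args[1]));
        after.forEach((className, count) -> {
            long growth = count - before.getOrDefault(className, 0L);
            if (growth > 10_000) { // report only classes that grew significantly
                System.out.printf("%,12d new instances of %s%n", growth, className);
            }
        });
    }

    // Histogram lines look like: "   1:    123456   7890123  java.lang.String"
    private static Map<String, Long> parse(Path file) throws Exception {
        Map<String, Long> counts = new HashMap<>();
        for (String line : Files.readAllLines(file)) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 4 && cols[0].endsWith(":")) {
                counts.put(cols[3], Long.parseLong(cols[1]));
            }
        }
        return counts;
    }
}
```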
Dynatrace AppMon reference tree analysis
Knowing which classes are growing allows us to go back to the
Total Memory Dump to see who is holding references to these objects
and is thereby blocking them from being garbage collected. Since be.gen.xx
is custom code that refers to com.tibco classes, we started our analysis
with these objects:
Dynatrace AppMon allows us to walk the reference tree to find out who is holding references to our be.gen.xxx classes.
We saw the same pattern with other be.gen instances as well, as shown
in the following screenshot, where we walked the reference tree
for a single be.gen instance all the way to the TIBCO
DefaultDistributedCacheBasedStore:
See full reference tree visibility for every object instance, including information on which fields are referencing these objects.
Analyzing the TIBCO cache object creation behavior
Based on the findings in the reference tree we checked how we were
using the DefaultDistributedCacheBasedStore. It turned out we were
creating a ConcurrentHashMap for each be.gen object and kept adding
them to the DefaultDistributedCacheBasedStore via the CacheConceptHandler.
These ConcurrentHashMap objects were never garbage collected and
therefore filled up the heap.
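We cannot show the TIBCO internals here, but the leak pattern itself is easy to reproduce in plain Java: a long-lived store holds a strong reference to one map per business object, and no code path ever removes the entries, so the maps survive every GC. The names below are illustrative, not the actual TIBCO classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative reproduction of the leak pattern; not actual TIBCO code. */
public class CacheStoreLeak {
    // Long-lived, static store: everything reachable from here survives every GC.
    private static final Map<String, ConcurrentHashMap<String, Object>> STORE =
            new ConcurrentHashMap<>();

    static void onEvent(String conceptId) {
        // One ConcurrentHashMap per business object, added on every event ...
        ConcurrentHashMap<String, Object> state = new ConcurrentHashMap<>();
        state.put("createdAt", System.currentTimeMillis());
        STORE.put(conceptId, state);
        // ... but no code path ever calls STORE.remove(conceptId), so the maps
        // accumulate in Old Generation until the heap is exhausted.
    }

    public static void main(String[] args) {
        for (long i = 0; ; i++) {
            onEvent("concept-" + i); // eventually dies with java.lang.OutOfMemoryError
        }
    }
}
```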
To identify where our code creates these objects we used the
Dynatrace Memory Sensor Rule feature for our be.gen.xx classes. This
feature allows us to selectively monitor the creation and lifetime of these
objects on the heap. The following screenshot shows the output of a so-called
Selective Memory Dump taken 1 hour and 55 minutes after a JVM restart:
Dynatrace Selective Memory Dumps give information about who created these leaking objects and their age.
After 1h 55min we see 1.3 million objects on the heap. None of the
objects created during that timeframe were ever collected by the GC,
which explains our memory leak.
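If you do not have memory sensors available but can touch the code, a crude stand-in for this analysis is to register every new instance with a weak reference and periodically count how many have actually been collected. This is only a sketch of the idea, not the AppMon mechanism:

```java
import java.lang.ref.WeakReference;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Crude lifetime tracking: weak references reveal whether instances ever get collected. */
public class CreationTracker {
    private static final Queue<WeakReference<Object>> TRACKED = new ConcurrentLinkedQueue<>();

    // Call from the constructor of the suspected class (e.g. the be.gen.xx types).
    static void track(Object instance) {
        TRACKED.add(new WeakReference<>(instance));
    }

    // Call periodically, e.g. from a scheduled task.
    static void report() {
        long collected = TRACKED.stream().filter(ref -> ref.get() == null).count();
        System.out.printf("created=%d, collected by GC=%d%n", TRACKED.size(), collected);
        // In a leak like ours, "collected" stays near zero while "created" keeps climbing.
    }
}
```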
Solution
Based on the information collected from our production environment
we did code walk-throughs with the developers and found many
places where a ConcurrentHashMap was not set to null. We also opened
support tickets with TIBCO Professional Services to understand why
concept.clear was not cleaning up objects even though our developers
expected it to. TIBCO Support suggested deleting the unused cache
concept instances and assigning the HashMaps to null even after using
the HashMap.clear API of TIBCO BE.
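In plain Java terms the fix amounts to removing the per-object maps from the long-lived store when they are no longer needed, instead of only clearing their contents. Again, the names are illustrative rather than the actual TIBCO API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative fix: release the map AND the store's reference to it. */
public class CacheStoreFix {
    private static final Map<String, ConcurrentHashMap<String, Object>> STORE =
            new ConcurrentHashMap<>();

    static void onConceptDeleted(String conceptId) {
        // remove() is the crucial part: clear() only empties the map, while
        // remove() makes the map itself unreachable so the GC can reclaim it.
        ConcurrentHashMap<String, Object> state = STORE.remove(conceptId);
        if (state != null) {
            state.clear();
        }
    }
}
```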
After the code changes were functionally tested, we decided to apply the fix on one of
the production servers and take that server out of the automatic JVM recycle process
we had in place.
We waited 48 hours to see whether the server remained healthy or still suffered from the
increasing memory behavior. We were delighted to see on the Dynatrace
Process Health Dashboard that our cached objects were correctly promoted to
Old Generation over the course of several hours. When Old Generation usage reached a
critical point, GC kicked in and correctly cleared out the memory without running into an
Out of Memory Exception:
Dynatrace Digital Performance Platform: it’s digital business…transformed.
Successfully improve your user experiences, launch new initiatives with confidence,
reduce operational complexity and go to market faster than your competition. With the
world’s most complete, powerful and flexible digital performance platform for today’s
digital enterprises, Dynatrace has you covered.
Dynatrace is the innovator behind the industry’s premier Digital Performance Platform, making real-time information about digital performance visible and actionable for everyone across business and IT. We help customers of all sizes see their applications and digital channels through the lens of their end users. Over 8,000 organizations use these insights to master complexity, gain operational agility and grow revenue by delivering amazing customer experiences.