AN EXCERPT FROM ABOUT:PERFORMANCE BLOG

TIBCO BusinessEvents® memory leak analysis in live production

As a performance architect I get called into various production performance issues. One of our recent production issues happened with a TIBCO BusinessEvents® (BE) service constantly violating our Service Level Agreements (SLAs) after running for 10-15 hours since the last restart. If we kept the services running longer, we would see them crash with an “out of memory” exception. This is the typical sign of a classic memory leak!

In this blog I’ll walk through the steps taken to analyze this issue in our production environment, which tools were used, some background on Java Garbage Collection, and how the problem was resolved. I hope you find it useful!

TIBCO memory leak in our production environment

Performance engineering is the science of discovering problem areas in applications under varying but realistic load conditions. It is not always easy to simulate real traffic and find all problems before going live. It is therefore advisable to determine how to analyze performance problems not only in test but also in a real production environment. Having the right tools installed in production allows us to analyze issues and find root causes that are hard to simulate in testing.

The following diagram visualizes our service layout. TIBCO BE is used as an inventory availability rule engine, TIBCO BusinessWorks (BW) is used as a data access service, and TIBCO ActiveSpaces is used as an in-memory data grid.

In order to avoid any SLA breaches, the operations team implemented a very typical, yet not very elegant, damage-prevention measure: restart the services every 10-11 hours, before they became too slow and eventually crashed with an out-of-memory exception.

My team was tasked with finding the root cause of the increasing memory usage, which was believed to also slow down the application before the actual crash happened. Our goal was to fix the memory issue once and for all.

About the Author

Aftab Alam is a performance engineering SME who has worked with various clients of Infosys Limited since 2003. His focus areas are optimizing performance test life cycles, helping customers monitor and manage software performance at various stages of the SDLC, and mentoring and enabling resources in performance engineering and APM solutions such as Dynatrace.

Subscribe to the About:Performance blog here: http://apmblog.dynatrace.com/


Memory monitoring with Dynatrace AppMon

Our tool of choice was Dynatrace AppMon, which we were already using for live performance monitoring of all our services in production. Let me walk you through the steps we took in Dynatrace to identify the memory leak and its root cause. In case you want to try it on your own, I suggest you:

1) Obtain your own Dynatrace Personal License
2) Watch the Dynatrace YouTube Tutorials
3) Read up on Java Memory Management

Step 1: Basic memory metrics monitoring

In the Dynatrace AppMon Client you start with Monitoring > Infrastructure and then select your host server. We narrowed down the problem by analyzing the key Java process and memory metrics, which are automatically monitored by Dynatrace AppMon and pulled natively through the JVMTI interface. Here is what we found:

• No available free memory in the Old Generation heap space
• As a result, major Garbage Collection (GC) constantly kicks in to try to make memory available for process execution
• High GC cycle times resulting in runtime suspensions
• High suspension times resulting in slow request processing
• The slowdown resulted in SLAs being missed and errors thrown by the API gateway (no response from BE)

[Screenshot: Easy to identify cause and impact by looking at the Dynatrace Process Health Dashboard and all key health indicators: Memory, Garbage Collection, CPU, Threads and Passing Transactions.]
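If you do not have an APM tool at hand, you can sanity-check the same raw numbers with the JDK's standard management beans, which expose per-pool usage and cumulative GC counters. This is only a rough, JMX-based stand-in for what AppMon collects through JVMTI, and the class name below is my own:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Prints current usage of every memory pool (including the Old Generation)
// plus cumulative GC counts and times, roughly the raw numbers behind the
// metrics discussed in this step.
public class MemoryMetricsSnapshot {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            System.out.printf("%-30s used=%,d max=%,d%n",
                    pool.getName(), usage.getUsed(), usage.getMax());
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-30s collections=%d totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}

Run this inside the JVM you care about (or read the same beans remotely over JMX) and watch whether the Old Generation number keeps climbing between major collections.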

Step 2: Seeing the ripple effect into the API Gateway

Having Dynatrace AppMon installed on all our services allows us to fully leverage Dynatrace PurePath Technology. The following screenshot of a PurePath shows how the API Gateway service is trying to reach TIBCO BE. TIBCO BE, however, is struggling with Garbage Collection suspension and therefore triggers the internal SLA, which causes the API Gateway to throw an error back to the caller as well as log some error messages. Having this end-to-end view available makes it very easy to understand service-to-service interactions, how SLAs and timeouts are handled, and what the ripple effect of a bad service is on the initial caller!

[Screenshot: The Dynatrace PurePath makes it easy to understand ripple effects of bad service-to-service call chains.]

Step 3: Observing the precautionary restarts

The automatic restarts that were put in place to prevent a service crash can easily be observed by looking at the same Dynatrace Process Health Dashboard over a longer time period. We can see the restarts happening every 10-11 hours:

[Screenshot: Observing the impact of the JVM restarts. Memory goes down, but we also see that the JVM was obviously not available for a short period (alerts in GC Suspension).]

Memory leak detection

After looking at all the memory metrics to observe the current memory behavior and the impact of high memory usage, garbage collection and the automated restarts, it is time to find the root cause of the growing Old Generation heap space.

Memory snapshots

Dynatrace AppMon automatically captures so-called Total Memory Dumps when an out-of-memory exception occurs, so we were able to see all the objects that were in memory when one of these crashes happened in the past:

[Screenshot: Dynatrace AppMon can automatically take Memory Dumps upon Out of Memory Exceptions. The dump tells us how many objects per class are on the heap and how much memory they consume at the time the dump is taken.]

[Screenshot: be.gen.xx is custom code that refers to com.tibco; we started our analysis on these objects.]

A good practice is to focus on custom or framework classes, e.g. be.gen.xx, instead of Java runtime classes such as String. Why? Because these custom classes most likely reference lots of those Strings and are therefore to blame for the high memory consumption.
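If you are not running AppMon, a comparable full heap dump can be taken with the JDK's HotSpotDiagnosticMXBean (HotSpot JVMs only), or written automatically on a crash by starting the JVM with -XX:+HeapDumpOnOutOfMemoryError. A minimal sketch, not the mechanism Dynatrace uses:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Writes a full heap dump of all live objects to the given .hprof file,
// similar in spirit to a Total Memory Dump.
public class HeapDumper {
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, true); // true = include only objects that are still reachable
    }

    public static void main(String[] args) throws Exception {
        dump("be-heap-" + System.currentTimeMillis() + ".hprof");
    }
}

The resulting .hprof file can be opened in any heap analyzer to get the same kind of per-class object counts and sizes.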


Dynatrace AppMon trending memory snapshots

Besides the Total Memory Dumps, it is recommended to take so-called Trending Memory Dumps over a longer time period. These Trending Dumps give insight into which objects are really growing over time and are the main reason for a growing heap. In our case most of the objects in the be.gen package are growing:

[Screenshot: Comparing Trending Memory Snapshots in Dynatrace lets us identify which classes are constantly growing over time.]
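The same trend can be approximated in plain Java by recording how much of the old generation is still occupied right after each collection; a steadily rising baseline is the classic leak signature. A small sketch, assuming a standard HotSpot pool name:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Logs how much of the Old Generation remains occupied after garbage
// collection, once per minute. If this baseline climbs steadily, live
// objects are accumulating.
public class OldGenTrend {
    public static void main(String[] args) throws InterruptedException {
        MemoryPoolMXBean oldGen = ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(p -> p.getName().contains("Old Gen") || p.getName().contains("Tenured"))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no old generation pool found"));

        while (true) {
            MemoryUsage afterGc = oldGen.getCollectionUsage(); // usage right after the last GC of this pool
            if (afterGc != null) {
                System.out.printf("%tT old gen after GC: %,d bytes%n",
                        System.currentTimeMillis(), afterGc.getUsed());
            }
            Thread.sleep(60_000);
        }
    }
}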

Dynatrace AppMon reference tree analysis

Knowing which classes are growing allows us to go back to the Total Memory Dump and see who is holding references to these objects, thereby blocking them from being garbage collected:

[Screenshot: Dynatrace AppMon allows us to walk the reference tree to find out who is holding references to our be.gen.xxx classes.]

We see the same pattern with other be.gen instances, as shown in the following screenshot, where we walked the reference tree for a single be.gen instance all the way up to the TIBCO DefaultDistributedCacheBasedStore:

[Screenshot: See full reference tree visibility for every object instance, including information on which fields are referencing these objects.]
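The reasoning behind the reference tree is simple: any object that is still reachable from a GC root cannot be collected, no matter how large it is. The following simplified model (illustrative names only, not TIBCO's actual classes) shows how a single long-lived store keeps every per-event map, and everything inside it, alive:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// As long as the long-lived store holds a reference chain to each per-event
// map, neither the maps nor their contents can ever be garbage collected.
// Walking the reference tree from any such map leads straight back to the
// store, which is what the AppMon reference tree showed for be.gen instances.
public class ReferenceChainDemo {
    // Stand-in for the cache store the reference tree ended at.
    static final Map<String, Map<String, Object>> STORE = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        for (int i = 0; i < 1_000; i++) {
            Map<String, Object> perEventState = new ConcurrentHashMap<>();
            perEventState.put("payload", new byte[10_000]);
            STORE.put("event-" + i, perEventState); // store -> map -> payload: all stay reachable
        }
        System.gc(); // only a hint; the reachable objects above survive any collection
        System.out.println("maps still referenced by the store: " + STORE.size());
    }
}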

Analyzing the TIBCO cache object creation behavior

Based on the findings in the reference tree, we checked how we are using the DefaultDistributedCacheBasedStore. It turns out we are creating a ConcurrentHashMap for each be.gen object and keep adding it to the DefaultDistributedCacheBasedStore via the CacheConceptHandler. These ConcurrentHashMap objects are never garbage collected and therefore fill up the heap.

To identify where our code is creating these objects, we used the Dynatrace Memory Sensor Rule feature for our be.gen.xx classes. This feature allows us to selectively monitor the creation and lifetime of these objects on the heap. The following screenshot shows the output of a so-called Selective Memory Dump taken 1 hour 55 minutes after a JVM restart:

[Screenshot: Dynatrace Selective Memory Dumps give information about who created these leaking objects and their age.]

After 1 hour 55 minutes we see 1.3 million objects on the heap. None of the objects created during that timeframe were ever collected by the GC, which explains our memory leak.
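Dynatrace gathers this creation and age information through instrumentation. A rough, hand-rolled approximation is to have the suspect class register each new instance in a weak-reference registry and periodically report how many survive and for how long; a sketch with hypothetical names:

import java.lang.ref.WeakReference;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Records when each tracked object was created, then reports how many are
// still alive and the age of the oldest one. Weak references let the GC
// reclaim tracked objects normally; entries whose referent is gone are pruned.
public class InstanceTracker {
    private static final Map<WeakReference<Object>, Instant> LIVE = new ConcurrentHashMap<>();

    public static void register(Object o) {        // call this from the tracked class's constructor
        LIVE.put(new WeakReference<>(o), Instant.now());
    }

    public static void report() {
        Instant now = Instant.now();
        long alive = 0;
        Duration oldest = Duration.ZERO;
        for (Map.Entry<WeakReference<Object>, Instant> e : LIVE.entrySet()) {
            if (e.getKey().get() == null) {
                LIVE.remove(e.getKey());           // the object was garbage collected
            } else {
                alive++;
                Duration age = Duration.between(e.getValue(), now);
                if (age.compareTo(oldest) > 0) oldest = age;
            }
        }
        System.out.println("alive=" + alive + ", oldest=" + oldest);
    }
}

In a leaking process the alive count only ever goes up, exactly like the 1.3 million be.gen objects above.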

Solution

Based on the information collected from our production environment, we did code walk-throughs with the developers and found many places where a ConcurrentHashMap was not set to null. We also opened support tickets with TIBCO Professional Services to understand why concept.clear was not cleaning up objects even though our developers expected it to. TIBCO Support suggested deleting the unused cache concept instances and assigning the HashMaps to null even after using the HashMap.clear API of TIBCO BE.
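In code, the cleanup we ended up applying follows a simple pattern (class and store names below are illustrative, not the actual TIBCO API): detach the per-event map from the long-lived store, clear it, and null out the field that holds it so no reference chain keeps the ConcurrentHashMap alive.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the cleanup pattern: once an event is finished, remove its entry
// from the long-lived store, clear the per-event map, and drop the field
// reference so the map becomes unreachable and collectable.
public class EventState {
    private static final Map<String, EventState> STORE = new ConcurrentHashMap<>();

    private final String id;
    private Map<String, Object> attributes = new ConcurrentHashMap<>();

    EventState(String id) {
        this.id = id;
        STORE.put(id, this);
    }

    void dispose() {
        STORE.remove(id);         // detach from the long-lived store
        if (attributes != null) {
            attributes.clear();   // clear() alone is not enough while a field still points at the map
            attributes = null;    // drop the last reference, mirroring the advice from TIBCO Support
        }
    }
}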


After the code changes were functionally tested, we decided to apply the fix on one of the production servers and take that server out of the automatic JVM recycle process we had in place.

We waited 48 hours to see whether the server remained healthy or still showed the increasing memory behavior. We were delighted when we watched the Dynatrace Process Health Dashboard and saw that our cached objects were correctly promoted to Old Generation over the course of several hours. When Old Generation reached a critical point, GC kicked in and correctly cleared out all the memory without running into an Out Of Memory Exception.
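As an extra safety net alongside the dashboard, the JDK can also notify you when the old generation stays above a threshold even after a collection, which on the patched server should never happen. A sketch using the standard memory notification API:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;

// Sets a collection-usage threshold at 90% of the old generation and prints a
// warning whenever the pool is still above it right after a GC.
public class OldGenWatch {
    public static void main(String[] args) throws InterruptedException {
        MemoryPoolMXBean oldGen = ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(p -> p.isCollectionUsageThresholdSupported()
                        && (p.getName().contains("Old Gen") || p.getName().contains("Tenured")))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no old generation pool found"));

        long max = oldGen.getUsage().getMax();
        if (max < 0) {
            max = oldGen.getUsage().getCommitted(); // max undefined for this pool; fall back to committed
        }
        oldGen.setCollectionUsageThreshold((long) (max * 0.9));

        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED.equals(notification.getType())) {
                MemoryNotificationInfo info =
                        MemoryNotificationInfo.from((CompositeData) notification.getUserData());
                System.err.println("Old generation still at " + info.getUsage().getUsed()
                        + " bytes after GC in pool " + info.getPoolName());
            }
        }, null, null);

        Thread.currentThread().join(); // keep the watcher alive
    }
}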

Dynatrace Digital Performance Platform: it's digital business…transformed.

Successfully improve your user experiences, launch new initiatives with confidence, reduce operational complexity and go to market faster than your competition. With the world's most complete, powerful and flexible digital performance platform for today's digital enterprises, Dynatrace has you covered.

Dynatrace is the innovator behind the industry’s premier Digital Performance Platform, making real-time information about digital performance visible and actionable for everyone across business and IT. We help customers of all sizes see their applications and digital channels through the lens of their end users. Over 8,000 organizations use these insights to master complexity, gain operational agility and grow revenue by delivering amazing customer experiences.


