Troubleshooting Core Dumps

Date post: 02-Jun-2018
Upload: aravindant11

  • 8/11/2019 Troubleshooting Core Dumps

    Postings may contain unverified user-created content and change frequently. The content is provided as-is and is not warrantied by Cisco.

    Troubleshooting Core Dumps

    Core dumps occur when a Linux process experiences a fault. This results in an outage of the affected process or service, which must restart to recover. During these incidents, the server may remain up, but certain services may experience a brief outage. This document covers troubleshooting core dumps and backtraces that may occur on Communications Manager (CM or CUCM), Unity Connection (UC), Cisco Emergency Responder (CER), Cisco Unified Presence Server (CUPS), Cisco Unified Contact Center Express (UCCX or IPCC Express), or any product based on Cisco's Voice Operating System (VOS) appliance model.

    Identifying Core Dump Events
    Listing Core Dump Files
    Performing Core Analysis
    Understanding the Backtrace of a Core File
    Example 1: File Size Limit Exceeded
    Example 2: Core when memory leak reaches maximum process memory size
    Example 3: Core Stack Corruption
    Cisco Bug Toolkit Search
    Troubleshooting Intentional Aborts
    Useful Information for Creating TAC Service Requests

    Identifying Core Dump Events

    The two most common ways to identify the occurrence of a core dump in CallManager are the following:

    CoreDumpFileFound RTMT alert messages found in Alert Central

    [Screenshot: core_alert_central.JPG]

    Within RTMT Alert Central, right-click the alert to see more detail on the specific application that generated the core. An example of the core dump alert details is shown below:

    [Screenshot: core_alert_central_details.JPG]

    Application Event log alert messages indicating a core dump has occurred:

    May 15 05:32:09 ccm-pub local7 2 : 0: May 15 09:32:08.865 UTC : %CCM_LPM-LPMTCT-2-CoreDumpFileFound: The new core dump file(s) have been found in the system. TotalCoresFound:1 CoreDetails:The following lists up to 6 cores dumped by corresponding applications. Core1:Cisco CallManager (core.10499.6.ccm.1273915815) App ID:Cisco Log Partition Monitoring Tool Cluster ID: Node ID: ccm-pub

    May 15 05:32:15 ccm-pub local7 2 : 138: May 15 09:32:15.231 UTC : %CCM_RTMT-RTMT-2-RTMT-ERROR-ALERT: RTMT Alert Name:CoreDumpFileFound Detail:CoreDumpFileFound TotalCoresFound : 1 CoreDetails: The following lists up to 6 cores dumped by corresponding applications. Core1 : Cisco CallManager(core.10499.6.ccm.1273915815) AppID : Cisco Log Partition Monitoring Tool ClusterID : NodeID : ccm-pub . The alarm is generated on Sat May 15 05:32:08 EDT 2010. AppID:Cisco AMC Service Cluster ID: Node ID:ccm-pub
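
    The core file name embedded in these alerts can be pulled out programmatically. The sketch below is a minimal illustration, not a Cisco tool. It assumes the name follows the layout core.PID.SIGNAL.SERVICE.EPOCH; that layout is inferred from the sample above (the last field, read as a Unix timestamp, falls about two minutes before the alert time) and is not stated in the document. The function name parse_core_names is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed layout (inferred from the sample alert, not documented here):
#   core.<pid>.<signal number>.<service>.<unix epoch seconds>
CORE_NAME = re.compile(r"core\.(\d+)\.(\d+)\.([A-Za-z0-9_-]+)\.(\d+)")


def parse_core_names(text):
    """Return one dict per core file name found in an alert or log text."""
    cores = []
    for match in CORE_NAME.finditer(text):
        pid, signal_number, service, epoch = match.groups()
        cores.append({
            "file": match.group(0),
            "pid": int(pid),
            "signal": int(signal_number),
            "service": service,
            # Interpret the last field as seconds since the Unix epoch (UTC).
            "created_utc": datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        })
    return cores


if __name__ == "__main__":
    alert = "Core1:Cisco CallManager (core.10499.6.ccm.1273915815) App ID:Cisco Log Partition Monitoring Tool"
    for core in parse_core_names(alert):
        print(core["file"], core["service"], core["created_utc"].isoformat())
```

    Extracting the service name and timestamp this way makes it easier to line up a core file with the service traces for the same period.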

    Listing Core Dump Files

    On the CallManager server in question, a list of core dumps can be obtained by issuing the

    following command:

    utils core list (CallManager version 5.x, 6.x)

    utils core active list (CallManager version 7.x and later)

    An example of 'utils core list' is provided below, where we observe the CCM service as the

    core dump generator:

    [Screenshot: core_list.JPG]

    The output of 'utils core active list' is similar to the example above; the only difference is the added "active" parameter. This parameter was introduced in later CallManager releases so that core files can also be listed from the CM inactive partition (the previous CM version on the system, if an upgrade has taken place) without a version switch and reboot. To list core files on the inactive partition, supply "inactive" instead of "active": 'utils core inactive list'.

    An example of 'utils core active list' is provided below:

    [Screenshot: core_active_list.JPG]

    Performing Core Analysis

    Once the core dump has been identified with the list command, the next step is to obtain the core file backtrace for review. This function is provided by the following command:

    utils core analyze (CallManager version 5.x, 6.x)

    utils core active analyze (CallManager version 7.x and later)

    An example of the 'utils core analyze' command is provided below, supplying a ccm service core file that was generated on 11/30/2009 at 11:11:50:

    [Screenshot: core_analyze.JPG]

    As with 'utils core active list', core file analysis can also be performed on the inactive partition with the 'utils core inactive analyze' command. This feature is available in CallManager 7.x and later. A screenshot of the 'utils core active analyze' command is provided below:

    [Screenshot: core_active_analyze.JPG]

    In both examples, a warning is provided stating that this procedure will take a considerable

    amount of I/O and may impact system performance. During the analysis process, the raw

    core file is parsed and interpreted into a backtrace output that can be used to identify the

    cause of the core dump.

    The analysis normally completes in a minute or less. The warning about impact to system performance is a suggestion to run this command during a non-peak time period to avoid potential resource contention.
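
    The version-dependent command forms described in the last two sections can be summarized in a small helper. This is a sketch for illustration only: the function name core_command is hypothetical, and placing the core file name as the final argument of the analyze command is an assumption based on the description above, not a documented syntax.

```python
def core_command(major_version, action, partition="active", core_file=None):
    """Build the CLI string for listing or analyzing core files.

    major_version: CallManager major release (5, 6, 7, ...).
    action: "list" or "analyze".
    partition: "active" or "inactive" (only meaningful on 7.x and later).
    core_file: core file name, required for "analyze" (position is assumed).
    """
    if action not in ("list", "analyze"):
        raise ValueError("action must be 'list' or 'analyze'")
    if partition not in ("active", "inactive"):
        raise ValueError("partition must be 'active' or 'inactive'")

    if major_version >= 7:
        # 7.x and later take the partition keyword.
        parts = ["utils", "core", partition, action]
    else:
        # 5.x and 6.x have no partition keyword.
        if partition != "active":
            raise ValueError("inactive partition commands need 7.x or later")
        parts = ["utils", "core", action]

    if action == "analyze":
        if core_file is None:
            raise ValueError("analyze needs a core file name")
        parts.append(core_file)
    return " ".join(parts)


if __name__ == "__main__":
    print(core_command(6, "list"))
    print(core_command(7, "analyze", core_file="core.10499.6.ccm.1273915815"))
```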

    Understanding the Backtrace of a Core File

    The chief component of the core analysis process is retrieving the backtrace for review. Once the analysis command has been executed, a section titled "Backtrace" is displayed on the command line, similar to the screenshot below:

    [Screenshot: core_backtrace.png]

    The core backtrace output consists of a series of stack frames, numbered #0, #1, #2, etc. Each frame is a function call that was on the call stack at the time of the service fault, with #0 being the most recent call. In many cases, these backtrace signatures are a unique fingerprint that can identify a particular known or new defect in CallManager.
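
    To make the frame structure concrete, the sketch below splits backtrace text into frame number, address and function name. It is an illustration under stated assumptions: each frame starts on its own line with "#N", and wrapped continuation lines are ignored. The names Frame and parse_frames are hypothetical, and the sample lines are taken from the Example 2 backtrace later in this document.

```python
import re
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    index: int              # frame number, 0 is the most recent call
    address: Optional[str]  # program counter, e.g. "0x0050b86c"
    function: str           # function name, "??" when unresolved


# "#7 0x0050b86c in operator new () from ..." -> (7, "0x0050b86c", "operator new")
FRAME_LINE = re.compile(r"^#(\d+)\s+(?:(0x[0-9a-fA-F]+)\s+in\s+)?(.+?)\s*\(")


def parse_frames(backtrace: str) -> List[Frame]:
    """Return the frames found in a gdb-style backtrace, in listed order."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_LINE.match(line.strip())
        if match:
            index, address, function = match.groups()
            frames.append(Frame(int(index), address, function))
    return frames


if __name__ == "__main__":
    sample = (
        "#0 0x00a157a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2\n"
        "#7 0x0050b86c in operator new () from /usr/local/cm/lib/libstlport.so.5.1\n"
        "#8 0x0a06bb2d in SdlProcessBase::operator new (size=102700) at SdlProcessBase.cpp:105\n"
    )
    for frame in parse_frames(sample):
        print(frame.index, frame.function)
```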

    Example 1: File Size Limit Exceeded

    Core was generated by `/usr/local/cm/bin/ccm'.
    Program terminated with signal 25, File size limit exceeded.
    #0 0x006

    In this example, the process was attempting to write to a file. The write attempt generated an exception, which produced a core file. The cause, "Program terminated with signal 25, File size limit exceeded.", is a direct match to:

    CSCsu94937 Multiple services core dumping with signal 25, File size limit exceeded.
    http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCsu94937

    Example 2: Core when memory leak reaches maximum process memory size

    Memory leak in CCM process, resulting in intentional abort.

    #0 0x00a157a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
    #1 0x01276825 in raise () from /lib/tls/libc.so.6
    #2 0x01278289 in abort () from /lib/tls/libc.so.6
    #3 0x0050d58b in __gnu_cxx::__verbose_terminate_handler () from /usr/local/cm/lib/libstlport.so.5.1
    #4 0x0050b2a1 in __cxxabiv1::__terminate () from /usr/local/cm/lib/libstlport.so.5.1
    #5 0x0050b2d6 in std::terminate () from /usr/local/cm/lib/libstlport.so.5.1
    #6 0x0050b41f in __cxa_throw () from /usr/local/cm/lib/libstlport.so.5.1
    #7 0x0050b86c in operator new () from /usr/local/cm/lib/libstlport.so.5.1
    #8 0x0a06bb2d in SdlProcessBase::operator new (size=102700) at SdlProcessBase.cpp:105
    #9 0x0a0014e2 in H245SessionManager::create (parentId={mSdlProcessName = 0x0, mSdlNodeId = 4, mSdlAppId = 100, mSdlProcessNumber = 150, mSdlProcessInstance = 2629}, vH245TerminalType=H245_Gateway, vH245TransportConnectionMode=H245Client, vH245IpAddress=404699044, vH245IpPort=40076, vTCPTos=96, vPassThruMSD=false, vTCSTimeout=10, vFastStartInd=0, vFsAudioOutgoingLCN=0, vFsAudioIncomingLCN=0, pktCaptureContext=0xbffab74d "", allowTCPKeepAlivesForH323=true) at ProcessH245SessionManager.cpp:221
    #10 0x08a5629c in H245Interface::start_Transition (this=0xbff99008, s=@0x5c70990) at /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:123
    #11 0x08a99354 in H245Interface::fireSignal (this=0xbff99008, sdlSignal=@0x5c70990) at /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:175
    #12 0x0a06c904 in SdlProcessBase::inputSignal (this=0xbff99008, rSignal=0x5c70990, traceType=SdlSystemLog::SignalRouterThread, highPriority=0, normalPriority=0, lowPriority=0, veryLowPriority=0, lazyPriority=0, dbUpdatePriority=0) at SdlProcessBase.cpp:397
    #13 0x0a0746ce in SdlRouter::callProcess (this=0xe225ac0, _sdlSignal=0x5c70990, _deleteSignal=@0x36b8d07, _traceType=SdlSystemLog::SignalRouterThread, _hp=0, _np=0, _lp=0, _vlp=0, _lzp=0, _dbp=0) at SdlRouter.cpp:371
    #14 0x0a0740f3 in SdlRouter::scheduler (sdlRouter=0xe225ac0) at SdlRouter.cpp:281
    #15 0x05514bd7 in ACE_OS_Thread_Adapter::invoke (this=0xfe57a30) at OS_Thread_Adapter.cpp:94
    #16 0x054d5087 in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
    #17 0x00db73cc in start_thread () from /lib/tls/libpthread.so.0
    #18 0x0131a96e in clone () from /lib/tls/libc.so.6

    In this example, the CCM process cores because of a memory leak and the resulting resource exhaustion. Backtraces that include calls to "operator new" are typically the result of a memory leak: the process has requested the maximum amount of memory allowed by the operating system, so a core is forced. Here, frames #2 through #7 show the failed allocation in "operator new" throwing and the process aborting. The core alone cannot identify the specific leak; it can only show that a memory leak occurred. Other methods must be used to identify the source of the leak. Frequently this is possible by parsing SDL traces to identify objects that are "Started" or "Created" and not subsequently "Stopped". From traces, the above core was eventually traced back to:

    CSCte50152 Memory Leak in CCM due to Transient SIP Connections.
    http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCte50152

    Example 3: Core Stack Corruption

    Memory corruption results in a corrupted stack with "??" characters in place of function calls.

    #0 0x4e52500a in ?? ()
    #1 0xaffb3070 in ?? ()
    #2 0xaffb9084 in ?? ()
    #3 0x030dc678 in ?? ()
    #4 0x00000000 in ?

    In this example, a memory corruption incident had occurred that resulted in the stack being overwritten. In place of function calls, we observe "??" characters. Unfortunately, a search against this backtrace alone will not correlate to a known defect. It is recommended that the corresponding service logs (e.g. ccm traces, tomcat logs) and the complete core file be retrieved from the affected system for TAC review.
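
    The patterns from the examples can be expressed as a rough first-pass triage. The sketch below is a heuristic only, not a diagnostic: the function name classify_backtrace and the category labels are hypothetical, and the rules simply restate what this document describes ("operator new" together with abort suggests memory exhaustion, frames that are all "??" suggest stack corruption, and an IntentionalAbort frame marks an intentional abort).

```python
import re

# Captures the function name of each frame that starts a line with "#N".
FUNCTION = re.compile(
    r"^\s*#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.+?)\s*\(", re.MULTILINE
)


def classify_backtrace(backtrace):
    """Return a coarse label for a gdb-style backtrace (heuristic only)."""
    functions = FUNCTION.findall(backtrace)
    if not functions:
        return "no frames found"
    if "IntentionalAbort" in functions:
        return "intentional abort"
    if all(name == "??" for name in functions):
        return "possible stack corruption"
    # Compare on the unqualified name so Class::operator new also counts.
    short_names = [name.split("::")[-1] for name in functions]
    if "abort" in short_names and "operator new" in short_names:
        return "possible memory exhaustion"
    return "unclassified"


if __name__ == "__main__":
    corrupted = "#0 0x4e52500a in ?? ()\n#1 0xaffb3070 in ?? ()\n"
    print(classify_backtrace(corrupted))
```

    A label from this kind of check is only a starting point; the service traces and Perfmon data are still needed to confirm the cause.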

    Cisco Bug Toolkit Search

    Once a backtrace has been retrieved for the core dump event, the next step is to search the

    Bug Toolkit for potential known defects. The following defect will be used for this example:

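    CSCta39769 UnicastBridgeControl Causes CUCM to Crash
    http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCta39769&from=summary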

    #0 0x097a0850 in UnicastBridgeControl::removeConfResources (this=0x6a69f698) at /vob/ccm/Common/Include/CallManager/TDCLCpShares.hpp:2622
    #1 0x097ab5ae in UnicastBridgeControl::star_StationClose (this=0x6a69f698, s=@0x6a981938) at ProcessUnicastBridgeControl.cpp:2193
    #2 0x097bff64 in UnicastBridgeControl::fireSignal (this=0x6a69f698, sdlSignal=@0x6a981938) at /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:174
    #3 0x09e4ae58 in SdlProcessBase::inputSignal (this=0x6a69f698, rSignal=0x6a981938, traceType=SdlSystemLog::SignalRouterThread, highPriority=0, normalPriority=0, lowPriority=0, veryLowPriority=0, lazyPriority=0, dbUpdatePriority=0) at SdlProcessBase.cpp:396
    #4 0x09e52c1a in SdlRouter::callProcess (this=0xde9bcc8, _sdlSignal=0x6a981938, _deleteSignal=@0x324bd97, _traceType=SdlSystemLog::SignalRouterThread, _hp=0, _np=0, _lp=0, _vlp=0, _lzp=0, _dbp=0) at SdlRouter.cpp:372
    #5 0x09e5263f in SdlRouter::scheduler (sdlRouter=0xde9bcc8) at SdlRouter.cpp:282
    #6 0x00a00ef3 in ACE_OS_Thread_Adapter::invoke (this=0x10b70b90) at OS_Thread_Adapter.cpp:94
    #7 0x009c1abf in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
    #8 0x003bf371 in start_thread () from /lib/tls/libpthread.so.0
    #9 0x01339ffe in clone () from /lib/tls/libc.so.6

    The following line will be used to perform initial searching in the Bug Toolkit:

    #2 0x097bff64 in UnicastBridgeControl::fireSignal (this=0x6a69f698, sdlSignal=@0x6a981938) at /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:174

    [Screenshot: bug_toolkit_search.JPG]

    In the search example screenshot, the unique memory location identifiers have been removed from the search statement so that matches can be found. If the search returns no matches, it may also be necessary to refine the search criteria to a specific CUCM version, as shown below:

    [Screenshot: bug_toolkit_modify_search.JPG]
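
    Removing the memory addresses from a frame can be automated. The sketch below shows one plausible normalization, offered as an assumption rather than the exact search string used in the screenshot: it drops the frame number, the program counter and the whole argument list (which is where the pointer values appear), keeping the function name and its source location. The function name search_signature is hypothetical.

```python
import re

# Groups: 1 = function name, 2 = argument list, 3 = "at file:line" or "from library"
FRAME = re.compile(
    r"^\s*#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?"
    r"(.+?)\s*\((.*)\)\s*"
    r"((?:at|from)\s+\S+)?\s*$"
)


def search_signature(frame_line):
    """Reduce one backtrace frame to text without memory addresses."""
    match = FRAME.match(frame_line)
    if match is None:
        raise ValueError("not a recognizable backtrace frame")
    function, _arguments, location = match.groups()
    parts = [function]
    if location:
        parts.append(location)
    return " ".join(parts)


if __name__ == "__main__":
    line = (
        "#2 0x097bff64 in UnicastBridgeControl::fireSignal(this=0x6a69f698, "
        "sdlSignal=@0x6a981938) at /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:174"
    )
    print(search_signature(line))
```

    Because source line numbers can differ between releases, searching on the function name alone may be a useful fallback when the full signature returns nothing.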

    Software version 7.0 was selected in the modified search above to narrow down to a specific

    subset of defects applicable to CUCM. With the search re-submitted for defects related to

    version 7.0, the following results are displayed:

    [Screenshot: bug_toolkit_results.JPG]

    Troubleshooting Intentional Aborts

    Core dumps that include the "IntentionalAbort" statement indicate a system resource issue

    that was responsible for the service fault. The following ccm service core dump backtrace

    example will be used to demonstrate steps involved in troubleshooting intentional aborts:

    #0 0x001627a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
    #1 0x00d64815 in raise () from /lib/tls/libc.so.6
    #2 0x00d66279 in abort () from /lib/tls/libc.so.6
    #3 0x084c4e7a in preabort () at ProcessCMProcMon.cpp:101
    #4 0x084c4e92 in IntentionalAbort (reason=0xa9fdbdc "CallManager's timers appear incorrect. This may be due to CPU or blocked function. Attempting to restart CallManager.") at ProcessCMProcMon.cpp:106
    #5 0x084c66c3 in CMProcMon::verifySdlTimerServices () at ProcessCMProcMon.cpp:843
    #6 0x084c7035 in CMProcMon::callManagerMonitorThread (cmProcMon=0xec122d0) at ProcessCMProcMon.cpp:439
    #7 0x0107e5fb in ACE_OS_Thread_Adapter::invoke (this=0xf3ef3b8) at OS_Thread_Adapter.cpp:94
    #8 0x01040cbf in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
    #9 0x002dc3cc in start_thread () from /lib/tls/libpthread.so.0
    #10 0x00e061ae in clone () from /lib/tls/libc.so.6

    The RIS Data Collector Perfmonlog information should be retrieved via RTMT from the CUCM node that experienced the core dump, covering the timestamp of the core dump alert. Using the Windows Performance log viewer, the process CPU utilization counters are reviewed first, as shown below:

    [Screenshot: cm_core_cpu.JPG]

    In the screenshot above, CPU utilization appears stable prior to the core dump incident and dips during the crash as resources are released. When troubleshooting a potential CPU utilization issue, the concern would be a trend of increasing CPU usage leading up to the core dump incident.

    The next component to examine in the Perfmon data is the percentage of VM used by the system. In the current example, this counter is particularly high for the time period leading up to the core dump incident:

    [Screenshot: cm_core_vmsize.JPG]

    Next, the VMSize counters for all processes are examined to determine what caused the gradual increase in memory utilization on the system. In this example, the VMSize counter for the CCM process is relatively high and sloping upwards. This indicates that CCM cored due to a memory leak:

    [Screenshot: ccm_vm_size.JPG]
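
    An upward-sloping VMSize counter can also be checked numerically once the Perfmon samples are exported. The sketch below fits a least-squares line to (time, VMSize) pairs and reports the slope. The function name vmsize_slope, the sample values and the units are hypothetical and for illustration only; how the counter is exported from the Perfmon logs is not covered here.

```python
def vmsize_slope(samples):
    """Least-squares slope of (seconds, vmsize) samples, in units per second.

    A slope that stays clearly positive over a long window is consistent
    with the steady growth described above; it does not prove a leak.
    """
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")
    mean_time = sum(t for t, _ in samples) / count
    mean_size = sum(v for _, v in samples) / count
    numerator = sum((t - mean_time) * (v - mean_size) for t, v in samples)
    denominator = sum((t - mean_time) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical samples: (seconds since start, VMSize in KB)
    growing = [(0, 1000), (60, 1100), (120, 1200)]
    print("KB per second:", vmsize_slope(growing))
```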

    Useful Information for Creating TAC Service Requests

    When opening a new TAC service request to troubleshoot a core dump incident, the

    following information is useful to provide to TAC to expedite the process:

    Full CallManager version in use (e.g. 7.1.3.32900-4)
    Date/Time of the core dump incident
        The Application Event log is useful for this information
        Provide the output of the 'utils core list' command for absolute timestamps and core file names
    Offending service that generated the core dump (e.g. CCM, CEF, Tomcat)
    Core file backtrace output
    Core file
    Service logs for the offending process
        e.g. for a CCM core dump, provide Cisco CallManager traces for the time period of the incident
    RIS Data Collector Perfmonlog
        See the Troubleshooting Intentional Aborts section
