Cloud Native Diameter Routing Agent (CNDRA) User Guide

Oracle® Communications Cloud Native Diameter Routing Agent (CNDRA) User Guide

Release 1.6
F31209-02
May 2020

Oracle Communications Cloud Native Diameter Routing Agent (CNDRA) User Guide, Release 1.6

F31209-02

Copyright © 2019, 2020, Oracle and/or its affiliates.

This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable:

U.S. GOVERNMENT END USERS: Oracle programs (including any operating system, integrated software, any programs embedded, installed or activated on delivered hardware, and modifications of such programs) and Oracle computer documentation or other Oracle data delivered to or accessed by U.S. Government end users are "commercial computer software" or "commercial computer software documentation" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, reproduction, duplication, release, display, disclosure, modification, preparation of derivative works, and/or adaptation of i) Oracle programs (including any operating system, integrated software, any programs embedded, installed or activated on delivered hardware, and modifications of such programs), ii) Oracle computer documentation and/or iii) other Oracle data, is subject to the rights and limitations specified in the license contained in the applicable contract. The terms governing the U.S. Government's use of Oracle cloud services are defined by the applicable contract for such services. No other rights are granted to the U.S. Government.

This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Intel and Intel Inside are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Epyc, and the AMD logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.

This software or hardware and documentation may provide access to or information about content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services unless otherwise set forth in an applicable agreement between you and Oracle. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services, except as set forth in an applicable agreement between you and Oracle.

Contents

1 Introduction

References 1-1

Acronym and Terminologies 1-1

My Oracle Support 1-2

2 CNDRA Architecture

3 CNDRA Supported Features

CNDRA Disaster Recovery 3-1

Multiple Deployment Support 3-2

Mediation Support 3-3

4 CNDRA Metrics, KPIs, and Alerts

5 CNDRA Alarms and Events

Diameter Alarms and Events (8000-8299, 22000-22350, 22900-22999, 25600-25899) 5-11

8000 - MpEvFsmException 5-11

8000 - 001 - MpEvFsmException_SocketFailure 5-11

8000 - 002 - MpEvFsmException_BindFailure 5-12

8000 - 003 - MpEvFsmException_OptionFailure 5-12

8000 - 004 - MpEvFsmException_AcceptorCongested 5-13

8000 - 101 - MpEvFsmException_ListenFailure 5-13

8000 - 102 - MpEvFsmException_PeerDisconnected 5-14

8000 - 103 - MpEvFsmException_PeerUnreachable 5-14

8000 - 104 - MpEvFsmException_CexFailure 5-15

8000 - 105 - MpEvFsmException_CerTimeout 5-16

8000 - 106 - MpEvFsmException_AuthenticationFailure 5-16

8000 - 201 - MpEvFsmException_UdpSocketLimit 5-17

8001 - MpEvException 5-17

8001 - 001 - MpEvException_Oversubscribed 5-17

8002 - MpEvRxException 5-18

8002 - 001 - MpEvRxException_DiamMsgPoolCongested 5-18

8002 - 002 - MpEvRxException_MaxMpsExceeded 5-18

8002 - 003 - MpEvRxException_CpuCongested 5-19

8002 - 004 - MpEvRxException_SigEvPoolCongested 5-20

8002 - 005 - MpEvRxException_DstMpUnknown 5-20

8002 - 006 - MpEvRxException_DstMpCongested 5-21

8002 - 007 - MpEvRxException_DrlReqQueueCongested 5-21

8002 - 008 - MpEvRxException_DrlAnsQueueCongested 5-22

8002 - 009 - MpEvRxException_ComAgentCongested 5-22

8002 - 201 - MpEvRxException_MsgMalformed 5-23

8002 - 202 - MpEvRxException_PeerUnknown 5-23

8002 - 204 - MpEvRxException_ItrPoolCongested 5-24

8002 - 207 - MpEvRxException_ReqDuplicate 5-24

8003 - MpEvTxException 5-26

8003 - 001 - MpEvTxException_ConnUnknown 5-26

8003 - 101 - MpEvTxException_DclTxTaskQueueCongested 5-26

8003 - 202 - MpEvTxException_EtrPoolCongested 5-27

8004 - EvFsmAdState 5-27

8004 - 001 - EvFsmAdState_StateChange 5-27

8005 - EvFsmOpState 5-28

8005 - 001 - EvFsmOpState_StateChange 5-28

8006 - EvFsmException 5-29

8006 - 001 - EvFsmException_DnsFailure 5-29

8006 - 002 - EvFsmException_ConnReleased 5-29

8006 - 101 - EvFsmException_SocketFailure 5-30

8006 - 102 - EvFsmException_BindFailure 5-31

8006 - 103 - EvFsmException_OptionFailure 5-31

8006 - 104 - EvFsmException_ConnectFailure 5-32

8006 - 105 - EvFsmException_PeerDisconnected 5-32

8006 - 106 - EvFsmException_PeerUnreachable 5-33

8006 - 107 - EvFsmException_CexFailure 5-33

8006 - 108 - EvFsmException_CeaTimeout 5-34

8006 - 109 - EvFsmException_DwaTimeout 5-35

8006 - 110 - EvFsmException_DwaTimeout 5-35

8006 - 111 - EvFsmException_ProvingFailure 5-36

8006 - 112 - EvFsmException_WatchdogFailure 5-36

8006 - 113 - EvFsmException_AuthenticationFailure 5-37

8007 - EvException 5-37

8007 - 101 - EvException_MsgPriorityFailure 5-37

8008 - EvRxException 5-38

8008 - 001 - EvRxException_MaxMpsExceeded 5-38

8008 - 101 - EvRxException_MsgMalformed 5-38

8008 - 102 - EvRxException_MsgInvalid 5-39

8008 - 202 - EvRxException_MsgAttrLenUnsupported 5-39

8008 - 203 - EvRxException_MsgTypeUnsupported 5-40

8008 - 204 - EvRxException_AnsOrphaned 5-40

8008 - 205 - EvRxException_AccessAuthMissing 5-41

8008 - 206 - EvRxException_StatusAuthMissing 5-41

8008 - 207 - EvRxException_MsgAuthInvalid 5-42

8008 - 208 - EvRxException_ReqAuthInvalid 5-43

8008 - 209 - EvRxException_AnsAuthInvalid 5-43

8008 - 210 - EvRxException_MsgAttrAstUnsupported 5-44

8008 - 212 - EvRxException_MsgTypeMissingMccs 5-44

8008 - 213 - EvRxException_ConnUnavailable 5-45

8009 - EvTxException 5-45

8009 - 001 - EvTxException_ConnUnavailable 5-45

8009 - 101 - EvTxException_DclTxConnQueueCongested 5-46

8009 - 102 - EvTxException_DtlsMsgOversized 5-46

8009 - 201 - EvTxException_MsgAttrLenUnsupported 5-47

8009 - 202 - EvTxException_MsgTypeUnsupported 5-47

8009 - 203 - EvTxException_MsgLenInvalid 5-48

8009 - 204 - EvTxException_ReqOnServerConn 5-48

8009 - 205 - EvTxException_AnsOnClientConn 5-49

8009 - 206 - EvTxException_DiamMsgMisrouted 5-50

8009 - 207 - EvTxException_ReqDuplicate 5-50

8009 - 208 - EvTxException_WriteFailure 5-51

8010 - MpIngressDrop 5-51

8011 - EcRate 5-52

8012 - MpRxNgnPsOfferedRate 5-53

8013 - MpNgnPsStateMismatch 5-54

8014 - MpNgnPsDrop 5-55

8015 - NgnPsMsgMisrouted 5-56

8016 - MpP16StateMismatch 5-56

8017 - MpTaskCpuCongested 5-57

8018 - P16MsgMisrouted 5-58

8019 - MpAnswerPriorityModeMismatch 5-58

8020 - MpRoutingThreadPoolStateMismatch 5-59

8100 - NormMsgMisrouted 5-59

8101 - DiagMsgMisrouted 5-60

22001 - Message Decoding Failure 5-60

22002 - Peer Routing Rules with Same Priority 5-61

22003 - Application ID Mismatch with Peer 5-61

22004 - Maximum pending transactions allowed exceeded 5-62

22005 - No peer routing rule found 5-63

22007 - Inconsistent Application ID Lists from a Peer 5-64

22008 - Orphan Answer Response Received 5-65

22009 - Application Routing Rules with Same Priority 5-65

22010 - Specified DAS Route List not provisioned 5-66

22014 - No DAS Route List specified 5-67

22012 - Specified MCCS not provisioned 5-67

22016 - Peer Node Alarm Aggregation Threshold 5-68

22017 - Route List Alarm Aggregation Threshold 5-69

22013 - DAS Peer Number of Retransmits Exceeded for Copy 5-70

22018 - Maintenance Leader HA Notification to go Active 5-70

22019 - Maintenance Leader HA Notification to go OOS 5-71

22020 - Copy Message size exceeded the system configured size limit 5-71

22021 - Debug Routing Info AVP Enabled 5-72

22022 - Forwarding Loop Detected 5-73

22051 - Peer Unavailable 5-73

22052 - Peer Degraded 5-75

22053 - Route List Unavailable 5-76

22054 - Route List Degraded 5-77

22055 - Non-Preferred Route Group in Use 5-78

22056 - Connection Admin State Inconsistency Exists 5-79

22062 - Actual Host Name cannot be determined for Topology Hiding 5-80

22063 - Diameter Max Message Size Limit Exceeded 5-80

22064 - Upon receiving Redirect Host Notification the Request has not been submitted for re-routing 5-81

22065 - Upon receiving Redirect Realm Notification the Request has not been submitted for re-routing 5-81

22071 - TtgEvLossChg 5-82

22075 - Message is not routed to Application 5-83

22077 - Excessive Request Reroute Threshold Exceeded 5-83

22078 - Loop or Maximum Depth Exceeded in ART or PRT Search 5-84

22101 - Connection Unavailable 5-85

22102 - Connection Degraded 5-86

22105 - Connection Transmit Congestion 5-89

22106 - Ingress Message Discarded: DraWorker Ingress Message Rate Control 5-90

22200 - MP CPU Congested 5-91

22201 - MpRxAllRate 5-92

22202 - MpDiamMsgPoolCongested 5-93

22203 - PTR Buffer Pool Utilization 5-93

22204 - Request Message Queue Utilization 5-94

22205 - Answer Message Queue Utilization 5-95

22206 - Reroute Queue Utilization 5-96

22207 - DclTxTaskQueueCongested 5-96

22208 - DclTxConnQueueCongested 5-97

22209 - Message Copy Disabled 5-98

22214 - Message Copy Queue Utilization 5-98

22221 - Routing MPS Rate 5-99

22222 - Long Timeout PTR Buffer Pool Utilization 5-100

22223 - DraWorker Memory Utilization Threshold Crossed 5-100

22224 - Average Hold Time Limit Exceeded 5-101

22225 - Average Message Size Limit Exceeded 5-104

22328 - Connection is processing a higher than normal ingress messaging rate 5-105

22350 - Fixed Connection Alarm Aggregation Threshold 5-107

22900 - DPI DB Table Monitoring Overrun 5-109

22901 - DPI DB Table Monitoring Error 5-109

22950 - Connection Status Inconsistency Exists 5-110

22961 - Insufficient Memory for Feature Set 5-111

25612 - Peer CNDRA ping failed 5-112

25613 - Peer Node Alarm Group Threshold 5-112

25614 - Connection Alarm Group Threshold 5-113

25806 - Invalid Internal Overseer Server Group Designation 5-113

Range Based Address Resolution (RBAR) Alarms and Events (22400-22424) 5-114

22400 - Message Decoding Failure 5-114

22401 - Unknown Application ID 5-115

22402 - Unknown Command Code 5-115

22403 - No Routing Entity Address AVPs 5-116

22404 - No valid Routing Entity Addresses found 5-116

22405 - Valid address received didn't match a provisioned address or address range 5-117

22406 - Routing attempt failed due to internal resource exhaustion 5-118

22407 - Routing attempt failed due to internal database inconsistency failure 5-118

22411 - Address Range Lookup for Local Identifier skipped 5-119

Generic Application Alarms and Events (22500-22599) 5-119

22500 - Peer CNDRA Application Unavailable 5-119

22501 - Peer CNDRA Application Degraded 5-121

22502 - Peer CNDRA Application Request Message Queue Utilization 5-122

22503 - Peer CNDRA Application Answer Message Queue Utilization 5-124

22504 - Peer CNDRA Application Ingress Message Rate 5-125

22520 - Peer CNDRA Application Enabled 5-126

22521 - Peer CNDRA Application Disabled 5-126

A CNE modification for CNDRA Alerting and SNMP Integration

What's New in This Release

The following new features are introduced in Cloud Native Diameter Routing Agent 1.6.0:

• Mediation support specific infrastructure

• IP-pool configuration per dra deployment for multi-deployment

• Connection support up to 12,000 (1,000 connections per pod)

• Redirect Agent Support

List of Figures

2-1 CNDRA deployment components 2-1

1 Introduction

The CNDRA (Cloud Native Diameter Routing Agent) is a Diameter routing solution for the Cloud Native Environment.

This feature allows the deployment of the CNDRA on a bare metal CNE.

CNDRA supports the following:

• Initial topology deployment using Helm Job.

• Monitoring, auditing, and configuration of overseer and dra pods through the Keeper Service.

• MMI support for Diameter configuration.

• Observability through CNE infrastructure components, such as EFK.

• Support for Initiator and Responder connections over TCP.

– Support for a responder connection to be established on any dra pod in the topology.

• Application metrics presentation using Prometheus.

• Visualization of time series data for infrastructure and application analytics using Grafana.

• Distribution of incoming connection requests from peers using an external load balancer, such as a BGP-based top-of-rack switch (cluster level), and subsequently by kube-proxy (node level), to a dra pod in the topology.

• Initiator connection distribution using a round-robin mechanism to one of the dra pods available in the topology.

• Distribution of initiator connections from the same peer among different dra pods.

• Inter-dra-pod communication over TCP connections for signaling messages.

• Comcol replication and merging of configuration data over TCP connections.
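The round-robin distribution of initiator connections mentioned in the list above can be pictured with a minimal Python sketch. The pod and connection names are hypothetical, and the sketch only illustrates the general round-robin idea; it is not the CNDRA scheduling code.

```python
from collections import defaultdict
from itertools import cycle


def distribute_round_robin(connections, dra_pods):
    """Assign each initiator connection to the next dra pod in turn."""
    if not dra_pods:
        raise ValueError("at least one dra pod is required")
    assignment = defaultdict(list)
    pod_cycle = cycle(dra_pods)
    for conn in connections:
        assignment[next(pod_cycle)].append(conn)
    return dict(assignment)


# Hypothetical names: four initiator connections towards one peer, three dra pods.
pods = ["dra-0", "dra-1", "dra-2"]
conns = ["hss1-conn-a", "hss1-conn-b", "hss1-conn-c", "hss1-conn-d"]
result = distribute_round_robin(conns, pods)
for pod in pods:
    print(pod, result.get(pod, []))
```

With more connections than pods, the cycle wraps around, so connections from the same peer end up spread across different dra pods.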

References

This section lists documents that provide more information and details.

• CNE Installation Guide

• CNDRA Installation Guide

• CNDRA Alarms and KPIs Reference

Acronym and Terminologies

The following table contains acronyms and terminologies used within the document.

Term            Description
Alerts          Notifications identifying specific issues
Alert Manager   Handles alerts sent by client applications such as the Prometheus server
CNDRA           Cloud Native Diameter Routing Agent
CNE             Cloud Native Environment
Dra pod         Diameter Routing Agent
EFK             Elasticsearch, Fluentd, and Kibana stack
Helm Job        Helm install job pod
Keeper          Monitoring service for overseer and dra pods
MetalLB         Load-balancer implementation for bare metal Kubernetes clusters
MMI             Machine to Machine Interface
OverSeer        Pod which supervises Comcol topology configuration, especially workers
Prometheus      An open-source monitoring system for cloud applications
RBAR            Range Based Address Resolution
SNMP Notifier   Receives alerts and sends them as SNMP traps to the SNMP Manager

My Oracle Support

My Oracle Support (https://support.oracle.com) is your initial point of contact for all product support and training needs. A representative at Customer Access Support can assist you with My Oracle Support registration.

Call the Customer Access Support main number at 1-800-223-1711 (toll-free in the US), or call the Oracle Support hotline for your local country from the list at http://www.oracle.com/us/support/contact/index.html. When calling, make the selections in the sequence shown below on the Support telephone menu:

1. Select 2 for New Service Request.

2. Select 3 for Hardware, Networking and Solaris Operating System Support.

3. Select one of the following options:

• For Technical issues such as creating a new Service Request (SR), select 1.

• For Non-technical issues such as registration or assistance with My Oracle Support, select 2.

You are connected to a live agent who can assist you with My Oracle Support registration and opening a support ticket.

My Oracle Support is available 24 hours a day, 7 days a week, 365 days a year.

2 CNDRA Architecture

This section includes information about the CNDRA deployment model.

The CNDRA architecture diagram below illustrates the deployment components:

Figure 2-1 CNDRA deployment components

CNDRA Topology configuration

Topology configuration is performed by a Java client running as a Helm job.

• Once the job is done, it is removed from the deployment.

• The job is recoverable from failure; upon restart, it continues from the last state.

• Topology configuration includes:

– Network Element (NE)

– Server Configuration

– Server Group Configuration

Diameter Configuration

Diameter configuration is performed by MMI commands using any HTTP/REST client, such as Postman or cURL.

Diameter configuration includes:

• Local Node configuration

• Peer Node configuration (HSS, MME)

• Responder Connections configuration

• Initiator Connections configuration

• Connection Admin State

• Capacity configuration sets

• Add route group

• Add route list

• Add peer route table

• Application configuration such as RBAR

MMI References

• CNDRA MMI is a RESTful (Representational State Transfer) interface that provides access to a broad range of Operations, Administration, and Maintenance (OAM) services that clients use to configure and manage the CNDRA.

• When the pods are deployed in a Kubernetes cluster, one pod is the leader, and MMI requests must be sent to that pod.

• The CNDRA configuration is managed through Create, Retrieve, Update, and Delete (CRUD) actions on instances of the various resource types built into the CNDRA.

• The CRUD operations are implemented using standard HTTP verbs:

– HTTP POST is used to Create a new resource instance.

– HTTP GET is used to Retrieve one or more resource instances.

– HTTP PUT is used to Update a resource instance.

– HTTP DELETE is used to Delete a resource instance.
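The CRUD-to-verb mapping above can be sketched with the Python standard library as follows. The base URL, the resource path peernodes, and the payload field are hypothetical placeholders; the actual host, port, and resource names come from the CNDRA MMI documentation. The sketch only builds the requests and does not send them.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the address of the leader pod's MMI endpoint.
BASE_URL = "http://mmi.example.invalid:8080/mmi/v1"

# CRUD action -> HTTP verb, as listed in this guide.
CRUD_TO_HTTP = {
    "create": "POST",
    "retrieve": "GET",
    "update": "PUT",
    "delete": "DELETE",
}


def build_mmi_request(action, resource_path, payload=None):
    """Build (but do not send) an HTTP request for a CRUD action."""
    method = CRUD_TO_HTTP[action]
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        f"{BASE_URL}/{resource_path}", data=data, method=method
    )
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


# Hypothetical resource and payload, for illustration only.
req = build_mmi_request("create", "peernodes", {"name": "hss1"})
print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) is left out on purpose, because the endpoint shown here does not exist.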

Elasticsearch, Fluentd, Kibana (EFK)

The EFK framework has the following three components:

• Elasticsearch provides storage for the collected data and has the following three components:

– client: runs on each worker node and provides access for the data collector.

– master: runs on each node and manages all the Elasticsearch pods.

– data: runs on each worker node and is responsible for storing the data in document view schema format.

• Fluentd runs on each worker node and is responsible for collecting the data and passing it to Elasticsearch.

• Kibana runs on any one worker node and is responsible for getting data from Elasticsearch and displaying it on the Kibana GUI.

Prometheus and Grafana

• Prometheus is used to retrieve and view application metrics.

• Prometheus is an open source, metrics-based monitoring system.

• Prometheus has a simple yet powerful data model and a query language that allow us to analyze the performance of the applications and the infrastructure.

• Prometheus is integrated with many common service discovery mechanisms, such as Kubernetes, EC2, and Consul.

• Prometheus discovers targets to scrape (pull) from service discovery.

• Prometheus uses HTTP GET requests to scrape (pull) metrics.

• Grafana is an open source metric analytics and visualization suite.

• It is most commonly used for visualizing time series data for infrastructure and application analytics.

CNDRA Disaster Recovery

The CNDRA supports the following two types of pods:

• overseer

• draworker

Overseer

The overseer pods are mainly responsible for operation and management of the product. The overseer pods manage the topology and all the application-level configuration data. In CNDRA, the overseer maintains all the configuration.

DraWorker

The draworker is mainly responsible for managing the Diameter traffic (signaling). The draworker pods always get the latest configuration data from the overseer via comcol replication channels.

The CNDRA 1.5.0 release supports four overseer pods in a deployment, instead of two in previous releases, to enable Disaster Recovery.

Out of the four overseer pods in the CNDRA topology, one is ACTIVE, one is STANDBY, and two are in the SPARE state. At the time of Disaster Recovery, when both the Active and Standby overseers go down, the two SPARE pods take up the roles of ACTIVE and STANDBY overseer, and the Keeper adds two new overseer pods in the SPARE state.

3 CNDRA Supported Features

CNDRA supports the following features:

CNDRA Disaster Recovery

The CNDRA is a Cloud Native DRA solution. The product is deployed in a Kubernetes cluster using Helm charts. During initial deployment, the CNDRA has a helper job pod called init-config. This init-config job is the Kubernetes client that scans and configures the deployed CNDRA topology with all the required data at the deployment stage. After the init-config job pod completes, topology handling is managed by the keeper pod.

The CNDRA solution supports three types of pods:

• Keeper

• Overseer

• DraWorker

Keeper

The keeper pods help manage the topology for the Kubernetes environment.

Overseer

The overseer pods are mainly responsible for operation and management functionality. The overseer pods manage the topology and all the application-level configuration data. In CNDRA, the overseer maintains all the configuration.

The CNDRA overseer pods are deployed in a group of four instances. The instances are assigned the following HA roles:

• Active

• Standby

• Spare

• Spare

The overseer deployment is configured with pod anti-affinity so that Kubernetes schedules each overseer pod on a different available worker node. If enough worker nodes are not available, Kubernetes schedules the pods according to the availability of resources among the existing worker nodes.

DraWorker

The draworker is mainly responsible for managing the Diameter traffic (signaling). The draworker pods always get the latest OAM configuration data from the overseer via comcol replication channels.

Overview

The CNDRA 1.5.0 release supports four overseer pods in a deployment to enable Disaster Recovery.

Out of the four overseer pods in the CNDRA topology, one is ACTIVE, one is STANDBY, and two are in the SPARE state. During Disaster Recovery, when both the Active and Standby overseers go down, the two SPARE pods take up the roles of ACTIVE and STANDBY overseer, and the Keeper adds two new overseer pods in the SPARE state.

The role transitions between HA states are as follows:

• Spare → Active

• Spare → Standby

• Standby → Active
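The transitions above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the pod names are hypothetical, and the order in which spares are promoted is an arbitrary choice made here, not a documented CNDRA or Keeper behavior.

```python
def recover_overseers(roles, failed, new_pod_names):
    """Return the role map after the failed pods are lost.

    roles: dict of pod name -> "ACTIVE" | "STANDBY" | "SPARE"
    failed: set of pod names that went down
    new_pod_names: names of replacement pods, which join as SPARE
    """
    survivors = {pod: role for pod, role in roles.items() if pod not in failed}

    # Standby -> Active when the Active pod is gone.
    if "ACTIVE" not in survivors.values():
        for pod, role in survivors.items():
            if role == "STANDBY":
                survivors[pod] = "ACTIVE"
                break

    # Spare -> Active, then Spare -> Standby, to fill any role still vacant.
    # Assumption: the first spare found is promoted.
    for wanted in ("ACTIVE", "STANDBY"):
        if wanted not in survivors.values():
            for pod, role in survivors.items():
                if role == "SPARE":
                    survivors[pod] = wanted
                    break

    # Replacement pods are added in the SPARE state.
    for name in new_pod_names:
        survivors[name] = "SPARE"
    return survivors


# Hypothetical pod names.
roles = {
    "overseer-0": "ACTIVE",
    "overseer-1": "STANDBY",
    "overseer-2": "SPARE",
    "overseer-3": "SPARE",
}
after = recover_overseers(roles, {"overseer-0", "overseer-1"}, ["overseer-4", "overseer-5"])
print(after)
```

The same function covers the single-failure case: if only the Active pod is lost, the Standby becomes Active and one spare becomes Standby.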

Worker Node Failure

A Worker node failure causes the Active overseer pod running on it to be re-spawned. In that scenario, the Standby overseer assumes the Active role and handles the topology from that point.

The overseer pods are distributed across the Worker nodes in the Kubernetes cluster. In a Kubernetes cluster with four worker nodes, each overseer pod is scheduled on a different worker node.

With four overseer pods and a single overseer scheduled per Worker node, the failure chances reduce to 25% from 50% when compared to previous releases.

In a Kubernetes cluster with fewer than four worker nodes, the overseer distribution across the worker nodes is asymmetric. In such a case, there could be one or more overseer pods scheduled on a given Worker node.

In such a case, if a Worker node fails, the Active role is taken up by a different overseer instance.
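One way to read the 25% and 50% figures is as the chance that a single failed Worker node is the one hosting the Active overseer. The sketch below computes that chance for a given placement, assuming every listed Worker node is equally likely to fail; the node names and placements are hypothetical, and this reading is an interpretation rather than a statement from the product documentation.

```python
from fractions import Fraction


def active_loss_probability(placement):
    """Chance that one failed worker node hosts the Active overseer.

    placement: dict of worker node name -> list of overseer roles on that node.
    Assumes exactly one node fails and each node is equally likely to fail.
    """
    nodes_with_active = sum(1 for roles in placement.values() if "ACTIVE" in roles)
    return Fraction(nodes_with_active, len(placement))


# Two overseers on two nodes (earlier releases), one per node.
two_nodes = {"wn1": ["ACTIVE"], "wn2": ["STANDBY"]}
# Four overseers on four nodes, one per node.
four_nodes = {"wn1": ["ACTIVE"], "wn2": ["STANDBY"], "wn3": ["SPARE"], "wn4": ["SPARE"]}
# Asymmetric placement on three nodes.
three_nodes = {"wn1": ["ACTIVE", "SPARE"], "wn2": ["STANDBY"], "wn3": ["SPARE"]}

for name, placement in [("2 nodes", two_nodes), ("4 nodes", four_nodes), ("3 nodes", three_nodes)]:
    print(name, active_loss_probability(placement))
```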

Cluster Failure

In case of Kubernetes cluster failure, such as when the complete set of Worker nodes fails in the cluster or in the subset of the cluster where the CNDRA is deployed, the CNDRA deployment is cleaned up.

Since there is no other way to recover the previous deployment, it is recommended to take a backup of the Diameter configuration once the configuration is completed on the CNDRA deployment.

In case of failure, the user should deploy a fresh CNDRA topology and import the Diameter configuration backed up previously from the older deployment.

Multiple Deployment Support

The CNDRA comes with Multiple Deployment support. This feature allows the user to bind the deployments to specific Worker nodes (WN).

The previous CNDRA version with single deployment had the following limitations:

• For initiator connections, the Peer node only accepts packets that carry a valid SRC IP (Public IP) of a WN.

Because the pods could be rescheduled on any WN within the cluster (or within the set of WNs selected by a node-selector label), the packets sent by CNDRA for an initiator connection could carry the SRC IP (Public IP) of any WN in the cluster.

This effect of initiator connections being scheduled on any worker node required special consideration on the Peer node side for connection configuration with SRC IP validation on incoming packets.

Therefore, to support a single CNDRA initiator connection, a Peer node (with SRC IP validation) requires multiple connections to be configured (for example, one for each WN) so that it can accept the connection from any one WN at a time.

• Responder connections of the deployment are impacted upon WN failure or failure of the last service pod on a WN.

The Load balancer uses a WN-based topology to route the ingress traffic, by hashing to one of the WNs.

With the diameter service configured with externalTrafficPolicy as Local (to preserve the SRC IP), failure of the last pod on a WN, or failure of the WN itself, causes a topology change for load balancing.

This causes a reset of the active responder connections with other WNs.
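To see why a topology change can disturb connections on WNs that did not fail, consider a minimal Python sketch of hash-based selection. The hash function (CRC32 modulo the number of WNs), the WN names, and the peer addresses are assumptions made for illustration; the actual load balancer algorithm is not described in this guide.

```python
import zlib


def pick_worker_node(src_ip, src_port, worker_nodes):
    """Map a peer connection to a worker node by hashing its source address."""
    key = f"{src_ip}:{src_port}".encode("utf-8")
    return worker_nodes[zlib.crc32(key) % len(worker_nodes)]


def remapped_connections(connections, before, after):
    """Return the connections whose worker node differs between two topologies."""
    return [
        conn for conn in connections
        if pick_worker_node(*conn, before) != pick_worker_node(*conn, after)
    ]


# Hypothetical topology: wn3 fails (or loses its last service pod).
before = ["wn1", "wn2", "wn3"]
after = ["wn1", "wn2"]
# Hypothetical peer source addresses (documentation IP range).
conns = [("192.0.2.%d" % i, 3868 + i) for i in range(1, 21)]

moved = remapped_connections(conns, before, after)
print(f"{len(moved)} of {len(conns)} connections map to a different WN")
```

With this kind of modulo hashing, connections that were on surviving WNs can also be remapped when the WN count changes, which is one way the reset of responder connections on other WNs can arise.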

The feature helps the user address the following Diameter connection requirements by binding a deployment to a given set of WNs:

1. For initiator connections, ensure that connection attempts towards a Peer node always have the same SRC IP irrespective of pod failures.

2. For Responder connections, ensure that any pod or worker node failure events do not impact other existing responder connections.

The CNDRA supports both single and multiple deployments. The user can choose either single or multiple deployment during CNDRA installation. However, the deployment type cannot be modified post installation, such as single to multiple, multiple to single, or multiple to multiple.

For example, the following changes are not supported:

• The user installed CNDRA with single deployment and later wants to change to multiple deployment.

• The user installed CNDRA with multiple deployment (two deployments) and later wants to change to four deployments.

Note:

There could be one or more deployments bound to one or more WNs, based on the cluster usage planning.

Mediation Support

CNDRA supports Mediation configuration through the Mediation GUI. Mediation is enabled or disabled by setting the overseer Helm flag isMediationEnabled to true or false in the custom_value.yaml file.

The Mediation flag can be set during a fresh installation or can be modified on the existing deployment using the overseer upgrade mechanism. The overseer upgrade is performed using the overseer custom value file updated with new values.
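As an illustration of updating the custom value file before an overseer upgrade, the following Python sketch flips the isMediationEnabled flag in the file's text. The sample content (the overseer key and the replicas field) is hypothetical, since the actual layout of custom_value.yaml is not shown in this guide; the sketch matches the flag line by name so that it does not depend on the nesting.

```python
import re

# Matches a line such as "  isMediationEnabled: false", keeping the prefix.
FLAG_PATTERN = re.compile(r"^([ \t]*isMediationEnabled[ \t]*:[ \t]*)\S+", re.MULTILINE)


def set_mediation_flag(values_text, enabled):
    """Return the values-file text with isMediationEnabled set to true or false."""
    new_value = "true" if enabled else "false"
    updated, count = FLAG_PATTERN.subn(lambda m: m.group(1) + new_value, values_text)
    if count == 0:
        raise ValueError("isMediationEnabled not found in values file")
    return updated


# Hypothetical file content, for illustration only.
sample = "overseer:\n  isMediationEnabled: false\n  replicas: 4\n"
print(set_mediation_flag(sample, True))
```

The updated text would then be written back to the custom value file that is passed to the overseer upgrade.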

Enabling or disabling the Mediation flag only controls access to the Mediation GUI, whereas the Mediation rules are configured by the user using the Mediation GUI. The Mediation configuration must be manually cleaned up before disabling the Mediation feature.

Note:

Before disabling the Mediation feature, the previous Mediation configuration must be deleted or removed. Otherwise, the system retains the previous Mediation configuration, with active rules if Mediation rules were enabled.

After enabling the Mediation feature, the user can access the Mediation GUI using the overseer service's LB IP address and service port: https://OM-LB-IP:Port.

The default login credentials for the Mediation GUI are:

• user-guiadmin

• pass-texxxxx

4 CNDRA Metrics, KPIs, and Alerts

This section includes information about the Metrics, KPIs, and Alerts associated with CNDRA.

This section describes the Alert rules configuration for CNDRA. The Alert Manager uses the Prometheus measurement values reported by microservices, as per the conditions under Alert rules, to trigger Alerts.

CNDRA supports the following two types of Alert notification:

• Measurement Based Alerts: These alerts are reported using the following components:

– CNDRA Pods

– Prometheus

– Alert Manager

– SNMP Notifier

– SNMP Manager

• App Based Alerts: These alerts are reported using the following components:

– CNDRA Pods

– Alert Manager

– SNMP Notifier

– SNMP Manager

Alert Rules Configuration

Perform the following steps to configure the Alert rules:

1. Find the configmap used to configure alerts for alertmanager, by executing:
$ kubectl get configmap -n <_Namespace_>

2. Save a copy of the current config map data of Alert Manager, by executing:
$ kubectl get configmaps <_NAME_> -o yaml -n <_Namespace_> > /tmp/t_mapConfig.yaml

3. Delete the entry "alertscndra" under rule_files, if present in the Alert Manager config map, by executing:
$ sed -i '/etc\/config\/alertscndra/d' /tmp/t_mapConfig.yaml

Note:

To be executed only once.

4. Add the entry "alertscndra" under rule_files in the Alert Manager config map, by executing:
$ sed -i '/rule_files:/a\ \- /etc/config/alertscndra' /tmp/t_mapConfig.yaml

Note:

To be executed only once.

5. Reload the configmap in case it has been modified, by executing:
$ kubectl replace configmap <_NAME_> -f /tmp/t_mapConfig.yaml

Note:

Not required for AlertRules.

6. Add cndraAlertrules into the config map, by executing:
$ kubectl patch configmap <_NAME_> -n <_Namespace_> --type merge --patch "$(cat ~/cndraAlertrules.yaml)"

Note:

The alertmanager and prometheus tools run in the occne-infra (default) namespace of CNE 1.2.
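The intent of steps 3 and 4 (remove any existing alertscndra entry, then add it exactly once under rule_files) can be sketched in Python as below. This is an illustrative alternative to the sed commands, not a replacement shipped with the product; the sample configuration text is hypothetical, and the sketch assumes the new list entry should use the same indentation as the rule_files line.

```python
RULE_ENTRY = "/etc/config/alertscndra"


def ensure_rule_file_entry(config_text):
    """Drop any existing alertscndra entry, then add it once under rule_files."""
    # Step 3 equivalent: remove lines that mention the entry.
    lines = [line for line in config_text.splitlines() if RULE_ENTRY not in line]

    # Step 4 equivalent: append the entry right after the rule_files line.
    result = []
    for line in lines:
        result.append(line)
        if line.strip() == "rule_files:":
            indent = line[: len(line) - len(line.lstrip())]
            result.append(f"{indent}- {RULE_ENTRY}")
    return "\n".join(result) + "\n"


# Hypothetical config map excerpt, for illustration only.
sample = (
    "global:\n"
    "  scrape_interval: 15s\n"
    "rule_files:\n"
    "- /etc/config/rules\n"
    "- /etc/config/alertscndra\n"
)
once = ensure_rule_file_entry(sample)
print(once)
```

Because the entry is removed before it is added, running the function repeatedly leaves a single entry, which is the reason the guide pairs the delete step with the add step.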

Alert Rules file

The sample below represents the configuration values provided in the cndraAlertRules.yaml file:

apiVersion: v1 data: alertscndra: | groups: - name: CndraAlerts rules: - alert: RoutingMsg expr: rate(DiamTrans{nfType="occndra",type="DiamPerf",info="RoutingMsgs"}[5m]) >= 27000 < 36000 labels: severity: "minor" alertname: "RoutingMsg" oid: "1.3.6.1.4.1.323.5.3.48.1.2.22221" namespace: ' {{ $labels.kubernetes_namespace }} ' annotations: description: 'RoutingMsgRate is {{ $value | printf " %.2f " }} above threshold 27000' summary: 'Alarm RoutingMsg , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }} ' - alert: RoutingMsg

Chapter 4

4-2

expr: rate(DiamTrans{nfType="occndra",type="DiamPerf",info="RoutingMsgs"}[5m]) >= 36000 < 42750 labels: severity: "major" alertname: "RoutingMsg" oid: "1.3.6.1.4.1.323.5.3.48.1.2.22221" namespace: ' {{ $labels.kubernetes_namespace }} ' annotations: description: 'RoutingMsgRate is {{ $value | printf " %.2f " }} above threshold 36000' summary: 'Alarm RoutingMsg , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}' - alert: RoutingMsg expr: rate(DiamTrans{nfType="occndra",type="DiamPerf",info="RoutingMsgs"}[5m]) >= 42750 labels: severity: "critical" alertname: "RoutingMsg" oid: "1.3.6.1.4.1.323.5.3.48.1.2.22221" namespace: ' {{ $labels.kubernetes_namespace }} ' annotations: description: 'RoutingMsgRate is {{ $value | printf " %.2f " }} above threshold 42750' summary: 'Alarm RoutingMsg , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}' - alert: DraRxAllRate expr: (draworkerIc{nfType="occndra",direction="Rx",info="DraIrtRate"})/100 >= 27000 < 36000 labels: severity: "minor" alertname: "DraRxAllRate" oid: "1.3.6.1.4.1.323.5.3.48.1.2.22201" namespace: ' {{ $labels.kubernetes_namespace }} ' annotations: description: 'DraRxAllRate is {{ $value | printf " %.2f " }} above threshold 27000' summary: 'Alarm DraRxAllRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}' - alert: DraRxAllRate expr: (draworkerIc{nfType="occndra",direction="Rx",info="DraIrtRate"})/100 >= 36000 < 42750 labels: severity: "major" alertname: "DraRxAllRate" oid: "1.3.6.1.4.1.323.5.3.48.1.2.22201"

          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraRxAllRate is {{ $value | printf " %.2f " }} above threshold 36000'
          summary: 'Alarm DraRxAllRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DraRxAllRate
        expr: (draworkerIc{nfType="occndra",direction="Rx",info="DraIrtRate"})/100 >= 42750
        labels:
          severity: "critical"
          alertname: "DraRxAllRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22201"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraRxAllRate is {{ $value | printf " %.2f " }} above threshold 42750'
          summary: 'Alarm DraRxAllRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: TmAvgRspTime
        expr: AvgResp{nfType="occndra",info="TmAvgRspTime"} >= 3000 < 5000
        labels:
          severity: "minor"
          alertname: "TmAvgRspTime"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22224"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'TmAvgRspTime is {{ $value | printf " %.2f " }} above threshold 3000'
          summary: 'Alarm TmAvgRspTime , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }} '
      - alert: TmAvgRspTime
        expr: AvgResp{nfType="occndra",info="TmAvgRspTime"} >= 5000 < 7000
        labels:
          severity: "major"
          alertname: "TmAvgRspTime"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22224"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'TmAvgRspTime is {{ $value | printf " %.2f " }} above threshold 5000'
          summary: 'Alarm TmAvgRspTime , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: TmAvgRspTime
        expr: AvgResp{nfType="occndra",info="TmAvgRspTime"} >= 7000

        labels:
          severity: "critical"
          alertname: "TmAvgRspTime"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22224"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'TmAvgRspTime is {{ $value | printf " %.2f " }} above threshold 7000'
          summary: 'Alarm TmAvgRspTime , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DraRxDiamAllLen
        expr: (draworkerPerf{nfType="occndra",info="DraWorkerRxDiamAllLen"}/100) >= 2048 < 4096
        labels:
          severity: "minor"
          alertname: "DraRxDiamAllLen"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22225"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraRxDiamAllLen is {{ $value | printf " %.2f " }} above threshold 2048'
          summary: 'Alarm DraRxDiamAllLen , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DraRxDiamAllLen
        expr: (draworkerPerf{nfType="occndra",info="DraWorkerRxDiamAllLen"}/100) >= 4096 < 16364
        labels:
          severity: "major"
          alertname: "DraRxDiamAllLen"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22225"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraRxDiamAllLen is {{ $value | printf " %.2f " }} above threshold 4096'
          summary: 'Alarm DraRxDiamAllLen , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DraRxDiamAllLen
        expr: (draworkerPerf{nfType="occndra",info="DraWorkerRxDiamAllLen"}/100) >= 16364
        labels:
          severity: "critical"
          alertname: "DraRxDiamAllLen"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22225"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraRxDiamAllLen is {{ $value | printf " %.2f " }} above threshold 16364 '

          summary: 'Alarm DraRxDiamAllLen , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DraReroutePercent
        expr: draReroute{nfType="occndra",info="draworkerReroutePercent"} >= 20
        labels:
          severity: "major"
          alertname: "DraReroutePercent"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22077"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraReroutePercent is {{ $value | printf " %.2f " }} above threshold 20 '
          summary: 'Alarm DraReroutePercent , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: PtrList
        expr: PtrList{nfType="occndra",info="PtrList"} >= 60 < 80
        labels:
          severity: "minor"
          alertname: "PtrList"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22203"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'PtrList is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm PtrList , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: PtrList
        expr: PtrList{nfType="occndra",info="PtrList"} >= 80 < 95
        labels:
          severity: "major"
          alertname: "PtrList"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22203"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'PtrList is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm PtrList , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: PtrList
        expr: PtrList{nfType="occndra",info="PtrList"} >= 95
        labels:
          severity: "critical"
          alertname: "PtrList"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22203"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:

          description: 'PtrList is {{ $value | printf " %.2f " }} above threshold 95'
          summary: 'Alarm PtrList , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxRequestMsgQueue
        expr: drlReqMsgQue{nfType="occndra",info="RxRequestMsgQueue"} >= 60 < 80
        labels:
          severity: "minor"
          alertname: "RxRequestMsgQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22204"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxRequestMsgQueue is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm RxRequestMsgQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxRequestMsgQueue
        expr: drlReqMsgQue{nfType="occndra",info="RxRequestMsgQueue"} >= 80 < 95
        labels:
          severity: "major"
          alertname: "RxRequestMsgQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22204"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxRequestMsgQueue is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm RxRequestMsgQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxRequestMsgQueue
        expr: drlReqMsgQue{nfType="occndra",info="RxRequestMsgQueue"} >= 95
        labels:
          severity: "critical"
          alertname: "RxRequestMsgQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22204"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxRequestMsgQueue is {{ $value | printf " %.2f " }} above threshold 95'
          summary: 'Alarm RxRequestMsgQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxAnswerMsgQueue
        expr: drlAnsmsgQue{nfType="occndra",info="RxAnswerMsgQueue"} >= 60 < 80
        labels:

          severity: "minor"
          alertname: "RxAnswerMsgQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22205"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxAnswerMsgQueue is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm RxAnswerMsgQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxAnswerMsgQueue
        expr: drlAnsmsgQue{nfType="occndra",info="RxAnswerMsgQueue"} >= 80 < 95
        labels:
          severity: "major"
          alertname: "RxAnswerMsgQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22205"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxAnswerMsgQueue is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm RxAnswerMsgQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxAnswerMsgQueue
        expr: drlAnsmsgQue{nfType="occndra",info="RxAnswerMsgQueue"} >= 95
        labels:
          severity: "critical"
          alertname: "RxAnswerMsgQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22205"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxAnswerMsgQueue is {{ $value | printf " %.2f " }} above threshold 95'
          summary: 'Alarm RxAnswerMsgQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: TxRerouteQueue
        expr: drltxreQue{nfType="occndra",info="TxRerouteQueue"} >= 60 < 80
        labels:
          severity: "minor"
          alertname: "TxRerouteQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22206"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'TxRerouteQueue is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm TxRerouteQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'

      - alert: TxRerouteQueue
        expr: drltxreQue{nfType="occndra",info="TxRerouteQueue"} >= 80 < 95
        labels:
          severity: "major"
          alertname: "TxRerouteQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22206"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: ' TxRerouteQueue is {{ $value | printf " %.2f " }} above threshold 80'
          summary: ' Alarm TxRerouteQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: TxRerouteQueue
        expr: drltxreQue{nfType="occndra",info="TxRerouteQueue"} >= 95
        labels:
          severity: "critical"
          alertname: "TxRerouteQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22206"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'TxRerouteQueue is {{ $value | printf " %.2f " }} above threshold 95'
          summary: ' Alarm TxRerouteQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DclTxTaskQueue
        expr: draworkerPerf{nfType="occndra",info="DclTxTaskQueue"} >= 60 < 80
        labels:
          severity: "minor"
          alertname: "DclTxTaskQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22207"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DclTxTaskQueue is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm DclTxTaskQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DclTxTaskQueue
        expr: draworkerPerf{nfType="occndra",info="DclTxTaskQueue"} >= 80 < 95
        labels:
          severity: "major"
          alertname: "DclTxTaskQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22207"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DclTxTaskQueue is {{ $value | printf " %.2f " }} above threshold 80'

          summary: 'Alarm DclTxTaskQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DclTxTaskQueue
        expr: draworkerPerf{nfType="occndra",info="DclTxTaskQueue"} >= 95
        labels:
          severity: "critical"
          alertname: "DclTxTaskQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22207"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DclTxTaskQueue is {{ $value | printf " %.2f " }} above threshold 95'
          summary: 'Alarm DclTxTaskQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DclTxConnQueue
        expr: ConnectionPerf{nfType="occndra",info="DclTxConnQueue"} >= 60 < 80
        labels:
          severity: "minor"
          alertname: "DclTxConnQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22208"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DclTxConnQueue is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm DclTxConnQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DclTxConnQueue
        expr: ConnectionPerf{nfType="occndra",info="DclTxConnQueue"} >= 80 < 95
        labels:
          severity: "major"
          alertname: "DclTxConnQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22208"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DclTxConnQueue is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm DclTxConnQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DclTxConnQueue
        expr: ConnectionPerf{nfType="occndra",info="DclTxConnQueue"} >= 95
        labels:
          severity: "critical"
          alertname: "DclTxConnQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22208"
          namespace: ' {{ $labels.kubernetes_namespace }} '

        annotations:
          description: 'DclTxConnQueue is {{ $value | printf " %.2f " }} above threshold 95'
          summary: 'Alarm DclTxConnQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: LongTimeoutPtrList
        expr: PTRLong{nfType="occndra",info="LongTimeoutPtrList"} >= 60 < 80
        labels:
          severity: "minor"
          alertname: "LongTimeoutPtrList"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22222"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'LongTimeoutPtrList is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm LongTimeoutPtrList , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: LongTimeoutPtrList
        expr: PTRLong{nfType="occndra",info="LongTimeoutPtrList"} >= 80 < 95
        labels:
          severity: "major"
          alertname: "LongTimeoutPtrList"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22222"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'LongTimeoutPtrList is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm LongTimeoutPtrList , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: LongTimeoutPtrList
        expr: PTRLong{nfType="occndra",info="LongTimeoutPtrList"} >= 95
        labels:
          severity: "critical"
          alertname: "LongTimeoutPtrList"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22222"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'LongTimeoutPtrList is {{ $value | printf " %.2f " }} above threshold 95'
          summary: 'Alarm LongTimeoutPtrList , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxRbarRequestMsgQueue
        expr: Application{nfType="occndra",info="RbarRequestMsgQueue"} >= 60 < 80
        labels:

          severity: "minor"
          alertname: "RxRbarRequestMsgQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22502"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxRbarRequestMsgQueue is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm RxRbarRequestMsgQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxRbarRequestMsgQueue
        expr: Application{nfType="occndra",info="RbarRequestMsgQueue"} >= 80 < 95
        labels:
          severity: "major"
          alertname: "RxRbarRequestMsgQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22502"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxRbarRequestMsgQueue is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm RxRbarRequestMsgQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxRbarRequestMsgQueue
        expr: Application{nfType="occndra",info="RbarRequestMsgQueue"} >= 95
        labels:
          severity: "critical"
          alertname: "RxRbarRequestMsgQueue"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22502"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxRbarRequestMsgQueue is {{ $value | printf " %.2f "}} above threshold 95'
          summary: 'Alarm RxRbarRequestMsgQueue , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxRbarMsgRate
        expr: rate(Application{nfType="occndra",info="RbarMsgRate"}[5m]) >= 24000 < 32000
        labels:
          severity: "minor"
          alertname: "RxRbarMsgRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22504"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxRbarMsgRate is {{ $value | printf " %.2f " }} above threshold 24000'
          summary: 'Alarm RxRbarMsgRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxRbarMsgRate
        expr: rate(Application{nfType="occndra",info="RbarMsgRate"}[5m]) >= 32000 < 38000
        labels:
          severity: "major"
          alertname: "RxRbarMsgRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22504"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxRbarMsgRate is {{ $value | printf " %.2f " }} above threshold 32000'
          summary: 'Alarm RxRbarMsgRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: RxRbarMsgRate
        expr: rate(Application{nfType="occndra",info="RbarMsgRate"}[5m]) >= 38000
        labels:
          severity: "critical"
          alertname: "RxRbarMsgRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22504"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'RxRbarMsgRate is {{ $value | printf " %.2f "}} above threshold 38000'
          summary: 'Alarm RxRbarMsgRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: ComAgentIngressStackEventRate
        expr: rate(ComAgent{nfType="occndra",info="CARx"}[5m]) >= 75000 < 80000
        labels:
          severity: "minor"
          alertname: "ComAgentIngressStackEventRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19862"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'ComAgentIngressStackEventRate is {{ $value | printf " %.2f " }} above threshold 75000'
          summary: 'Alarm ComAgentIngressStackEventRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: ComAgentIngressStackEventRate
        expr: rate(ComAgent{nfType="occndra",info="CARx"}[5m]) >= 80000 < 84000
        labels:
          severity: "major"
          alertname: "ComAgentIngressStackEventRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19862"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:

          description: 'ComAgentIngressStackEventRate is {{ $value | printf " %.2f " }} above threshold 80000'
          summary: 'Alarm ComAgentIngressStackEventRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: ComAgentIngressStackEventRate
        expr: rate(ComAgent{nfType="occndra",info="CARx"}[5m]) >= 84000
        labels:
          severity: "critical"
          alertname: "ComAgentIngressStackEventRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19862"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'ComAgentIngressStackEventRate is {{ $value | printf " %.2f " }} above threshold 84000'
          summary: 'Alarm ComAgentIngressStackEventRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DraCpu
        expr: avg_over_time(draworkerCpuutil{nfType="occndra",info="draworkerCpu"}[5m]) >= 75 < 80
        labels:
          severity: "info"
          alertname: "DraCpu"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22200"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraCpu is {{ $value | printf " %.2f "}} above threshold 75'
          summary: 'Alarm DraCpu ,NS:{{ $labels.namespace }}, timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DraCpu
        expr: avg_over_time(draworkerCpuutil{nfType="occndra",info="draworkerCpu"}[5m]) >= 80 < 85
        labels:
          severity: "minor"
          alertname: "DraCpu"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22200"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraCpu is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm DraCpu , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DraCpu
        expr: avg_over_time(draworkerCpuutil{nfType="occndra",info="draworkerCpu"}[5m]) >= 85 < 90
        labels:
          severity: "major"
          alertname: "DraCpu"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22200"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraCpu is {{ $value | printf " %.2f " }} above threshold 85'
          summary: 'Alarm DraCpu , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: DraCpu
        expr: avg by (instance,nfType) (avg_over_time(draworkerCpuutil{nfType="occndra",info="draworkerCpu"}[5m])) >= 90
        labels:
          severity: "critical"
          alertname: "DraCpu"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.22200"
          namespace: ' {{ $labels.kubernetes_namespace }} '
        annotations:
          description: 'DraCpu is {{ $value | printf " %.2f " }} above threshold 90'
          summary: 'Alarm DraCpu , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.instancename }}, release:{{ $labels.release }}'
      - alert: ComAgentQueueUtil
        expr: ComAgent{nfType="occndra",info="ComAgentQueueUtil"} >= 60 < 80
        labels:
          severity: "minor"
          alertname: "ComAgentQueueUtil"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19803"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'ComAgentQueueUtil is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm ComAgentQueueUtil , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: ComAgentQueueUtil
        expr: ComAgent{nfType="occndra",info="ComAgentQueueUtil"} >= 80 < 95
        labels:
          severity: "minor"
          alertname: "ComAgentQueueUtil"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19803"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:

          description: 'ComAgentQueueUtil is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm ComAgentQueueUtil , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: ComAgentQueueUtil
        expr: ComAgent{nfType="occndra",info="ComAgentQueueUtil"} >= 95
        labels:
          severity: "minor"
          alertname: "ComAgentQueueUtil"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19803"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'ComAgentQueueUtil is {{ $value | printf " %.2f " }} above threshold 95'
          summary: 'Alarm ComAgentQueueUtil , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: ComAgentAbnormTransEndRate
        expr: (ComAgent{nfType="occndra",info="ComAgentAbnormTransEndRate"}/100) >= 5 < 8
        labels:
          severity: "minor"
          alertname: "ComAgentAbnormTransEndRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19825"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'ComAgentAbnormTransEndRate is {{ $value | printf " %.2f " }} above threshold 5'
          summary: 'Alarm ComAgentAbnormTransEndRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: ComAgentAbnormTransEndRate
        expr: (ComAgent{nfType="occndra",info="ComAgentAbnormTransEndRate"}/100) >= 8 < 12
        labels:
          severity: "minor"
          alertname: "ComAgentAbnormTransEndRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19825"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'ComAgentAbnormTransEndRate is {{ $value | printf " %.2f " }} above threshold 8'
          summary: 'Alarm ComAgentAbnormTransEndRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: ComAgentAbnormTransEndRate

        expr: (ComAgent{nfType="occndra",info="ComAgentAbnormTransEndRate"}/100) >= 12
        labels:
          severity: "minor"
          alertname: "ComAgentAbnormTransEndRate"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19825"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'ComAgentAbnormTransEndRate is {{ $value | printf " %.2f " }} above threshold 12'
          summary: 'Alarm ComAgentAbnormTransEndRate , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: SmsQueueUtil
        expr: ComAgent{nfType="occndra",info="SmsQueueUtil"} >= 60 < 80
        labels:
          severity: "minor"
          alertname: "SmsQueueUtil"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19827"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'SmsQueueUtil is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm SmsQueueUtil , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: SmsQueueUtil
        expr: ComAgent{nfType="occndra",info="SmsQueueUtil"} >= 80 < 95
        labels:
          severity: "minor"
          alertname: "SmsQueueUtil"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19827"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'SmsQueueUtil is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm SmsQueueUtil , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: SmsQueueUtil
        expr: ComAgent{nfType="occndra",info="SmsQueueUtil"} >= 95
        labels:
          severity: "minor"
          alertname: "SmsQueueUtil"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19827"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'SmsQueueUtil is {{ $value | printf " %.2f " }} above threshold 95'
          summary: 'Alarm SmsQueueUtil , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: BDFQueueUtil
        expr: ComAgent{nfType="occndra",info="BDFQueueUtil"} >= 60 < 80
        labels:
          severity: "minor"
          alertname: "BDFQueueUtil"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19420"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'BDFQueueUtil is {{ $value | printf " %.2f " }} above threshold 60'
          summary: 'Alarm BDFQueueUtil , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: BDFQueueUtil
        expr: ComAgent{nfType="occndra",info="BDFQueueUtil"} >= 80 < 95
        labels:
          severity: "minor"
          alertname: "BDFQueueUtil"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19420"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'BDFQueueUtil is {{ $value | printf " %.2f " }} above threshold 80'
          summary: 'Alarm BDFQueueUtil , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '
      - alert: BDFQueueUtil
        expr: ComAgent{nfType="occndra",info="BDFQueueUtil"} >= 95
        labels:
          severity: "minor"
          alertname: "BDFQueueUtil"
          oid: "1.3.6.1.4.1.323.5.3.48.1.2.19420"
          namespace: ' {{ $labels.kubernetes_namespace }} '
          instancename: ' {{ $labels.SubMeasId }} '
        annotations:
          description: 'BDFQueueUtil is {{ $value | printf " %.2f " }} above threshold 95'
          summary: 'Alarm BDFQueueUtil , timestamp:{{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}, podname:{{ $labels.podname }}, instancename:{{ $labels.SubMeasId }}, release:{{ $labels.release }} '

For details about the required modifications in CNE elements for CNDRA Alerting/SNMP Integration, see CNE modification for CNDRA Alerting and SNMP Integration.
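
The threshold bands in the alert rules above can be read as a simple lookup. The following Python sketch is illustrative only and is not part of CNDRA: the names severity_for and PTR_LIST_BANDS are hypothetical, and the bounds are copied from the PtrList rules (minor at 60, major at 80, critical at 95). It shows how chained comparisons such as >= 60 < 80 place each value in at most one severity band.

```python
from typing import List, Optional, Tuple

# Lower bounds copied from the PtrList rules: (lower bound, severity).
# Each band ends where the next one begins; the last band has no upper bound.
PTR_LIST_BANDS: List[Tuple[float, str]] = [
    (60, "minor"),
    (80, "major"),
    (95, "critical"),
]


def severity_for(value: float,
                 bands: List[Tuple[float, str]] = PTR_LIST_BANDS) -> Optional[str]:
    """Return the severity of the band containing value, or None if no rule fires."""
    matched = None
    for lower, severity in sorted(bands):
        # Keep the highest band whose lower bound the value has reached.
        if value >= lower:
            matched = severity
    return matched


if __name__ == "__main__":
    for sample in (59.9, 60, 79.99, 80, 95, 100):
        print(sample, severity_for(sample))
```

Because the bands do not overlap, a value that rises from 79 to 81 stops matching the minor rule and starts matching the major rule instead.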

5 CNDRA Alarms and Events

This section provides details about the alarms and events associated with CNDRA.

Alarms provide information pertaining to a system's operational condition that a network manager may need to act upon. An alarm represents a condition change in the software that must be reported or that requires operator attention, for example, a communications link that has changed from connected to disconnected state.

Alarms and events are recorded in a database log table. Application event logging provides an efficient way to record event instance information in a manageable form, and is used to record events representing alarmed conditions. Alarms are raised in two ways:

• App based alarms are generated directly by CNDRA nodes and sent to a CNE component such as Alert Manager.

• Metric based alarms are raised by Prometheus using Alert Rules, based on metric data retrieved from CNDRA nodes.

App Alarm and Event information for CNDRA

AlarmID | Group | Name | OID | Description
8000 | DIAM | MpEvFsmException | eagleXgDiameterMpEvFsmExceptionNotify | DraWorker connection FSM exception.
8001 | DIAM | MpEvException | eagleXgDiameterMpEvExceptionNotify | DraWorker exception.
8002 | DIAM | MpEvRxException | eagleXgDiameterMpEvRxExceptionNotify | DraWorker ingress message processing exception.
8003 | DIAM | MpEvTxException | eagleXgDiameterMpEvTxExceptionNotify | DraWorker egress message processing exception.
8004 | DIAM | EvFsmAdState | eagleXgDiameterEvFsmAdStateNotify | Connection FSM admin state change.
8005 | DIAM | EvFsmOpState | eagleXgDiameterEvFsmOpStateNotify | Connection FSM operational state change.
8006 | DIAM | EvFsmException | eagleXgDiameterEvFsmExceptionNotify | Connection FSM exception.
8007 | DIAM | EvException | eagleXgDiameterEvExceptionNotify | Connection exception.
8008 | DIAM | EvRxException | eagleXgDiameterEvRxExceptionNotify | Connection ingress message processing exception.
8009 | DIAM | EvTxException | eagleXgDiameterEvTxExceptionNotify | Connection egress message processing exception.
8010 | DIAM | MpIngressDrop | eagleXgDiameterMpIngressDropNotify | DraWorker ingress message discarded or rejected.
8011 | DIAM | EcRate | eagleXgDiameterEcRateNotify | Connection egress message rate threshold crossed.

8012 | DIAM | MpRxNgnPsOfferedRate | eagleXgDiameterMpRxNgnPsOfferedRateNotify | DraWorker ingress NGN-PS message rate threshold crossed.
8013 | DIAM | MpNgnPsStateMismatch | eagleXgDiameterMpNgnPsStateMismatchNotify | DraWorker NGN-PS administrative and operational state mismatch.
8014 | DIAM | MpNgnPsDrop | eagleXgDiameterMpNgnPsDropNotify | DraWorker NGN-PS message discarded or rejected.
8015 | DIAM | NgnPsMsgMisrouted | eagleXgDiameterNgnPsMsgMisroutedNotify | NGN-PS message routed to peer CNDRA lacking NGN-PS support.
8016 | DIAM | MpP16StateMismatch | eagleXgDiameterMpP16StateMismatchNotify | MP P16 Support administrative and operational state mismatch.
8017 | DIAM | MpTaskCpuCongested | eagleXgDiameterMpTaskCpuCongestedNotify | DraWorker Task CPU utilization threshold crossed.
8018 | DIAM | P16MsgMisrouted | eagleXgDiameterP16MsgMisroutedNotify | P16 message routed to peer CNDRA lacking P16 support.
8019 | DIAM | MpAnswerPriorityModeMismatch | eagleXgDiameterMpAnswerPriorityModeMismatchNotify | DraWorker Answer Priority Mode administrative and operational state mismatch.
8020 | DIAM | Mismatch Routing Thread Pool Operational State | eagleXgDiameterMpRoutingThreadPoolStateMismatchNotify | Mismatch in Routing Thread Pool operational state and administrative state.
22001 | DIAM | Message Decoding Failure | eagleXgDiameterIngressMsgRejectedDecodingFailureNotify | Message received from a peer that was rejected because of a decoding failure. Decoding failures can include missing mandatory parameters.
22002 | DIAM | Peer Routing Rules with Same Priority | eagleXgDiameterPeerRoutingTableRulesSamePriorityNotify | A peer routing table search with a received Request message found more than one highest priority peer routing rule match.
22003 | DIAM | Application ID Mismatch with Peer | eagleXgDiameterApplicationIdMismatchWithPeerNotify | While attempting to route a request message to a peer, a peer's diameter connection was bypassed because the peer did not support the Application ID for that diameter connection.
22004 | DIAM | Maximum pending transactions allowed exceeded | eagleXgDiameterMaxPendingTxnsPerConnExceededNotify | Routing attempted to select an egress Diameter connection to forward a message but the maximum number of allowed pending transactions queued on the Diameter connection has been reached.

22005 | DIAM | No peer routing rule found | eagleXgDiameterNoPrtRuleNotify | A message not addressed to a peer (either Destination-Host AVP was absent or Destination-Host AVP was present but was not a peer's FQDN) could not be routed because no peer routing rules were found.
22007 | DIAM | Inconsistent Application ID Lists from a Peer | eagleXgDiameterSupportedAppIdsInconsistentNotify | The list of Application IDs supported by a peer during the Diameter Capabilities Exchange procedure on a particular diameter connection is not identical to one of the list of Application IDs received from the peer over a different available diameter connection to that peer.
22008 | DIAM | Orphan Answer Response Received | eagleXgDiameterOrphanAnswerResponseReceivedNotify | An Answer response was received for which no pending request transaction existed resulting in the Answer message being discarded.
22009 | DIAM | Application Routing Rules with Same Priority | eagleXgDiameterApplicationRoutingTableRulesSamePriorityNotify | An application routing table search with a received Request message found more than one highest priority application routing rule match.
22010 | DIAM | Specified DAS Route List not provisioned | eagleXgDiameterSpecifiedDasRouteListNotProvisionedNotify | The DAS Route List specified by the trigger point is not provisioned.
22012 | DIAM | Specified MCCS not provisioned | eagleXgDiameterSpecifiedMCCSNotProvisionedNotify | The Message Copy Configuration Set specified by the trigger point is not provisioned.
22013 | DIAM | DAS Peer Number of Retransmits Exceeded for Copy | eagleXgDiameterNumberOfRetransmitsExceededToDasNotify | The configured number of Message Copy retransmits has been exceeded for the DAS Peer.
22014 | DIAM | No DAS Route List specified | eagleXgDiameterNoDasRouteListSpecifiedNotify | No valid DAS Route List has been specified in the Message Copy Configuration Set.
22018 | DIAM | Maintenance Leader HA Notification to go Active | eagleXgDiameterDaMpLeaderGoActiveNotificationNotify | A DraWorker has received a notification from HA that the Maintenance Leader resource should transition to the Active role.
22019 | DIAM | Maintenance Leader HA Notification to go OOS | eagleXgDiameterDaMpLeaderGoOOSNotificationNotify | A DraWorker has received a notification from HA that the Maintenance Leader resource should transition to the OOS role.
22020 | DIAM | Copy Message size exceeded the system configured size limit | eagleXgDiameterCopyMessageSizeExceededNotify | The generated Copy message size exceeded the max message size on the system.

22021 | DIAM | Debug Routing Info AVP Enabled | eagleXgDiameterDebugRoutingInfoAvpEnabledNotify | Debug Routing Info AVP is Enabled.
22022 | DIAM | Forwarding Loop Detected | eagleXgDiameterForwardingLoopDetectedNotify | Ingress Request message received was previously processed by the local node as determined from the Route-Record AVP's received in the message.
22051 | DIAM | Peer Unavailable | eagleXgDiameterPeerUnavailableNotify | Unable to access the Diameter Peer because all of the diameter connections are Down.
22052 | DIAM | Peer Degraded | eagleXgDiameterPeerDegradedNotify | The peer has some available connections, but less than its minimum connection capacity. Continued routing to this peer may cause congestion or other overload conditions.
22053 | DIAM | Route List Unavailable | eagleXgDiameterRouteListUnavailableNotify | Route List is Unavailable
22054 | DIAM | Route List Degraded | eagleXgDiameterRouteListDegradedNotify | Route List Operational Status has changed to Degraded because the current weight of the Route List's Active Route Group has dropped below the Route List's configured minimum route group availability weight.
22055 | DIAM | Non-Preferred Route Group In Use | eagleXgDiameterNonPreferredRouteGroupInUseNotify | The highest priority Route Group of a Route List is not being utilized to route Request messages because the highest priority Route Group has either become unavailable or whose weight has dropped below the minimum weight configured for the Route List.
22056 | DIAM | Connection Admin State Inconsistency Exists | eagleXgDiameterConnAdminStateInconsistencyNotify | An operator request to change the Admin State of a diameter connection was not completely processed due to an internal error.
22057 | DIAM | ETG Rate Limit Degraded | eagleXgDiameterEtgRateLimitDegradedNotify | The ETG Rate Limit has exceeded the defined threshold.
22058 | DIAM | ETG Pending Transaction Limit Degraded | eagleXgDiameterEtgPendingTransLimitDegradedNotify | The ETG Pending Transactions Limit has exceeded the defined threshold.

22059 | DIAM | Egress Throttle Group Message Rate Congestion Level changed | eagleXgDiameterEtgRateCongestionNotify | The Egress Throttle Group Request Message rate Congestion Level has changed. This will change the Request priority that can be routed to peers and connections in the ETG.
22060 | DIAM | Egress Throttle Group Pending Transaction Limit Congestion Level changed | eagleXgDiameterEtgPendingTransCongestionNotify | The Egress Throttle Group Pending Transaction Limit Congestion Level has changed. This will change the Request priority that can be routed on peers and connections in the ETG.
22061 | DIAM | ETG Monitoring stopped | eagleXgDiameterEtgMonitoringStoppedNotify | ETG Rate and Pending Transaction Monitoring is stopped on all configured ETGs.
22062 | DIAM | Actual Host Name cannot be determined for Topology Hiding | eagleXgDiameterTopoHidingActualHostNameNotFoundNotify | Topology Hiding could not be applied because the Actual Host Name could not be determined.
22063 | DIAM | Diameter Max Message Size Limit Exceeded | eagleXgDiameterDiameterMaxMsgSizeLimitExceededNotify | The size of the message encoded by Peer CNDRA has exceeded its max limits.
22064 | DIAM | Upon receiving Redirect Host Notification the Request has not been submitted for re-routing | eagleXgDiameterRxRedirectHostNotRoutedNotify | Peer CNDRA received a Redirect Host Notification that it can accept for processing but cannot continue due to some reason.
22065 | DIAM | Upon receiving Redirect Realm Notification the Request has not been submitted for re-routing | eagleXgDiameterRxRedirectRealmNotRoutedNotify | Peer CNDRA received a Redirect Realm Notification that it can accept for processing but cannot continue due to some reason.

22066 | DIAM | ETG-ETL Scope Inconsistency | eagleXgDiameterEtgEtlScopeInconsistencyNotify | ETG Control Scope Inconsistency.
22067 | DIAM | ETL - ETG Invalid Association | eagleXgDiameterEtgEtlInvalidAssocNotify | Invalid ETL - ETG Association.
22068 | DIAM | TtpEvDoicException | eagleXgDiameterTtpEvDoicExceptionNotify | DOIC Protocol Error.
22069 | DIAM | Valid DOIC OLR Applied to TTP | eagleXgDiameterTtpEvDoicOlrNotify | A DOIC OverLoad Request (OLR) was received from a Peer Node and applied to a configured TTP.
22070 | DIAM | TtpEvDegraded | eagleXgDiameterTtpEvDegradedNotify | TTP Degraded.
22071 | DIAM | TTG Loss Percent Changed | eagleXgDiameterTtgEvLossChgNotify | TTG's Loss Percentage was modified.
22072 | DIAM | TTP Degraded | eagleXgDiameterTtpDegradedNotify | The TTP's Operational Status has been changed to Degraded.
22073 | DIAM | TTP Throttling Stopped | eagleXgDiameterTtpThrottlingStoppedNotify | TTP rate throttling has been suspended due to an internal failure.
22074 | DIAM | TTP Maximum Loss Percentage Threshold Exceeded | eagleXgDiameterTtpMaxLossPercentageExceededNotify | The Maximum Loss Percentage Threshold assigned to the TTP has been exceeded.
22075 | DIAM | Message is not routed to application | eagleXgDiameterArtMatchAppUnavailableNotify | ART Rule-X is selected but message is not routed because Peer CNDRA Application is Disabled or not Available.
22076 | DIAM | TTG Maximum Loss Percentage Threshold Exceeded | eagleXgDiameterTtgMaxLossPercentageExceededNotify | The Maximum Loss Percentage Threshold assigned to the Route Group within the Route List has been exceeded.
22077 | DIAM | Excessive Request Reroute Threshold Exceeded | eagleXgDiameterMpExcessiveRequestRerouteNotify | Request reroutes due to Answer response and/or Answer timeout has exceeded the configured threshold on the DraWorker server.
22078 | DIAM | Loop or Maximum Depth Exceeded in ART or PRT Search | eagleXgDiameterNestedArtPrtSearchErrorNotify | Loop or Maximum Depth Exceeded in ART or PRT Search.

22081 | DIAM | Received Invalid Origin Host in Answer or TTP state has changed for DOIC processing. | eagleXgDiameterInvalidOriginHostAvpForDoicNotify | Received Invalid Origin Host in Answer or TTP state has changed for DOIC processing.
22101 | DIAM | FsmOpStateUnavailable | eagleXgDiameterFsmOpStateUnavailableNotify | Connection is operationally unavailable.
22102 | DIAM | FsmOpStateDegraded | eagleXgDiameterFsmOpStateDegradedNotify | Connection is operationally degraded.
22103 | DIAM | SctpPathUnavailable | eagleXgDiameterSctpPathUnavailableNotify | SCTP multi-homed connection has operationally unavailable path.
22104 | DIAM | SctpPathMismatch | eagleXgDiameterSctpPathMismatchNotify | SCTP multi-homed connection has path mismatch.
22200 | DIAM | MpCpuCongested | eagleXgDiameterMpCpuCongestedNotify | DraWorker CPU utilization threshold crossed.
22201 | DIAM | MpRxAllRate | eagleXgDiameterMpRxAllRateNotify | DraWorker ingress message rate threshold crossed.
22202 | DIAM | MpDiamMsgPoolCongested | eagleXgDiameterMpDiamMsgPoolCongestedNotify | DraWorker Diameter message pool utilization threshold crossed.
22203 | DIAM | PTR Buffer Pool Utilization | eagleXgDiameterPtrBufferPoolUtilNotify | The DraWorker's PTR buffer pool is approaching its maximum capacity.
22204 | DIAM | Request Message Queue Utilization | eagleXgDiameterRequestMessageQueueUtilNotify | The DraWorker's request message queue utilization is approaching its maximum capacity.
22205 | DIAM | Answer Message Queue Utilization | eagleXgDiameterAnswerMessageQueueUtilNotify | The DraWorker's answer message queue utilization is approaching its maximum capacity.
22206 | DIAM | Reroute Queue Utilization | eagleXgDiameterRerouteQueueUtilNotify | The DraWorker's reroute queue utilization is approaching its maximum capacity.
22207 | DIAM | DclTxTaskQueueCongested | eagleXgDiameterDclTxTaskQueueCongestedNotify | DCL egress task message queue utilization threshold crossed.
22208 | DIAM | DclTxConnQueueCongested | eagleXgDiameterDclTxConnQueueCongestedNotify | DCL egress connection message queue utilization threshold crossed.
22209 | DIAM | Message Copy Disabled | eagleXgDiameterMessageCopyDisabledNotify | Diameter Message Copy is Disabled.
22214 | DIAM | Message Copy Queue Utilization | eagleXgDiameterMsgCopyQueueUtilNotify | The DraWorker's Message Copy queue utilization is approaching its maximum capacity.

22221 | DIAM | Routing MPS Rate | eagleXgDiameterRoutingMpsRateNotify | Message processing rate for this DraWorker is approaching or exceeding its engineered traffic handling capacity.
22222 | DIAM | Long Timeout PTR Buffer Pool Utilization | eagleXgDiameterLongTimeoutPtrBufferPoolUtilNotify | The DraWorker's Long Timeout PTR buffer pool is approaching its maximum capacity.
22223 | DIAM | MpMemCongested | eagleXgDiameterMpMemCongestedNotify | DraWorker memory utilization threshold crossed.
22224 | DIAM | Average Hold Time Limit Exceeded | eagleXgDiameterAvgHoldTimeLimitExceededNotify | The average transaction hold time has exceeded its configured limits.
22225 | DIAM | MpRxDiamAllLen | eagleXgDiameterMpRxDiamAllLenNotify | DraWorker diameter average ingress message length threshold crossed.
22328 | DIAM | IcRate | eagleXgDiameterIcRateNotify | Connection ingress message rate threshold crossed.
22350 | DIAM | Connection Alarm Aggregation Threshold Reached | eagleXgDiameterConnUnavailableThresholdReachedNotify | Connection Alarm Aggregation Threshold Reached.
22400 | RBAR | Message Decoding Failure | eagleXgDiameterRbarMsgRejectedDecodingFailureNotify | Message received was rejected because of a decoding failure.
22401 | RBAR | Unknown Application ID | eagleXgDiameterRbarUnknownApplIdNotify | Message could not be routed because the Diameter Application ID is not supported.
22402 | RBAR | Unknown Command Code | eagleXgDiameterRbarUnknownCmdCodeNotify | Message could not be routed because the Diameter Command Code in the ingress Request message is not supported and the Routing Exception was configured to send an Answer response.
22403 | RBAR | No Routing Entity Address AVPs | eagleXgDiameterRbarNoRoutingEntityAddrAvpNotify | Message could not be routed because no address AVPs were found in the message and the Routing Exception was configured to send an Answer response.
22404 | RBAR | No valid Routing Entity Addresses found | eagleXgDiameterRbarNoValidRoutingEntityAddrFoundNotify | Message could not be routed because none of the address AVPs contained a valid address and the Routing Exception was configured to send an Answer response.

22405 | RBAR | Valid address received didn't match a provisioned address or address range | eagleXgDiameterRbarAddrMismatchWithProvisionedAddressNotify | Message could not be routed because a valid address was found that did not match an individual address or address range associated with the Application ID, Command Code, and Routing Entity Type and, the Routing Exception was configured to send an Answer response.
22406 | RBAR | Routing attempt failed due to internal resource exhaustion | eagleXgDiameterRbarRoutingAttemptFailureInternalResExhNotify | Message could not be routed because the internal Request Message Queue to the Peer CNDRA Relay Agent was full.
22407 | RBAR | Routing attempt failed due to internal database inconsistency failure. | eagleXgDiameterRbarRoutingFailureInternalDbInconsistencyNotify | Message could not be routed because an internal address resolution run-time database inconsistency was encountered.
22411 | RBAR | Address Range Lookup for Local Identifier skipped | eagleXgDiameterRbarLocalIdentifierLookupSkippedNotify | Address Range Lookup could not be performed for Local Identifier component of Routing Entity Type External Identifier. Address Resolution used the Destination found using Domain Identifier.
22500 | APPL | Peer CNDRA Application Unavailable | eagleXgDiameterCndraApplicationUnavailableNotify | Peer CNDRA Application is unable to process any messages because it is Unavailable.
22501 | APPL | Peer CNDRA Application Degraded | eagleXgDiameterCndraApplicationDegradedNotify | Unable to forward Requests to the Peer CNDRA Application because it is Degraded.
22502 | APPL | Peer CNDRA Application Request Message Queue Utilization | eagleXgDiameterCndraApplicationRequestQueueUtilNotify | The Peer CNDRA Application Request Message Queue Utilization is approaching its maximum capacity.
22503 | APPL | Peer CNDRA Application Answer Message Queue Utilization | eagleXgDiameterCndraApplicationAnswerQueueUtilNotify | The Peer CNDRA Application Answer Message Queue Utilization is approaching its maximum capacity.

22504 | APPL | Peer CNDRA Application Ingress Message Rate | eagleXgDiameterCndraApplicationIngressMsgRateNotify | The ingress message rate for the Peer CNDRA Application is approaching or exceeding its engineered traffic handling capacity.
22520 | APPL | Peer CNDRA Application Enabled | eagleXgDiameterCndraApplicationEnabledNotify | Peer CNDRA Application Admin state was changed to 'enabled'.
22521 | APPL | Peer CNDRA Application Disabled | eagleXgDiameterCndraApplicationDisabledNotify | Peer CNDRA Application Admin state was changed to 'disabled'.
22900 | DIAM | DPI DB Table Monitoring Overrun | eagleXgDiameterDpiTblMonCbOnLogOverrunNotify | DPI DB Table Monitoring Overrun has occurred. The COMCOL update sync log used by DB Table monitoring to synchronize Diameter Connection Status among all DraWorker RT-DBs has overrun. The DraWorker's Diameter Connection Status sharing table is automatically audited and re-synced to correct any inconsistencies.
22901 | DIAM | DPI DB Table Monitoring Error | eagleXgDiameterDpiSldbMonAbnormalErrorNotify | An unexpected error occurred during DB Table Monitoring.
22950 | DIAM | Connection Status Inconsistency Exists | eagleXgDiameterConnStatusInconsistencyExistsNotify | Diameter Connection status inconsistencies exist among the DraWorkers in the Peer CNDRA signaling NE.
22960 | DIAM | DraWorker Profile Not Assigned | eagleXgDiameterDaMpProfileNotAssignedNotify | A DraWorker configuration profile has not been assigned to this DraWorker.
22961 | DIAM | Insufficient Memory for Feature Set. | eagleXgDiameterInsufficientAvailMemNotify | Available Memory for Feature Set is less than the Required Memory.
25500 | DIAM | No DraWorker Leader Detected | eagleXgDiameterNoDaMpLeaderDetectedNotify | No DraWorker Leader Detected.
25510 | DIAM | Multiple DraWorker Leader Detected | eagleXgDiameterMultipleDaMpLeadersDetectedNotify | Multiple DraWorker Leader Detected.
25611 | DIAM | ETG - Invalid DRMP Attributes | eagleXgDiameterEtgInvalidDRMPAttrbsNotify | DRMP attributes of ETG not in synch with Remote ETGs associated with same ETL.
25612 | DIAM | Peer CNDRA ping failed | eagleXgDiameterPingAllLivePeerErrorNotify | Peer CNDRA ping echo to next hops have failed. See bin/pingAllLivePeer.

25805 | Peer CNDRA | Invalid Shared TTG Reference | eagleXgDiameterDoicInvalidSharedTtgRefNotify | Invalid Shared TTG Reference
25806 | Peer CNDRA | Invalid Internal Overseer Server Group Designation | eagleXgDiameterDoicInvalidInternalSoamSgDesignationNotify | Invalid Internal Overseer Server Group Designation

See the following sections for detailed information on alarms and events:

• Diameter Alarms and Events (8000-8299, 22000-22350, 22900-22999,25600-25899)

• Range Based Address Resolution (RBAR) Alarms and Events (22400-22424)

• Generic Application Alarms and Events (22500-22599)
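The ID ranges in the list above can be used to find the section that documents a given alarm. A minimal sketch, assuming only the ranges and section titles shown in that list (the function and variable names are hypothetical):

```python
from typing import Optional

# Inclusive ID ranges copied from the section list above.
SECTION_RANGES = {
    "Diameter Alarms and Events": [
        (8000, 8299), (22000, 22350), (22900, 22999), (25600, 25899),
    ],
    "Range Based Address Resolution (RBAR) Alarms and Events": [(22400, 22424)],
    "Generic Application Alarms and Events": [(22500, 22599)],
}


def section_for_alarm(alarm_id: int) -> Optional[str]:
    """Return the section title whose ranges contain alarm_id, or None."""
    for section, ranges in SECTION_RANGES.items():
        if any(low <= alarm_id <= high for low, high in ranges):
            return section
    return None
```

An ID that falls outside every listed range returns None, so the caller can fall back to the summary table.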

Diameter Alarms and Events (8000-8299, 22000-22350, 22900-22999, 25600-25899)

8000 - MpEvFsmException

8000 - 001 - MpEvFsmException_SocketFailure

Event Type: DIAM

Description: DraWorker connection FSM exception.

Severity: Info

Instance: <DraWorker Name>:001

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterMpEvFsmException

1. Recovery

1. This event is potentially caused by the Peer CNDRA process reaching its descriptor capacity.

2. This event is unexpected. It is recommended to contact My Oracle Support for assistance.
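The first recovery step above points at descriptor capacity. As a minimal sketch of how descriptor usage can be compared with its limit, the following Python inspects the current process on a Linux host. It assumes /proc is mounted, it does not inspect the CNDRA process itself, and the function names and the 0.9 fraction are hypothetical.

```python
import os
import resource


def descriptor_usage():
    """Return (open_descriptors, soft_limit) for the current process."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # On Linux, /proc/self/fd holds one entry per open descriptor.
    # The count includes the descriptor briefly opened to list the directory.
    open_descriptors = len(os.listdir("/proc/self/fd"))
    return open_descriptors, soft_limit


def near_capacity(open_descriptors, soft_limit, fraction=0.9):
    """Return True when usage is at or above the given fraction of the limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # No finite limit to approach.
    return open_descriptors >= fraction * soft_limit
```

For example, `near_capacity(*descriptor_usage())` reports whether the calling process is within 10% of its soft limit.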

8000 - 002 - MpEvFsmException_BindFailure

Event Type: DIAM

Description: DraWorker connection FSM exception.

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

1. Potential causes of this event are:

• Network interface(s) are down.

• Port is already in use by another process.

• Configuration is invalid.
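The first two causes above (an interface that is down, or a port already in use) can both be observed as a failed bind. A minimal sketch, assuming an IPv4 TCP listener; the function name is hypothetical, and a real Diameter listener may use SCTP or other socket options:

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Covers "address already in use", an address that is not
            # assigned to a local interface, and permission errors.
            return False
        return True
```

The probe socket is closed as soon as the check completes, so it does not hold the port.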


8000 - 003 - MpEvFsmException_OptionFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

1. Potential causes of this event are:

• Peer CNDRA process is not running with root permission.



8000 - 004 - MpEvFsmException_AcceptorCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

• This event is potentially caused by a network or upgrade event that resulted in a synchronization of peer connection attempts.

Note:

The rate will ease over time as an increasing number of connections are accepted.

8000 - 101 - MpEvFsmException_ListenFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

• This event is unexpected. It is recommended to contact My Oracle Support for assistance.

8000 - 102 - MpEvFsmException_PeerDisconnected

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

• No action required.

8000 - 103 - MpEvFsmException_PeerUnreachable

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

• Potential causes for this event are:

• A host IP interface is down.

• A host IP interface is unreachable from the peer.

• A peer IP interface is down.

• A peer IP interface is unreachable from the host.
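The causes above all come down to whether a transport connection can be opened between host and peer. A minimal sketch of such a probe, assuming the peer listens on TCP (the function name and default timeout are hypothetical, and an SCTP peer would need a different probe):

```python
import socket


def tcp_reachable(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

Running the probe from both the host and the peer side helps separate a down interface from a one-way reachability problem.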

8000 - 104 - MpEvFsmException_CexFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


• The peer is misconfigured.

• The host is misconfigured.



8000 - 105 - MpEvFsmException_CerTimeout

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8000 - 106 - MpEvFsmException_AuthenticationFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery






8000 - 201 - MpEvFsmException_UdpSocketLimit

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery:

• The Peer CNDRA supports up to a preconfigured maximum number of open UDP sockets. One or more peers are being routed more traffic than is normally expected, or the peers are responding slowly, causing more than the usual number of UDP sockets to be opened. The concerned peer can be identified using the reported connection ID. Investigate the reason for higher than normal traffic being forwarded to the peer, or why the peer is slow to respond.

8001 - MpEvException

8001 - 001 - MpEvException_Oversubscribed

Event Type: DIAM

Description: DraWorker exception.

Severity: Info

HA Score: Normal

Throttle Seconds: None

OID: eagleXgDiameterMpEvException

1. Recovery

• Bounce one or more floating connections to force their migration to another DraWorker with available capacity.

8002 - MpEvRxException

8002 - 001 - MpEvRxException_DiamMsgPoolCongested

Event Type: DIAM

Description: DraWorker ingress message processing exception.

Severity: Info

Instance: <DraWorker Name>:001

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterMpEvRxException

1. Recovery

• Potential causes of this event are:

• One or more DraWorkers are unavailable and traffic has been distributed to the remaining DraWorkers.

• One or more peers are generating more traffic than is nominally expected.

• There are an insufficient number of DraWorkers provisioned.

• One or more peers are answering slowly, causing a backlog of pending transactions.

8002 - 002 - MpEvRxException_MaxMpsExceeded

Event Type: DIAM

Description: DraWorker ingress message processing exception.

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

• This event is potentially caused when a peer is generating more traffic than is nominally expected.

8002 - 003 - MpEvRxException_CpuCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery



• Configuration requires more CPU for message processing than is nominally expected.

• One or more peers are answering slowly, causing a backlog of pending transactions.

8002 - 004 - MpEvRxException_SigEvPoolCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8002 - 005 - MpEvRxException_DstMpUnknown

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery




8002 - 006 - MpEvRxException_DstMpCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery



• Configuration requires more CPU for message processing than is nominally expected.


8002 - 007 - MpEvRxException_DrlReqQueueCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8002 - 008 - MpEvRxException_DrlAnsQueueCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8002 - 009 - MpEvRxException_ComAgentCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8002 - 201 - MpEvRxException_MsgMalformed

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

• This event is unexpected. It is recommended to contact My Oracle Support for assistance. The peer may have an implementation defect.

8002 - 202 - MpEvRxException_PeerUnknown

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

• The host or peer may be misconfigured. Adjust the peer IP address(es) option of the associated Peer Node if necessary.

8002 - 204 - MpEvRxException_ItrPoolCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery:

1. Adjust the RADIUS Cached Response Duration option of the associated Connection configuration set(s) to reduce the lifetime of cached transactions, if needed.

2. If one or more MPs in a server site have failed, the traffic will be distributed between the remaining MPs in the server site.

3. The misconfiguration of Diameter peers may result in too much traffic being distributed to the MP. Each MP in the server site should be receiving approximately the same number of ingress transactions per second.

4. There may be an insufficient number of MPs configured to handle the network traffic load. If all MPs are in a congestion state then the offered load to the server site is exceeding its capacity.

5. A software defect may exist resulting in PTR buffers not being deallocated to the pool. This alarm should not normally occur when no other congestion alarms are asserted. The alarm log should be examined.

6. If the problem persists, it is recommended to contact My Oracle Support.
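Step 3 above expects every MP in a server site to carry roughly the same ingress rate. A minimal sketch of that check, assuming per-MP ingress transactions per second have already been collected into a dictionary. The function name, the 20% tolerance, and the use of the median as the reference value are all hypothetical choices.

```python
from statistics import median
from typing import Dict, List


def unbalanced_mps(tps_by_mp: Dict[str, float], tolerance: float = 0.2) -> List[str]:
    """Return MPs whose ingress TPS differs from the site median by more than tolerance."""
    if not tps_by_mp:
        return []
    reference = median(tps_by_mp.values())
    if reference == 0:
        return []
    return sorted(
        name
        for name, tps in tps_by_mp.items()
        if abs(tps - reference) / reference > tolerance
    )
```

The median is used instead of the mean so that a single overloaded MP does not shift the reference value and make the healthy MPs look unbalanced.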

8002 - 207 - MpEvRxException_ReqDuplicate

Event Type: DIAM

Description: Connection ingress message processing exception.

Severity: Info

Instance: <Connection Name>:207

HA Score: Normal

Throttle Seconds: 10


1. Recovery:

1. It is possible to observe this event occasionally, due to the unreliable nature of the UDP transport protocol. However, if the occurrence of this event is frequent, investigate the issue further.

This event is expected when a retransmission is received from the client before a server has responded to the request, possibly a result of the client retransmitting too quickly before allowing sufficient time for a server to respond in time. Another possible cause is if one or more servers configured to handle the request are non-responsive.

2. Investigate the routing configuration to narrow down the list of servers (Peer Nodes) which are expected to handle requests from the reported server connection.

3. Evaluate whether an Egress Transaction Failure Rate alarm has been raised for any of the corresponding client connections. If so, investigate the cause of the server becoming non-responsive and address the condition.

Note:

Depending on the operator's choice, the client connection may need to be Admin Disabled until the evaluation is complete, which will allow requests to be routed to other servers, depending on the routing configuration. If this is not the case, tune the client's retransmit timers to be greater than the typical turnaround time for the request to be processed by the server and for the response to be sent back to the client.
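The note above asks for a retransmit timer that exceeds the typical request turnaround time. One way to derive a candidate value from measured turnaround samples is sketched below. The nearest-rank percentile, the 1.5 margin, and the function name are hypothetical choices and are not CNDRA parameters.

```python
import math
from typing import Sequence


def suggested_retransmit_timer_ms(
    turnaround_ms: Sequence[float],
    percentile: float = 0.95,
    margin: float = 1.5,
) -> float:
    """Suggest a timer: nearest-rank percentile of the samples times a margin."""
    if not turnaround_ms:
        raise ValueError("at least one turnaround sample is required")
    ordered = sorted(turnaround_ms)
    # Nearest-rank method: the smallest sample covering the requested share.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] * margin
```

A high percentile keeps occasional slow responses from triggering retransmissions, at the cost of slower recovery when a request is genuinely lost.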




8003 - MpEvTxException

8003 - 001 - MpEvTxException_ConnUnknown

Event Type: DIAM

Description: DraWorker egress message processing exception.

Severity: Info

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterMpEvTxException

1. Recovery


8003 - 101 - MpEvTxException_DclTxTaskQueueCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


• This event is potentially caused by one or more peers being routed more traffic than is nominally expected.

8003 - 202 - MpEvTxException_EtrPoolCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery:

1. Adjust the Diameter configuration set(s) to reduce the lifetime of pending transactions, if needed.




5. A software defect may exist resulting in PTR buffers not being deallocated to the pool. This alarm should not normally occur when no other congestion alarms are asserted.


8004 - EvFsmAdState

8004 - 001 - EvFsmAdState_StateChange

Event Type: DIAM

Description: Connection FSM administrative state change.

Severity: Info

HA Score: Normal

OID: eagleXgDiameterEvFsmAdState

1. Recovery


8005 - EvFsmOpState

8005 - 001 - EvFsmOpState_StateChange

Event Type: DIAM

Description: Connection FSM operational state change.

Severity: Info

HA Score: Normal

OID: eagleXgDiameterFsmOpState

1. Recovery

1. No action required when operationally available.

2. Potential causes for this event when operationally unavailable are:

• Connection is administratively disabled.

• Diameter initiator connection is connecting.

• Diameter initiator connection is suppressed (peer is operationally available).


• Diameter initiator connection is suppressed (peer did not signal reboot during graceful disconnect).

• Diameter responder connection is listening.

• RADIUS server connection is opening.

3. Potential causes for this event when operationally degraded are:

• Connection egress message rate threshold crossed.

• Diameter connection is in watchdog proving.

• Diameter connection is in graceful disconnect.

• Diameter peer signaled remote busy.

• Diameter connection is in transport congestion.

8006 - EvFsmException

8006 - 001 - EvFsmException_DnsFailure

Event Type: DIAM

Description: Connection FSM exception.

Severity: Info

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterEvFsmException

1. Recovery

• Potential causes of this event are:

• DNS server configuration is invalid.

• DNS server(s) are unavailable.

• DNS server(s) are unreachable.

• FQDN configuration is invalid.

8006 - 002 - EvFsmException_ConnReleased

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8006 - 101 - EvFsmException_SocketFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery

1. This event is potentially caused by the Peer CNDRA process reaching its descriptor capacity.




8006 - 102 - EvFsmException_BindFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

1. Potential causes for this event are:

• Network interface(s) are down.

• Port is already in use by another process.



8006 - 103 - EvFsmException_OptionFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

1. Potential causes for this event are:

• Peer CNDRA process is not running with root permission.



8006 - 104 - EvFsmException_ConnectFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8006 - 105 - EvFsmException_PeerDisconnected

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery

• No action required. Potential causes for this event are:

• Diameter peer signaled DPR.

• Peer is unavailable.

8006 - 106 - EvFsmException_PeerUnreachable

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


• A host IP interface is down.

• A host IP interface is unreachable from the peer.

• A peer IP interface is down.

• A peer IP interface is unreachable from the host.

8006 - 107 - EvFsmException_CexFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery




8006 - 108 - EvFsmException_CeaTimeout

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery




8006 - 109 - EvFsmException_DwaTimeout

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8006 - 110 - EvFsmException_DwaTimeout

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery




8006 - 111 - EvFsmException_ProvingFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


• A host IP interface is unreachable from the peer, or intermittently so.

• A peer IP interface is unreachable from the host, or intermittently so.

8006 - 112 - EvFsmException_WatchdogFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery




• A host IP interface is unreachable from the peer, or intermittently so.

• A peer IP interface is unreachable from the host, or intermittently so.

8006 - 113 - EvFsmException_AuthenticationFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery




8007 - EvException

8007 - 101 - EvException_MsgPriorityFailure

Event Type: DIAM

Description: Connection exception.

Severity: Info

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterEvException

1. Recovery

• This event is potentially caused by misconfiguration of the host.

8008 - EvRxException

8008 - 001 - EvRxException_MaxMpsExceeded

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterEvRxException

1. Recovery

• This event is potentially caused when a peer is generating more traffic than is nominally expected.

8008 - 101 - EvRxException_MsgMalformed

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8008 - 102 - EvRxException_MsgInvalid

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery


8008 - 202 - EvRxException_MsgAttrLenUnsupported

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery:


8008 - 203 - EvRxException_MsgTypeUnsupported

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

• This event is unexpected. It is recommended to contact My Oracle Support for assistance. The peer may have an implementation defect or may be misconfigured.

8008 - 204 - EvRxException_AnsOrphaned

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

• The peer is responding slowly, network latency is high, or the ETR timer is configured too small. Adjust the Diameter configuration set(s) to reduce the lifetime of pending transactions, if needed.

8008 - 205 - EvRxException_AccessAuthMissing

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery:


8008 - 206 - EvRxException_StatusAuthMissing

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery:


8008 - 207 - EvRxException_MsgAuthInvalid

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

1. Evaluate the indicated message. If an invalid message authenticator value is indicated, ensure that the same shared secret is configured for the connection on the Peer CNDRA and on the RADIUS peer.

2. If an invalid message authenticator value is not indicated, then the peer may have an implementation defect or may be misconfigured. It is recommended to contact My Oracle Support for assistance. This event is unexpected.



8008 - 208 - EvRxException_ReqAuthInvalid

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

• This event is unexpected. It is recommended to contact My Oracle Support for assistance. The peer may be misconfigured.

8008 - 209 - EvRxException_AnsAuthInvalid

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

• This event is unexpected. It is recommended to contact My Oracle Support for assistance. The peer may be misconfigured.



8008 - 210 - EvRxException_MsgAttrAstUnsupported

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

1. This event is unexpected. It is recommended to contact My Oracle Support for assistance. The peer may have an implementation defect or may be misconfigured.

2. Only certain Acct-Status-Type values are supported. Ensure that the Acct-Status-Type value is one of these values:

• 1 (Start)

• 2 (Stop)

• 3 (Interim-Update)

• 7 (Accounting-On)

• 8 (Accounting-Off)
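The check in step 2 can be sketched as a small lookup. This is an illustrative Python sketch only, not Peer CNDRA code: the dictionary and function names are hypothetical, and the values are copied from the list above.

```python
# Supported Acct-Status-Type values, copied from the list above.
# The names below are hypothetical and exist only for this sketch.
SUPPORTED_ACCT_STATUS_TYPES = {
    1: "Start",
    2: "Stop",
    3: "Interim-Update",
    7: "Accounting-On",
    8: "Accounting-Off",
}


def is_supported_acct_status_type(value: int) -> bool:
    """Return True if the Acct-Status-Type value is in the supported list."""
    return value in SUPPORTED_ACCT_STATUS_TYPES
```

A message carrying any other Acct-Status-Type value would be expected to trigger this event.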

8008 - 212 - EvRxException_MsgTypeMissingMccs

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

• It is recommended to contact My Oracle Support for assistance. The peer or host is misconfigured.

8008 - 213 - EvRxException_ConnUnavailable

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery:

• No action required. This event is for informational purposes only.

8009 - EvTxException

8009 - 001 - EvTxException_ConnUnavailable

Event Type: DIAM

Description: Connection egress message processing exception.

Severity: Info

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterEvTxException

1. Recovery


8009 - 101 - EvTxException_DclTxConnQueueCongested

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery

• This event is potentially caused by a peer being routed more traffic than is nominally expected.

8009 - 102 - EvTxException_DtlsMsgOversized

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery

• This event is potentially caused by a peer being routed more traffic than is nominally expected.

8009 - 201 - EvTxException_MsgAttrLenUnsupported

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery:


8009 - 202 - EvTxException_MsgTypeUnsupported

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

• This event is unexpected. It is recommended to contact My Oracle Support for assistance. The peer may have an implementation defect or may be misconfigured.

8009 - 203 - EvTxException_MsgLenInvalid

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

2. This event is typically generated when the Peer CNDRA needs to add a Message-Authenticator to the message, but doing so causes the message size to exceed the maximum RADIUS message length. If this problem persists, evaluate the source of this message and ensure that the message size allows adding a Message-Authenticator attribute (16 octets). Evaluate the message authenticator configuration for the egress connection and ensure that the adding of Message-Authenticator to specific message types is configured appropriately.
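The size condition described above can be sketched in Python. This is a minimal sketch under stated assumptions, not Peer CNDRA code: the function name is hypothetical, the 4096-octet maximum is the RADIUS packet limit from RFC 2865, and the attribute is counted as 18 octets on the wire (a 1-octet type, a 1-octet length, and the 16-octet value mentioned above).

```python
RADIUS_MAX_MESSAGE_LEN = 4096   # maximum RADIUS packet length (RFC 2865)
MESSAGE_AUTHENTICATOR_LEN = 18  # type (1) + length (1) + 16-octet value


def can_add_message_authenticator(current_len: int) -> bool:
    """Return True if a Message-Authenticator attribute still fits.

    current_len is the encoded length of the RADIUS message, in octets,
    before the attribute is appended (hypothetical helper for this sketch).
    """
    return current_len + MESSAGE_AUTHENTICATOR_LEN <= RADIUS_MAX_MESSAGE_LEN
```

A message source that routinely produces messages within 18 octets of the limit leaves no room for the attribute, which is the situation this event reports.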

8009 - 204 - EvTxException_ReqOnServerConn

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

1. This event is unexpected. It is recommended to contact My Oracle Support for assistance. The peer may be misconfigured.

2. Review the configuration of Route Groups and ensure that there are no RADIUS server instances.

8009 - 205 - EvTxException_AnsOnClientConn

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

2. Review the configuration of Connections and ensure that there are no RADIUS client instances being used as a RADIUS server by one or more peers.



8009 - 206 - EvTxException_DiamMsgMisrouted

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

2. Review the configuration of Route Groups and ensure that there are no RADIUS server instances.

8009 - 207 - EvTxException_ReqDuplicate

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10


1. Recovery:




8009 - 208 - EvTxException_WriteFailure

Event Type: DIAM

Severity: Info

HA Score: Normal

Throttle Seconds: 10

1. Recovery:

1. This event is unexpected. It is recommended to contact My Oracle Support for assistance. The peer may be misconfigured.

2. Ensure that the RADIUS UDP Transmit Buffer Size is sufficient for the offered traffic load.

8010 - MpIngressDrop

Alarm Group: DIAM

Description: An ingress message is discarded or rejected.

Severity: Major

Instance: <DraWorker Name>

HA Score: Normal

Auto Clear Seconds: 30

OID: eagleXgDiameterMpIngressDrop

Cause: An ingress message is discarded or rejected in the following congestion scenarios:

• Connection maximum message rate exceeded (ingress control).

• DraWorker maximum message rate exceeded (ingress control).

• DraWorker CPU congestion (overload control).

• Diameter message pool congested (routing ingress).

• Signaling event pool congested (routing ingress).

• Destination DraWorker unknown (routing ingress).

• Destination DraWorker congested (routing ingress).

• DRL request message queue congested (routing ingress).

• DRL answer message queue congested (routing ingress).

Diagnostic Information: Collect the following information to diagnose the cause before contacting Oracle Support:

• Event History on active SO server.

• Savelogs of all MPs.

• Peer CNDRA logs of all MPs.

1. Recovery:

• Potential causes of this alarm are:

• One or more DraWorkers are unavailable and traffic has been distributed to the remaining DraWorkers.

• An insufficient number of DraWorkers is provisioned.


8011 - EcRate

Alarm Group: DIAM

Description: Connection egress message rate threshold crossed.

Severity: Minor, Major, Critical

Instance: <Connection Name>

HA Score: Normal

Auto Clear Seconds: 0 (zero)

OID: eagleXgDiameterEmr

Cause: Connection egress message rate threshold crossed.



• Savelogs of the MP server.

• Peer CNDRA logs of the MP server.

1. Recovery:

1. This alarm is potentially caused when a peer has been routed more traffic than is nominally expected.

2. The adjacent Diameter Peer is unable to handle the rate of egress message traffic currently being offered on a connection.

3. TCP/SCTP buffers filling up on the egress side.

8012 - MpRxNgnPsOfferedRate

Alarm Group: DIAM

Description: DraWorker ingress NGN-PS message rate threshold crossed.

Severity: Major

Instance: MpRxNgnPsOfferedRate, DIAM

HA Score: Normal

OID: eagleXgDiameterMpRxNgnPsOfferedRateNotify

Cause: DraWorker ingress NGN-PS message rate threshold crossed. The alarm clears when the threshold crossing abates.

Diagnostic Information: N/A



1. Recovery:

1. Check whether one or more DraWorkers are unavailable and traffic has been distributed to the remaining DraWorkers.

2. Check whether one or more peers are generating more traffic than is nominally expected.

3. Check whether an insufficient number of DraWorkers is provisioned.

4. This alarm clears when the threshold crossing abates.

8013 - MpNgnPsStateMismatch

Alarm Group: DIAM

Description: DraWorker NGN-PS administrative and operational state mismatch.

Severity: Major

HA Score: Normal

OID: eagleXgDiameterMpNgnPsStateMismatch

Cause: The alarm is raised when the administrative state of NGN-PS is not aligned with the operational state. The alarm clears when the administrative and operational states are aligned.


• The details of active SO server.


1. Recovery:

1. This alarm is potentially caused when a DraWorker restart is required.

The alarm clears when the administrative and operational states are aligned.

2. If the NGN-PS feature is mistakenly activated, disable the feature to clear the alarm and align the operational state with the administrative state.

3. If the NGN-PS feature is mistakenly de-activated, enable the feature to clear the alarm and align the operational state with the administrative state.



8014 - MpNgnPsDrop

Alarm Group: DIAM

Description: DraWorker NGN-PS message discarded or rejected.

Severity: Major

HA Score: Normal

OID: eagleXgDiameterMpNgnPsDrop

Cause: Each layer involved in processing an NGN-PS transaction may reject or discard a request or answer. Such scenarios include:

• Routing or application controls.

• Peer or network congestion.

• Internal processing error.

• Task queue or resource congestion or ComAgent congestion or delivery failure.

• Processing error.



• Savelogs of all MPs.

• Peer CNDRA logs of all MPs.

1. Recovery


• Routing or application controls are configured incorrectly.

• Peer or network is in congestion.

• Engineering of internal resources is insufficient.



8015 - NgnPsMsgMisrouted

Alarm Group: DIAM

Description: NGN-PS message routed to peer CNDRA lacking NGN-PS support.

Severity: Major

HA Score: Normal

Auto Clear Seconds: 30

OID: eagleXgDiameterNgnPsMsgMisrouted

Cause: An NGN-PS message was routed to a peer CNDRA lacking NGN-PS support and will not be processed as intended.

Diagnostic Information: Collect the following before contacting Oracle Support:

• Event history on active SO server.

• Software release information of the DraWorkers on the DraWorker server.

1. Recovery


• Routing configuration is incorrect.

• Peer CNDRA has not yet been upgraded.

• Peer CNDRA has not yet operationally enabled NGN-PS.

8016 - MpP16StateMismatch

Alarm Group: DIAM

Description: MP P16 Support administrative and operational state mismatch.

Severity: Major

Instance: <MP Name>

HA Score: Normal

OID: eagleXgDiameterMpP16StateMismatch

Cause: The administrative state of P16 support is not aligned with the operational state.

Diagnostic Information: Collect the following before contacting Oracle Support:

• Screenshot of active SO server.


1. Recovery

1. Potential causes of this alarm are:

• An MP restart is required.

• If the 16 Priority Support is mistakenly activated, disable the feature to clear the alarm and align the operational state with the administrative state.

• If the 16 Priority Support is mistakenly de-activated, enable the feature to clear the alarm and align the operational state with the administrative state.

2. Alarm clears when the administrative and operational states are aligned.

8017 - MpTaskCpuCongested

Alarm Group: DIAM

Description: DraWorker Task CPU utilization threshold crossed.

Severity: Minor, Major, Critical

Instance: Task Name

HA Score: Normal

OID: eagleXgDiameterMpTaskCpuCongested

1. Recovery


• One or more peers are generating more traffic than is nominally expected.

• Configuration requires more CPU for message processing than is nominally expected.

8018 - P16MsgMisrouted

Alarm Group: DIAM

Description: 16 priority message routed to peer CNDRA lacking 16 priority support.

Severity: Major

Instance: <Connection Name>

HA Score: Normal

OID: eagleXgDiameterP16MsgMisrouted

1. Recovery


• Peer CNDRA has not yet been upgraded.

• Peer CNDRA has not yet operationally enabled 16 priority support.

8019 - MpAnswerPriorityModeMismatch

Alarm Group: DIAM

Description: DraWorker Answer Priority Mode administrative and operational state mismatch.

Severity: Major

Instance: <DraWorker Name>

HA Score: Normal

OID: eagleXgDiameterMpAnswerPriorityModeMismatch

1. Recovery


• A DraWorker restart is required.

8020 - MpRoutingThreadPoolStateMismatch

Alarm Group: DIAM

Description: Routing Thread Pool administrative and operational state mismatch.

Severity: Minor

Instance: <DraWorker Name>

HA Score: Normal

Auto Clear Seconds: 0 (zero)

OID: eagleXgDiameterMpRoutingThreadPoolStateMismatch

1. Recovery

• This alarm is potentially caused when a DraWorker restart is required.

The alarm clears when administrative and operational states are aligned.

8100 - NormMsgMisrouted

Alarm Group: DIAG

Description: Normal message routed onto diagnostic connection.

Severity: Major

HA Score: Normal

Auto Clear Seconds: 30 (after last occurrence)

OID: eagleXgDiameterNormMsgMisrouted

1. Recovery:

1. The alarm is potentially caused by a diameter routing misconfiguration.


8101 - DiagMsgMisrouted

Alarm Group: DIAG

Description: Diagnostic message routed onto normal connection.

Severity: Minor

HA Score: Normal

Auto Clear Seconds: 30 (after last occurrence)

OID: eagleXgDiameterDiagMsgMisrouted

1. Recovery:

1. The alarm is potentially caused by a diameter routing misconfiguration.


22001 - Message Decoding Failure

Event Type: DIAM

Description: A message received from a peer was rejected because of a decoding failure. Decoding failures can include missing mandatory parameters.

Severity: Info

Instance: <TransConnName>

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterIngressMsgRejectedDecodingFailureNotify

1. Recovery:

• During Diameter Request decoding, the message content was inconsistent with the "Message Length" in the message header. This protocol violation can be caused by the originator of the message (identified by the Origin-Host AVP in the message) or the peer who forwarded the message to this node.
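The length inconsistency described above can be illustrated with a short Python sketch. It assumes the standard Diameter header layout from RFC 6733 (a 20-octet header whose octets 1 to 3 carry the 24-bit Message Length); the function names are hypothetical and this is not the product's decoder.

```python
DIAMETER_HEADER_LEN = 20  # fixed Diameter header size in octets (RFC 6733)


def declared_length(message: bytes) -> int:
    """Return the 24-bit Message Length field from the Diameter header."""
    if len(message) < DIAMETER_HEADER_LEN:
        raise ValueError("message shorter than a Diameter header")
    # Octet 0 is the version; octets 1..3 hold the Message Length.
    return int.from_bytes(message[1:4], "big")


def length_is_consistent(message: bytes) -> bool:
    """Return True if the declared length equals the octets received."""
    return declared_length(message) == len(message)
```

A capture of the rejected message can be checked this way to confirm whether the header or the payload was damaged before it reached this node.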

22002 - Peer Routing Rules with Same Priority

Event Type: DIAM

Description: A peer routing table search with a received Request message found more than one highest priority Peer Routing Rule match. The system selected the first rule found but it is not guaranteed that the same rule will be selected in the future. It is recommended that Peer Routing Rules be unique for the same type of messages to avoid non-deterministic routing results.

Severity: Info

Instance: <MPName>

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterPeerRoutingTableRulesSamePriorityNotify

1. Recovery:

• Modify one of the Peer Routing Rule Priorities.
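The tie condition behind this event can be sketched in Python. This is an illustration only: the `MatchedRule` type and `tied_top_rules` function are hypothetical, and the sketch assumes that a lower number means a higher priority, which this guide does not state.

```python
from typing import List, NamedTuple


class MatchedRule(NamedTuple):
    """A rule that matched a request (hypothetical shape for this sketch)."""
    name: str
    priority: int  # assumption: lower number means higher priority


def tied_top_rules(matched: List[MatchedRule]) -> List[str]:
    """Return names of rules tied at the best priority, or [] if no tie."""
    if not matched:
        return []
    best = min(rule.priority for rule in matched)
    tied = [rule.name for rule in matched if rule.priority == best]
    return tied if len(tied) > 1 else []
```

Any rule names returned by such a check are candidates for the priority change described in the recovery step.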

22003 - Application ID Mismatch with Peer

Event Type: DIAM

Description: While attempting to route a request message to a peer, a peer's transport connection was bypassed because the peer did not support the Application ID for that transport connection.

Severity: Info

Instance: <MPName>

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterApplicationIdMismatchWithPeerNotify

1. Recovery:

1. The system's peer routing table may be using a Route List containing a peer which does not support the Application ID or the list of Application IDs supported by the peer on each connection may not be the same. View the list of Application IDs that the peer supports on each connection and if the Application IDs are not the same for each connection (but should be), the Application ID for any connection can be refreshed by disabling or enabling the connection.

2. The Diameter Node which originated the message (identified by the Origin-Host AVP) could be configured incorrectly and the application is trying to address a node which doesn't support the Application ID. This cannot be fixed using this application.

3. If the problem persists, it is recommended to contact My Oracle Support for assistance.

22004 - Maximum pending transactions allowed exceeded

Event Type: DIAM

Description: Routing attempted to select an egress transport connection to forward a message but the maximum number of allowed pending transactions queued on the connection has been reached.

Severity: Info

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterMaxPendingTxnsPerConnExceededNotify

1. Recovery:

• The maximum number of pending transactions for each connection is set to a system-wide default value. If this event is occurring frequently enough for a particular connection then the maximum value may need to be increased. It is recommended to contact My Oracle Support for assistance.

22005 - No peer routing rule found

Event Type: DIAM

Description: A message not addressed to a peer (either Destination-Host AVP was absent or Destination-Host AVP was present but was not a peer's FQDN) could not be routed because no Peer Routing Rules matched the message.

Severity: Info

Instance: <MPName>

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterNoPrtRuleNotify

Cause: Ingress-request message from a downstream peer is rejected by a Local Node when no peer-routing rules are found in the Peer Routing Table (PRT) and one of the following is true:

• The ingress-request message did not contain a Destination-Host AVP or

• The ingress-request message contained a Destination-Host AVP but did not match with any configured peer node's FQDN or

• Destination-Realm AVP value and the Application-ID in the request message header did not match with configured Realm/Application-Id in Realm Route Table

The Realm Route Table (table RealmRoute) managed object is used to perform message routing based upon the Destination-Realm and Application-ID in a request message. The Realm Route Table is dynamically configured on the active Overseer.

Diagnostic Information: Analyze the event history for event 22005, which contains the following information about the failed Diameter message:

• <TransConnName> (Receiving connection)


• <PeerName> (Name of the receiving peer)

• <DestRealm> (Value found in Request message Destination-Realm AVP)

• <ApplicationID> (Application ID in the Request message)

• <DestHostFQDN> (FQDN found in request message Destination-Host AVP, if present)

• <OriginHostFQDN> (FQDN found in request message Origin-Host AVP)

The Diameter Ingress Transaction Exception group measurement report contains the RxNoRulesFailure (10034) measurement, which is also pegged in the same scenario.

1. Recovery:

1. Either the message was incorrectly routed to this node or additional Peer Routing Rules need to be added. View and update the Peer Routing Rules.

2. If multiple peer routing tables are used, ensure the correct table is applied for the message in question.


22007 - Inconsistent Application ID Lists from a Peer

Event Type: DIAM

Description: The list of Application IDs supported by a peer during the Diameter Capabilities Exchange procedure on a particular transport connection is not identical to one of the lists of Application IDs received from the peer over a different available transport connection to that peer.

Severity: Info

Instance: <PeerName>

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterSupportedAppIdsInconsistentNotify

1. Recovery:

1. A peer with multiple transport connections has established a connection and provided a list of supported Application IDs which does not match a previously established connection. This could prevent Request messages from being routed uniformly over the peer's transport connections because the decision to route a message containing an Application ID is based upon the list of Application IDs supported on each transport connection. View the list of Application IDs that the peer supports on each connection and if the Application IDs are not the same for each connection (but should be), the Application ID for any connection can be refreshed by disabling or enabling the connection.
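The comparison behind this event can be sketched in Python. This is a minimal illustration with hypothetical names: it assumes the Application IDs advertised on each connection have already been collected into a dictionary keyed by connection name.

```python
from typing import Dict, Set


def app_id_lists_inconsistent(app_ids_by_conn: Dict[str, Set[int]]) -> bool:
    """Return True if the peer's connections advertise differing ID sets.

    app_ids_by_conn maps a connection name to the Application IDs the peer
    advertised on that connection (hypothetical input for this sketch).
    """
    distinct = {frozenset(ids) for ids in app_ids_by_conn.values()}
    return len(distinct) > 1
```

Order does not matter in this comparison; only the set of Application IDs on each connection is compared.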


22008 - Orphan Answer Response Received

Event Type: DIAM

Description: An answer response was received for which no pending request transaction existed, resulting in the answer message being discarded. When a Request message is forwarded the system saves a pending transaction, which contains the routing information for the answer response. The pending transaction is abandoned if an answer response is not received in a timely fashion.

Severity: Info

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterOrphanAnswerResponseReceivedNotify

Cause: An answer message is received without any corresponding pending transaction. The message is discarded.

Diagnostic Information: Reasons the pending transaction is not available include:

• Peer CNDRA's Tx sender buffer is filling up causing connection congestion.

• PAT expiry or total transaction life-time expiry is causing transaction timeout.

The associated measurement tag for this event is RxAnswerUnexpected (10008), which is the number of times that the DRL receives an answer message event from DCL/RCL with a valid Connection ID for which a pending transaction cannot be found.

1. Recovery:

• If this event is occurring frequently, the transaction timers may be set too low.
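The relationship between transaction timers and orphan answers can be sketched in Python. This is a simplified model, not the product's implementation: the class, its methods, and the use of a single lifetime value keyed by Hop-by-Hop identifier are assumptions made for illustration.

```python
from typing import Dict


class PendingTransactions:
    """Toy pending-transaction table keyed by Hop-by-Hop identifier."""

    def __init__(self, lifetime_s: float) -> None:
        self.lifetime_s = lifetime_s
        self._deadlines: Dict[int, float] = {}

    def request_forwarded(self, hop_by_hop_id: int, now: float) -> None:
        """Save a pending transaction when a request is forwarded."""
        self._deadlines[hop_by_hop_id] = now + self.lifetime_s

    def expire(self, now: float) -> None:
        """Abandon transactions whose lifetime has elapsed."""
        expired = [k for k, d in self._deadlines.items() if d <= now]
        for key in expired:
            del self._deadlines[key]

    def answer_received(self, hop_by_hop_id: int, now: float) -> bool:
        """Return True if the answer matched, False if it is an orphan."""
        self.expire(now)
        return self._deadlines.pop(hop_by_hop_id, None) is not None
```

In this model, an answer that arrives after the lifetime has elapsed finds no entry and is treated as an orphan, which is why a timer set lower than the peer's typical response time produces this event.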

22009 - Application Routing Rules with Same Priority

Event Type: DIAM

Description: An application routing table search with a received Request message found more than one highest priority application routing rule match. At least two application routing rules with the same priority matched an ingress Request message. The system selected the first application routing rule found.

Severity: Info

Instance: <MPName>

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterApplicationRoutingTableRulesSamePriorityNotify

1. Recovery:

1. It is recommended that application routing rules be unique for the same type of messages to avoid unexpected routing results.


22010 - Specified DAS Route List not provisioned

Event Type: DIAM

Description: The DAS Route List specified by the message copy trigger point is not provisioned.

Severity: Info

Instance: <RouteListId>

HA Score: Normal

Throttle Seconds: 10

Note:

Because many route lists can be created on a DraWorker server, care must be taken to prevent excessive event generation with these resources.

OID: eagleXgDiameterSpecifiedDasRouteListNotProvisionedNotify

1. Recovery:

1. Provisioning is incorrect or misconfigured. Verify the provisioning and correct it as needed.

2. If this problem persists, it is recommended to contact My Oracle Support for assistance.

22014 - No DAS Route List specified

Alarm Group: DIAM

Description: No valid DAS Route List was specified in the Message Copy Config Set.

Severity: Info

Instance: <RouteListId>

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterNoDasRouteListSpecifiedNotify

1. Recovery:

• It is recommended to contact My Oracle Support for further assistance.

22012 - Specified MCCS not provisioned

Event Type: DIAM

Description: The Message Copy Config Set specified by the trigger point is not provisioned.

Severity: Info

Instance: <MCCS>

HA Score: Normal

Throttle Seconds: 10

OID: eagleXgDiameterSpecifiedMCCSNotProvisionedNotify

1. Recovery:

1. Verify the configured value of MCCS with the trigger point.

2. Verify the Message Copy CfgSet (MCCS) provisioning is properly configured.


22016 - Peer Node Alarm Aggregation Threshold

Alarm Group: DIAM

Description: This alarm occurs when there are a critical number of peer node alarms for a single network element and it exceeds the configurable alarm threshold.

Note:

The alarm thresholds are configurable using the Alarm Threshold Options tab on Diameter, and then Configuration, and then System Options.

When this alarm is generated, the system clears all individual peer node alarms (alarm 22051) for the peer node.

Severity: Critical

Instance: <NetworkElement>

HA Score: Normal

OID: eagleXgDiameterPeerNodeUnavailableThresholdReachedNotify

Cause: The number of critical peer node alarms for a single network element exceeds the configurable alarm threshold.

Diagnostic Information: Refer to Alarm 22051 - Peer Unavailable. When this alarm is reported, the system clears all the individual peer node alarms (alarm 22051) for the peer node.

1. Recovery:



1. Check the peer status.

2. Verify IP network connectivity exists between the MP server and the peer node.

3. Check the event history logs for additional DIAM events or alarms from this MP server.

4. Verify the peer is not under maintenance.

5. It is recommended to contact My Oracle Support for assistance.

22017 - Route List Alarm Aggregation Threshold

Alarm Group: DIAM

Description: This alarm occurs when there are a 'Critical' number of Route List alarms for the Network Element.

Severity: Critical

Instance: <NetworkElement>

HA Score: Normal

OID: eagleXgDiameterRouteListUnavailableThresholdReachedNotify

Cause: Alarm 22017 is raised when the total number of Route List alarms for a single NE has reached the configured Route List Failure Critical Aggregation Alarm Threshold.

The alarm is cleared when the total number of Route List alarms for a single NE has dropped to at least 20% below the configured Route List Failure Critical Aggregation Alarm Threshold.
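The raise and clear behavior described in the Cause text is a form of hysteresis and can be sketched in Python. The function below is a hypothetical illustration of that rule, not the product's logic; it uses integer arithmetic so that "20% below the threshold" is evaluated exactly.

```python
def next_alarm_state(active: bool, alarm_count: int, threshold: int) -> bool:
    """Return the aggregation alarm state after observing alarm_count.

    Raise when the count reaches the threshold; once raised, stay raised
    until the count has dropped to at least 20% below the threshold.
    """
    if not active:
        return alarm_count >= threshold
    # Still active while the count is above 80% of the threshold.
    return alarm_count * 100 > threshold * 80
```

For example, with a threshold of 10 the alarm is raised at 10 Route List alarms, remains raised at 9, and clears at 8 or fewer.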

Diagnostic Information: For further information on this alarm:

1. Examine the alarm log on Active Overseer Server.

2. Find all the route lists with a problem for the specific MP.

3. A Route List's operational status is always set to the operational status of the Route Group within the Route List that is designated as the Active Route Group.

4. If all Route Groups within the route list are Unavailable, then the Route List is Unavailable and there is no Active Route Group.

1. Recovery:

1. View the Route List to monitor Route List status.

2. Verify that IP network connectivity exists between the MP server and the peers.




4. Verify that the peers in the Route List are not under maintenance.


22013 - DAS Peer Number of Retransmits Exceeded for Copy

Event Type: DIAM

Description: The configured number of Message Copy retransmits has been exceeded for the DAS Peer.

Severity: Info

Instance: <MCCS>

HA Score: Normal

Throttle Seconds: 10

Note:

Because many route lists can be created on a DraWorker server, care must be taken to prevent excessive event generation with these resources.

OID: eagleXgDiameterNumberOfRetransmitsExceededToDasNotify

1. Recovery:

1. Verify the configured value of 'Max Retransmission Attempts'.

2. Verify that the locally provisioned connections to the intended DAS peer server(s) are in service and that no network issues exist in the path(s) to the intended DAS peer server(s).

3. Verify DAS peer provisioning to ensure proper configuration.

4. If the problem persists, it is recommended to contact My Oracle Support for assistance.

22018 - Maintenance Leader HA Notification to go Active

Alarm Group: DIAM

Description: This alarm occurs when a DraWorker has received a notification from HA that the Maintenance Leader resource should transition to the Active role.

Severity: Info

Instance: <MP Node ID>

HA Score: Normal

Throttle Seconds: 1

OID: eagleXgDiameterDaMpLeaderGoActiveNotificationNotify

1. Recovery:

• No action necessary.

22019 - Maintenance Leader HA Notification to go OOS

Alarm Group:DIAM

Description:This alarm occurs when a DraWorker has received a notification from HA that the Maintenance Leader resource should transition to the OOS role.

Instance:<MP Node ID>

Severity:Info

HA Score:Normal

Throttle Seconds:1

OID:eagleXgDiameterDaMpLeaderGoOOSNotificationNotify

1. Recovery:


22020 - Copy Message size exceeded the system configured size limit

Event Type:DIAM



Description:The generated Copy message size exceeded the max message size on the system.

Severity:Info

Instance:<DraWorker>

HA Score:Normal

Throttle Seconds:10

Note:

Because many copy messages can exceed the system configured size, care must be taken to prevent excessive generation with these resources.

OID:eagleXgDiameterCopyMessageSizeExceededNotify

1. Recovery:

1. Verify the size of the Request and Answer messages and check whether it exceeds the system configured message size.

2. Review and correct the provisioning, and determine whether Answers also need to be copied.

Requests and answers may be copied to DAS.

3. If this problem persists, it is recommended to contact My Oracle Support for assistance.

22021 - Debug Routing Info AVP Enabled

Alarm Group:DIAM

Description:Debug Routing Info AVP is enabled.

Severity:Minor

Instance:None

HA Score:Normal




OID:eagleXgDiameterDebugRoutingInfoAvpEnabledNotify

1. Recovery:

1. Change the IncludeRoutingInfoAvp parameter to no in the DpiOption table on the NO for a 2-tier system or on the SO for a 3-tier system.


22022 - Forwarding Loop Detected

Alarm Group:DIAM

Description:Ingress Request message received was previously processed by the local node as determined from the Route-Record AVPs received in the message.

Severity:Major

Instance:<Peer Name>

HA Score:Normal


OID:eagleXgDiameterForwardingLoopDetectedNotify

1. Recovery:

1. An ingress request message was rejected because message looping was detected. In general, the forwarding node should not send a message to a peer that has already processed the message (it should examine the Route-Record AVPs before message forwarding). If this type of error is occurring frequently, then the forwarding node is most likely mis-routing the message. This should not be related to a configuration error because the identity of the local node is sent to the peer during the Diameter Capabilities Exchange procedure when the Connection comes into service.

2. If Path Topology Hiding is activated and Protected Network Node's Route-Records are obscured with PseudoNodeFQDN, then inter-network ingress message loop detection could reject the message if same Request message is routed back to DEA. If this type of error is occurring then the forwarding node is most likely mis-routing the message back to DEA.
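
The Route-Record check described in these steps can be illustrated with a minimal sketch. The function name and the case-insensitive comparison are assumptions made for illustration; this is not the product's implementation.

```python
from __future__ import annotations


def is_forwarding_loop(route_records: list[str], local_identity: str) -> bool:
    """Return True if the local node already appears in the Route-Record AVPs.

    Hypothetical helper: each Route-Record value is treated as a Diameter
    identity (an FQDN) and compared without regard to case.
    """
    local = local_identity.lower()
    return any(record.lower() == local for record in route_records)


if __name__ == "__main__":
    records = ["relay1.example.com", "dra.example.com"]
    print(is_forwarding_loop(records, "DRA.example.com"))
```

A forwarding node could apply the same test against the identity of the intended next hop before sending the Request.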


22051 - Peer Unavailable

Alarm Group:DIAM



Description:Unable to access the Diameter Peer because all of the transport connections are down. Peer node unavailability can happen in these cases:

• All connections toward a peer are no longer candidates for routing Request messages.

• No available connections within the peer node support the Application ID. This is functionally equivalent to the peer node being unavailable.

• The Connection Priority Level (CPL) value for a resource is changed to 99, which means the operational status is Unavailable. The CPL value of a connection can be found in the active SO.

• The number of established connections drops below the configured Minimum Connection Capacity.

Severity:Critical

Instance:<PeerName> (of the Peer which failed).

HA Score:Normal


OID:eagleXgDiameterPeerUnavailableNotify

Cause:Alarm #22051 is raised when the Diameter Peer is not accessible because all of the transport connections are down.

Diagnostic Information:Peer node is unavailable in the following cases:

• All connections toward a peer are no longer candidates for routing Request messages.

• No available connections within the peer node support the Application ID. This is functionally equivalent to the peer node being unavailable.

• The Connection Priority Level (CPL) value for a resource is changed to 99, which means the operational status is Unavailable. The CPL value of a connection can be found in the active SO.

• The number of established connections drops below the configured Minimum Connection Capacity.

1. Recovery:

1. Confirm a connection is provisioned for the peer node.

• Verify IP network connectivity exists between the MP server and the peer nodes using ping, traceroute, or other means.

• Examine the event history logs for additional DIAM events or alarms from the MP server.



• Verify the peer is not under maintenance.

• Verify there are connections provisioned for the peer node.

• Verify the status of all connections toward the peer node. View the Transaction Configuration Set of the peer node.

If the peer node has a corresponding Transaction Configuration Set setting, then confirm the Application ID is supported.

2. Confirm the peer node supports the Application ID in the request message.

3. Resolve any congestion issues on the peer node.


22052 - Peer Degraded

Alarm Group:DIAM

Description:The peer has some available connections, but less than its minimum connection capacity. Continued routing to this peer may cause congestion or other overload conditions.

Severity:Major

Instance:<PeerName> (of the Peer which is degraded)

HA Score:Normal


OID:eagleXgDiameterPeerDegradedNotify

Cause:

• If the number of available connections to the peer node is less than the minimum connection capacity (default 1 per Peer Node), then the Peer Node status is Degraded and alarm 22052 is raised.

• If all the connections are degraded for the peer node, then the Peer Node status is Degraded and alarm 22052 is raised.

Diagnostic Information:

• Verify that the number of available connections to the peer is greater than the minimum connection capacity (default 1).

• Peer CNDRA configurations on active SO

• Savelogs on active SO

• Event History on active SO

1. Recovery:



1. Check the Peer status.

2. Verify IP network connectivity exists between the MP server and the adjacent servers.



5. Make sure the number of available connections to that peer node is greater than the configured minimum connection capacity.


22053 - Route List Unavailable

Alarm Group:DIAM

Description:All route groups within the route list are unavailable. A Route List becomes unavailable when all of its peers become unavailable and a peer becomes unavailable when all of its transport connections become unavailable.

If a Transport Connection is configured for Initiate mode, the network element periodically attempts to recover the connection automatically if its Admin State is enabled. If the Transport Connection is configured for Responder-Only mode, the peer is responsible for re-establishing the transport connection.

Examine the Event history and software release information for the route groups.

Severity:Critical

Instance:<RouteListName> (of the Route List which failed)

HA Score:Normal


OID:eagleXgDiameterRouteListUnavailableNotify

Cause:All route groups within the route list are unavailable. Check the Route list status.

Diagnostic Information:Examine the following for the route groups:

• Event history

• Software release information

1. Recovery:

1. Check the Route List status.

2. Verify IP network connectivity exists between the MP server and the peers.




4. Verify the peers in the route list are not under maintenance.


22054 - Route List Degraded

Alarm Group:DIAM

Description:The Route List's Operational Status has changed to degraded because the capacity of the Route List's active route group has dropped below the Route List's configured minimum capacity. There are two potential causes:

1. One or more of the Route List's peers become Unavailable. A peer becomes unavailable when all of its transport connections become unavailable. If a transport connection is configured for Initiate mode, the network element periodically attempts to recover the connection if its admin state is enabled. If the transport connection is configured for responder-only mode, the peer is responsible for re-establishing the transport connection.

2. The Route Groups within the Route List may not have been configured with sufficient capacity to meet the Route List's configured minimum capacity.

Severity:Major

Instance:<RouteListName> (of the Route List which is degraded)

HA Score:Normal


OID:eagleXgDiameterRouteListDegradedNotify

Cause:There are no available Route Groups, and the Operational Status of one or more Route Groups within the Route List is degraded.

Diagnostic Information:A Route List's operational status is always set to the operational status of the Route Group within the Route List that is designated as the Active Route Group.

DRL determines which Route Group within a Route List is designated the Active Route Group for that Route List as follows:

• If the operational status of one or more Route Groups within the Route List is Available, then the Active Route Group for the Route List is the Available Route Group with the highest priority



• If there are no Available Route Groups, and the operational status of one or more Route Groups within the Route List is Degraded, the Active Route Group is the Degraded Route Group with the highest Current Capacity. If two or more degraded Route Groups exist with equal Current Capacity, then the Active Route Group is the one with the highest Priority

• If all Route Groups within the route list are Unavailable, then the Route List is Unavailable and there is no Active Route Group
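
The selection rules above can be sketched as follows. The RouteGroup type and its field names are hypothetical, and the sketch assumes that a lower priority number means a higher priority; verify that convention against your configuration before relying on it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RouteGroup:
    name: str
    status: str            # "Available", "Degraded", or "Unavailable"
    priority: int          # assumption: lower number = higher priority
    current_capacity: int


def select_active_route_group(groups: Sequence[RouteGroup]) -> Optional[RouteGroup]:
    """Apply the three rules in order; None means the Route List is Unavailable."""
    available = [g for g in groups if g.status == "Available"]
    if available:
        # Rule 1: the Available Route Group with the highest priority.
        return min(available, key=lambda g: g.priority)
    degraded = [g for g in groups if g.status == "Degraded"]
    if degraded:
        # Rule 2: highest Current Capacity, ties broken by highest priority.
        return min(degraded, key=lambda g: (-g.current_capacity, g.priority))
    # Rule 3: every Route Group is Unavailable, so there is no Active Route Group.
    return None


if __name__ == "__main__":
    groups = [
        RouteGroup("rg-primary", "Degraded", 1, 40),
        RouteGroup("rg-backup", "Degraded", 2, 60),
    ]
    active = select_active_route_group(groups)
    print(active.name if active else "Route List Unavailable")
```

The Route List's operational status then follows the status of the returned Route Group.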

1. Recovery:

1. Verify Route List status and configured minimum capacity.

2. Verify IP network connectivity exists between the MP server and the peers.


4. Verify the peers in the Route List are not under maintenance.


22055 - Non-Preferred Route Group in Use

Alarm Group:DIAM

Description:The application has started to utilize a Route Group other than the highest priority Route Group to route Request messages for a Route List because the highest priority Route Group specified for that Route List has either become Unavailable or its capacity has dropped below the minimum capacity configured for the Route List while a lower priority Route Group has more capacity.

The preferred Route Group (i.e., with highest priority) is demoted from the Active Route Group to a Standby Route Group when a peer failure occurs causing the Route Group's Operational Status to change to Unavailable or Degraded. A Route Group becomes Degraded when its capacity has dropped below Route List's configured minimum capacity. A Route Group becomes Unavailable when all of its peers have an Operational Status of Unavailable or Degraded.

A Peer becomes Unavailable when all of its transport connections become Unavailable. If a Transport Connection is configured for Initiate mode, the Network Element will periodically attempt to automatically recover the connection if its Admin State is Enabled. If the Transport Connection is configured for Responder-Only mode, the peer will be responsible for re-establishing the transport connection.

Severity:Minor

Instance:<RouteListName> (of the concerned Route List)

HA Score:Normal




OID:eagleXgDiameterNonPreferredRouteGroupInUseNotify

1. Recovery:

1. Check the Route List status and configured minimum capacity.

2. Verify that IP network connectivity exists between the MP server and the peers.


4. Verify that the adjacent server is not under maintenance.


22056 - Connection Admin State Inconsistency Exists

Alarm Group:DIAM

Description:An operator request to change the Admin State of a transport connection was not completely processed due to an internal error. The admin state is either disabled from an egress routing perspective but the connection could not be taken out of service or the admin state is enabled from an egress routing perspective but the connection is not in service.

Severity:Major


HA Score:Normal


OID:eagleXgDiameterConnAdminStateInconsistencyNotify

1. Recovery:

1. If the transport connection's Admin State is Disabled but the transport connection was not taken out of service due to an internal error, do the following actions to correct the failure:

a. Enable the connection.

b. Wait for this alarm to clear.

c. Disable the connection.

2. If the transport connection's Admin State is Enabled but the transport connection is not in service due to an internal error, do the following actions to correct the failure:

a. Disable the connection.

b. Wait for this alarm to clear.



c. Enable the connection.


22062 - Actual Host Name cannot be determined for Topology Hiding

Event Group:Diameter

Description:Topology Hiding could not be applied because the Actual Host Name could not be determined.

Severity:Info

Instance:<CfgSetName>

HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterTopoHidingActualHostNameNotFoundNotify

1. Recovery:

1. Ensure that all MME/SGSN hostnames to be hidden are present in the MME/SGSN Configuration Set.

2. If any Peer CNDRA Applications are activated on Peer CNDRA, ensure that any specific Application Level Topology Hiding feature is not conflicting with the contents of Actual Host Names specified in the MME Configuration Set.

3. Check if the first instance of a Session-ID AVP in the Request/Answer message contains the mandatory delimiter ";".
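
The delimiter check in the last step can be sketched as below. The helper is hypothetical and assumes that the host name is the text before the first ";" of the Session-ID value; it does not validate the rest of the value.

```python
from typing import Optional


def host_from_session_id(session_id: str) -> Optional[str]:
    """Return the text before the first ';', or None if it cannot be determined."""
    host, delimiter, _rest = session_id.partition(";")
    if not delimiter or not host:
        # No ';' present, or nothing before it: the host name is unknown.
        return None
    return host


if __name__ == "__main__":
    print(host_from_session_id("mme01.example.com;1876543210;523"))
    print(host_from_session_id("mme01.example.com"))
```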


22063 - Diameter Max Message Size Limit Exceeded

Event Type:DIAM

Description:The size of the message encoded by Peer CNDRA has exceeded its max limits.

Severity:Info




HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterDiameterMaxMsgSizeLimitExceededNotify

1. Recovery:

• No action required. However, if this event is seen to be incrementing consistently, it is recommended to contact My Oracle Support for assistance.

22064 - Upon receiving Redirect Host Notification the Request has not been submitted for re-routing

Event Type:DIAM

Description:This event indicates that the Peer CNDRA has encountered a Redirect Host Notification that it can accept for processing but cannot continue processing for some reason, such as internal resource exhaustion.

Severity:Info

Instance:<PeerName>

HA Score:Normal

Throttle Seconds:60

OID:eagleXgDiameterRxRedirectHostNotRoutedNotify

1. Recovery:

1. Examine the DraWorker congestion status and related measurements and take appropriate action.


22065 - Upon receiving Redirect Realm Notification the Request has not been submitted for re-routing

Event Type:DIAM



Description:The Redirect Realm Notification received is accepted but cannot be processed for some reason, such as internal resource exhaustion.

Severity:Info

Instance:<PeerName>

HA Score:Normal

Throttle Seconds:60

OID:eagleXgDiameterRxRedirectRealmNotRoutedNotify

1. Recovery:

1. Examine the DraWorker congestion status and related measurements and take appropriate action.


22071 - TtgEvLossChg

22071 - 001 - TtgEvLossChg: TTG Loss Percent Changed

Event Type:DIAM

Description:TTG's Loss Percentage was modified.

Severity:Info

Instance:<TTG Name>:001

HA Score:Normal

Throttle Seconds:0 (zero)

OID:eagleXgDiameterTtpEvDoicExceptionNotify

1. Recovery:




22075 - Message is not routed to Application

Alarm Group:DIAM

Description:ART Rule-X was selected, but message was not routed because Peer CNDRA Application is disabled or not available.

Severity:Major

Instance:<Peer CNDRA Application Name>

HA Score:Normal


OID:eagleXgDiameterArtMatchAppUnavailableNotify

1. Recovery:

1. Check the Application Status, and enable the application if the Admin State of the Peer CNDRA application is Disabled for the DraWorker(s) that raised the alarm.

2. If the Application is Enabled for a particular DraWorker, but the Operational Status is Unavailable or Degraded, then refer to the Operational Reason and rectify it accordingly.


22077 - Excessive Request Reroute Threshold Exceeded

Alarm Group:DIAM

Description:Request reroutes due to Answer response and/or Answer timeout have exceeded the configured onset threshold percentage on the DraWorker server.

Severity:Major

Instance:MpReroutePercent

HA Score:Normal



Auto Clear Seconds:N/A

Note:

The alarm clears when the percentage of Request reroutes due to Answer Result-code matching "Reroute on Answer" and Answer Timeout drops below the configured abatement threshold and remains there for the configured abatement time. The alarm also clears when the Peer CNDRA process is stopped or restarted.

OID:eagleXgDiameterMpExcessiveRequestRerouteNotify

1. Recovery:

1. This alarm is an indication of reroutes exceeding the configured threshold, due to responses from the Peer Node exceeding the Pending Answer timer in Peer CNDRA or due to configured "Reroute on Answer" Result codes.

2. If rerouting is triggered due to Answer Result-code:

a. Use measurement TxRerouteAnswerResponse to identify any peer (or set of peers) that is triggering reroutes.

b. If a peer (or set of peers) is identified, validate that Reroute-on-Answer is properly configured for that peer.

c. Check for congestion being reported by the peer.

3. If rerouting is triggered due to Answer Timeout:

a. Use measurement TxRerouteAnswerTimeout to identify any peer (or set of peers) that is timing out.

b. If a peer (or set of peers) is identified, verify that Pending Answer Timer and Transaction Lifetime are properly configured.

c. Check for congestion being reported by the peer.
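
The onset and abatement behavior described in the note for alarm 22077 can be illustrated with a small state machine. The class name, parameter names, and the use of caller-supplied timestamps are assumptions made for this sketch; the product's internal sampling is not described in this guide.

```python
from typing import Optional


class RerouteAlarm:
    """Hypothetical model of an alarm with onset and abatement thresholds."""

    def __init__(self, onset_pct: float, abate_pct: float, abate_seconds: float) -> None:
        self.onset_pct = onset_pct
        self.abate_pct = abate_pct
        self.abate_seconds = abate_seconds
        self.active = False
        self._below_since: Optional[float] = None

    def update(self, reroute_pct: float, now: float) -> bool:
        """Feed one sample (percentage and timestamp in seconds); return alarm state."""
        if not self.active:
            if reroute_pct > self.onset_pct:
                self.active = True
                self._below_since = None
        elif reroute_pct < self.abate_pct:
            # Start (or continue) timing how long the rate has stayed low.
            if self._below_since is None:
                self._below_since = now
            if now - self._below_since >= self.abate_seconds:
                self.active = False
                self._below_since = None
        else:
            # Rate rose back above the abatement threshold: restart the timer.
            self._below_since = None
        return self.active


if __name__ == "__main__":
    alarm = RerouteAlarm(onset_pct=10.0, abate_pct=5.0, abate_seconds=30.0)
    for pct, t in [(12.0, 0.0), (4.0, 10.0), (4.0, 39.0), (4.0, 40.0)]:
        print(t, alarm.update(pct, now=t))
```

Using two thresholds plus a hold time keeps the alarm from toggling when the reroute percentage hovers near a single limit.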


22078 - Loop or Maximum Depth Exceeded in ART or PRT Search

Alarm Group:DIAM

Description:An ART/PRT search has resulted in either a loop between ART/PRT tables, or the search depth has exceeded the maximum allowed depth.

Severity:Info

Instance:<MPName>



HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterNestedArtPrtSearchErrorNotify

1. Recovery:

1. If the error was a search loop, the customer should change at least one of the rules in the search sequence to avoid a loop. If the error was a maximum depth exceeded, the customer should remove one or more rules in the search sequence.
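
To review a rule set before changing it, the two failure modes can be reproduced offline with a sketch like the one below. The model is an assumption: each table is reduced to a single "next table" link held in a dictionary, which is far simpler than real ART/PRT rules.

```python
from typing import Dict, Optional


def check_search_chain(next_table: Dict[str, Optional[str]], start: str, max_depth: int) -> str:
    """Follow table-to-table links and report 'ok', 'loop', or 'max_depth_exceeded'."""
    visited = set()
    current: Optional[str] = start
    depth = 0
    while current is not None:
        if current in visited:
            return "loop"                 # the search returned to a table it already used
        visited.add(current)
        depth += 1
        if depth > max_depth:
            return "max_depth_exceeded"   # too many tables chained together
        current = next_table.get(current)
    return "ok"


if __name__ == "__main__":
    looping = {"ART-1": "PRT-1", "PRT-1": "ART-1"}
    print(check_search_chain(looping, "ART-1", max_depth=10))
```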


22101 - Connection Unavailable

Alarm Group:DIAM

Description:Connection is unavailable for Diameter Request/Answer exchange with peer.

Note:

This alarm is not raised when the Suppress Connection Unavailable alarm for a Transport Connection is set to Yes.

Alarm 22101 is generated when the connection's administrative state is enabled and the connection is not in a state where it can send or receive Diameter Requests or Answers to/from the peer. The alarm is generated when one of the following occurs.

• Connection's Admin State transitions from disabled to enabled

• Connection's Operational Status transitions from available to unavailable

• Connection's Operational Status transitions from degraded to unavailable

Severity:Major


HA Score:Normal


OID:eagleXgDiameterConnectionUnavailableAlarmNotify



Cause:Alarm #22101 is raised when the connection's administrative state is enabled and the connection is not in a state where it can send or receive Diameter Requests or Answers to/from the peer. The alarm is generated when one of the following occurs:

• Connection's Admin State transitions from disabled to enabled

• Connection's Operational Status transitions from available to unavailable

• Connection's Operational Status transitions from degraded to unavailable

Diagnostic Information:Confirm whether any of the following conditions is occurring:

1. A host IP interface is down

2. A host IP interface is unreachable from the peer

3. A peer IP interface is down

4. A peer IP interface is unreachable from the host

Verify the following are configured and available:

1. Remote IP availability

2. Remote server (port) availability

3. Network availability

4. Local IP route to remote

5. Local MP service availability

6. Configuration correctness, such as CEX parameter matching with remote

1. Recovery:

1. Confirm the host IP interface is down or unreachable from the peer.

2. Confirm the peer IP interface is down or unreachable from the host.

3. Verify the following are configured and available:

• Remote IP availability

• Remote server (port) availability

• Network availability

• Local IP route to remote

• Local MP service availability

• Configuration correctness, such as CEX parameter matching with remote

4. Identify the most recent Connection Unavailable event in the event log for the connection and use the Event's recovery steps to resolve the issue.


22102 - Connection Degraded

Alarm Group:DIAM



Description:Connection is only available for routing messages with a priority greater than or equal to the connection's congestion level. This alarm is generated when:

• Connection congestion when the Peer CNDRA Tx sender buffer is at maximum capacity

• The connection's administrative state is enabled and the connection is in congestion. Requests and Answers continue to be received and processed from the peer over the connection, and attempts to send Answers to the peer still occur. The alarm is raised when one of the following occurs:

– Connection's Operational Status transitions from available to degraded (connection has become congested or watchdog algorithm has failed)

– Connection's Operational Status transitions from unavailable to degraded (connection has successfully completed the capabilities exchange and is performing connection proving)

• Connection egress message rate threshold has been crossed

• Diameter connection is in watchdog proving

• Diameter connection is in graceful disconnect

• Diameter peer signaled the remote is busy

• Diameter connection is in transport congestion

Severity:Major


HA Score:Normal


OID:eagleXgDiameterFsmOpStateDegraded

Cause:This alarm is raised when:

• Connection congestion when the Peer CNDRA Tx sender buffer is at maximum capacity

• The connection's administrative state is enabled and the connection is in congestion. Requests and Answers will continue to be received and processed from the peer over the connection and attempts to send Answers to the peer will still occur. The alarm is raised when one of the following occurs:

– Connection's Operational Status transitions from available to degraded (connection has become congested or watchdog algorithm has failed)



– Connection's Operational Status transitions from unavailable to degraded (connection has successfully completed the capabilities exchange and is performing connection proving)

• Connection egress message rate threshold has been crossed

• Diameter connection is in watchdog proving

• Diameter connection is in graceful disconnect

• Diameter peer signaled that the remote is busy

• Diameter connection is in transport congestion

Diagnostic Information:


1. View the Connection Performance measurement report for the period +/- 1 hour around the congestion event.

2. Examine the Log file by using these commands:

• # date >> tcp_stat_<hostname>

• # cat /proc/net/tcp >> tcp_stat_<hostname>

• # sleep 1


• # sleep 1


• # sleep 1



3. Examine the output of the command netstat -canp --tcp | grep <remote IP:Port for conn> for a few minutes.

4. Examine the corresponding Rx buffer on the connection in question using this command: netstat -canp --tcp | grep <remote IP:Port for conn>. The RxBuffer value is configured using ConnectionCfget.

5. Examine the overall network statistics for other issues using the command netstat -i.

6. Examine the overall network delay using the command ping.

7. View the software release information.
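
Once a tcp_stat_<hostname> file has been collected with the commands above, the send and receive queue depths for one peer can be extracted with a short script. This is a sketch under stated assumptions: the function names are hypothetical, only IPv4 entries from /proc/net/tcp are handled, and the address decoding assumes a little-endian host such as x86. The sample text is made up for illustration.

```python
import socket
import struct
from typing import List, Tuple


def decode_endpoint(field: str) -> Tuple[str, int]:
    """Decode a /proc/net/tcp 'HEXIP:HEXPORT' field (little-endian IPv4 assumed)."""
    ip_hex, port_hex = field.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def queue_depths(snapshot: str, remote_ip: str, remote_port: int) -> List[Tuple[int, int]]:
    """Return (tx_queue, rx_queue) in bytes for each socket to the given remote endpoint."""
    results = []
    for line in snapshot.splitlines():
        fields = line.split()
        # Skip the column header, date stamps, and anything that is not a socket row.
        if len(fields) < 5 or not fields[0].endswith(":") or not fields[0][:-1].isdigit():
            continue
        if decode_endpoint(fields[2]) != (remote_ip, remote_port):
            continue
        tx_hex, rx_hex = fields[4].split(":")
        results.append((int(tx_hex, 16), int(rx_hex, 16)))
    return results


# Made-up sample: one socket from 10.0.0.2 to 10.0.0.5 port 3868 with 1024 bytes queued to send.
SAMPLE = """  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0CEA 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345
   1: 0200000A:8D2E 0500000A:0F1C 01 00000400:00000000 00:00000000 00000000     0        0 23456
"""

if __name__ == "__main__":
    print(queue_depths(SAMPLE, "10.0.0.5", 3868))
```

A tx_queue value that stays high across successive one-second samples suggests that the peer or the network is not draining the connection.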

1. Recovery:

1. View the Connection Performance measurement report for the period +/- 1 hour around the congestion event.

2. Examine the log file by using these commands:



• # sleep 1


• # sleep 1




• # sleep 1



3. Examine the output of the command netstat -canp --tcp | grep <remote IP:Port for conn> for a few minutes.

4. Examine the corresponding Rx buffer on the connection in question using this command: netstat -canp --tcp | grep <remote IP:Port for conn>. The RxBuffer value is configured using ConnectionCfget.

5. Examine the overall network statistics for other issues using the command netstat -i.

6. Examine the overall network delay using the command ping.

7. View the software release information.

8. Identify the most recent Connection Degraded event in the event log for the connection and use the Event's recovery steps to resolve the issue.

9. Have the peer vendor examine their receive buffer usage during the event. If it is 0, the received messages were processed quickly and were not often stored in the receive buffer. In this case, Egress Transport Congestion was due to the peer not processing the message quickly enough (verify by examining the peer's receive buffer), or there is some delay introduced in the network.


22105 - Connection Transmit Congestion

Alarm Group:DIAM

Description:Alarm is raised when the connection transmit buffer is congested; messages are discarded until condition clears. This error indicates the socket write cannot complete without blocking, which signals the socket buffer is currently full.

Severity:Major


HA Score:Normal


OID:eagleXgDiameterConnectionTxCongestionAlarmNotify



Cause:The socket write cannot complete without blocking, signaling that the socket buffer is currently full.

Diagnostic Information:N/A.

1. Recovery:

1. The peer is not able to process the volume of traffic being offered on the connection. Reduce the traffic volume or increase the processing capacity on the peer.


22106 - Ingress Message Discarded: DraWorker Ingress Message Rate Control

Alarm Group:DIAM

Description:An ingress message is discarded due to connection (or DraWorker) ingress message rate exceeding connection (or DraWorker) maximum ingress MPS.

Severity:Major

Instance:<MPHostName>

HA Score:Normal


OID:eagleXgDiameterIngressMessageDiscardedAlarmNotify

Cause:An ingress message is discarded or rejected in the following congestion scenarios:

• Connection maximum message rate exceeded.

• DraWorker maximum message rate exceeded.

Diagnostic Information:


1. From the event history, check the current message rate and the threshold rate for the diameter connection/DAMP node.

2. Check the maximum reserved ingress MPS for the DAMP on the Active Overseer server.

3. Ensure that the ingress MPS is less than the threshold for the diameter connection/DAMP.
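
The relationship between the ingress rate and the maximum ingress MPS can be illustrated with a sliding-window sketch. This guide does not describe the rate-control algorithm that is actually used, so the class below, its name, and its one-second window are purely illustrative assumptions.

```python
from collections import deque


class IngressRateLimiter:
    """Hypothetical sliding-window limiter: at most max_mps messages per window."""

    def __init__(self, max_mps: int, window_seconds: float = 1.0) -> None:
        self.max_mps = max_mps
        self.window_seconds = window_seconds
        self._accepted = deque()  # timestamps of messages accepted in the window

    def offer(self, now: float) -> bool:
        """Return True to accept a message arriving at time 'now', False to discard it."""
        cutoff = now - self.window_seconds
        while self._accepted and self._accepted[0] <= cutoff:
            self._accepted.popleft()
        if len(self._accepted) >= self.max_mps:
            return False
        self._accepted.append(now)
        return True


if __name__ == "__main__":
    limiter = IngressRateLimiter(max_mps=3)
    for t in (0.0, 0.1, 0.2, 0.3, 1.05):
        print(t, limiter.offer(t))
```

Each discard in this model corresponds to a condition under which alarm 22106 would be expected.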

1. Recovery:



1. The ingress MPS on the DraWorker is exceeding the MP Maximum ingress MPS. One or more DraWorkers may be unavailable, with their traffic redistributed to the remaining DraWorkers.

2. See if one or more peers are generating more traffic than is normally expected.

3. Make sure a sufficient number of DraWorkers is provisioned.


22200 - MP CPU Congested

Alarm Group:ExgStack

Description:DraWorker CPU utilization threshold has been exceeded. Potential causes are:

• One or more peers are generating more traffic than is normally expected

• Configuration requires more CPUs for message processing than is normally expected

• One or more peers are answering slowly, causing a backlog of pending transactions

• A DraWorker has failed, causing the redistribution of traffic to the remaining DraWorkers

Severity:Minor, Major, Critical, Warning

Instance:N/A

HA Score:Normal


OID:eagleXgDiameterMpCpuCongestedNotify

Cause:Potential causes are:

• One or more peers are generating more traffic than is normally expected.

• Configuration requires more CPUs for message processing than is normally expected.


• A DraWorker has failed, causing the redistribution of traffic to the remaining DraWorkers.

Diagnostic Information:




1. Observe the ingress traffic rate of each MP.

a. The misconfiguration of server/client routing may result in too much traffic being distributed to the MP. Each MP in the server site should be receiving approximately the same ingress transactions per second.

b. There may be an insufficient number of MPs configured to handle the network traffic load. If all MPs are in congestion, then the traffic load to the server site is exceeding its capacity.

2. Examine the alarm log.

3. Examine the DraWorker status.
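
The expectation that each MP receives approximately the same ingress rate can be checked with a short script once the per-MP rates have been read from the measurement reports. The function name, the 20% tolerance, and the comparison against the median are assumptions chosen for this sketch.

```python
from statistics import median
from typing import Dict, List


def unbalanced_mps(ingress_tps: Dict[str, float], tolerance: float = 0.2) -> List[str]:
    """Return the MPs whose ingress rate differs from the median by more than 'tolerance'."""
    if not ingress_tps:
        return []
    mid = median(ingress_tps.values())
    if mid == 0:
        # Cannot express a relative deviation; flag any MP that carries traffic.
        return sorted(name for name, rate in ingress_tps.items() if rate > 0)
    return sorted(
        name for name, rate in ingress_tps.items()
        if abs(rate - mid) / mid > tolerance
    )


if __name__ == "__main__":
    rates = {"mp1": 1000.0, "mp2": 1050.0, "mp3": 2000.0}
    print(unbalanced_mps(rates))
```

The median is used instead of the mean so that a single overloaded MP does not shift the baseline enough to flag the healthy ones.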

1. Recovery:

1. If one or more MPs in a server site has failed, the traffic is distributed between the remaining MPs in the server site. Monitor the MP server status.

2. The mis-configuration of DIAMETER peers may result in too much traffic being distributed to the MP. Monitor the ingress traffic rate of each MP. Each MP in the server site should be receiving approximately the same ingress transactions per second.


4. The Diameter Process may be experiencing problems. Examine the alarm log.


22201 - MpRxAllRate

Alarm Group:DIAM

Description:DraWorker ingress message rate threshold crossed.


Instance:MpRxAllRate, DIAM

HA Score:Normal


OID:eagleXgDiameterMpRxAllRateNotify

1. Recovery:







22202 - MpDiamMsgPoolCongested

Alarm Group:DIAM

Description:DraWorker Diameter message pool utilization threshold crossed.


Instance:MpDiamMsgPool, DIAM

HA Score:Normal


OID:eagleXgDiameterMpDiamMsgPoolCongestedNotify

1. Recovery:




4. A software defect may exist resulting in PDU buffers not being deallocated to the pool. This alarm should not normally occur when no other congestion alarms are asserted.


22203 - PTR Buffer Pool Utilization

Alarm Group:DIAM



Description:The MP's PTR buffer pool is approaching its maximum capacity. If this problem persists and the pool reaches 100% utilization all new ingress messages will be discarded. This alarm should not normally occur when no other congestion alarms are asserted.


Instance:N/A

HA Score:Normal


OID:eagleXgDiameterPtrBufferPoolUtilNotify

1. Recovery:




4. A software defect may exist resulting in PTR buffers not being deallocated to the pool. This alarm should not normally occur when no other congestion alarms are asserted.


22204 - Request Message Queue Utilization

Alarm Group:DIAM

Description:The MP's Request Message Queue Utilization is approaching its maximum capacity. If this problem persists and the queue reaches 100% utilization all new ingress Request messages will be discarded. This alarm should not normally occur when no other congestion alarms are asserted.


Instance:N/A



HA Score:Normal


OID:eagleXgDiameterRequestMessageQueueUtilNotify

1. Recovery:




4. If no additional congestion alarms are asserted, the Request Task may be experiencing a problem preventing it from processing messages from its Request Message Queue.


22205 - Answer Message Queue Utilization

Alarm Group:DIAM

Description:The MP's Answer Message Queue Utilization is approaching its maximum capacity. If this problem persists and the queue reaches 100% utilization all new ingress Answer messages will be discarded. This alarm should not normally occur when no other congestion alarms are asserted.


Instance:N/A

HA Score:Normal


OID:eagleXgDiameterAnswerMessageQueueUtilNotify

1. Recovery:






4. If no additional congestion alarms are asserted, the Answer Task may be experiencing a problem preventing it from processing messages from its Answer Message Queue.


22206 - Reroute Queue Utilization

Alarm Group:DIAM

Description:The MP's Reroute Queue is approaching its maximum capacity. If this problem persists and the queue reaches 100% utilization any transactions requiring rerouting will be rejected. This alarm should not normally occur when no other congestion alarms are asserted.


Instance:N/A

HA Score:Normal


OID:eagleXgDiameterRerouteQueueUtilNotify

1. Recovery:

1. An excessive amount of Request message rerouting may have been triggered by either connection failures or Answer time-outs.

2. If no additional congestion alarms are asserted, the Reroute Task may be experiencing a problem preventing it from processing messages from its Reroute Queue.


22207 - DclTxTaskQueueCongested

Alarm Group:DIAM



Description:DCL egress task message queue utilization threshold crossed.



HA Score:Normal


OID:eagleXgDiameterDclTxTaskQueueCongested

1. Recovery:

1. The alarm will clear when the DCL egress task message queue utilization fallsbelow the clear threshold. The alarm may be caused by one or more peers beingrouted more traffic than is nominally expected.


22208 - DclTxConnQueueCongested

Alarm Group:DIAM

Description:DCL egress connection message queue utilization threshold crossed.


Instance:<ConnectionName>

HA Score:Normal


OID:eagleXgDiameterDclTxConnQueueCongested

1. Recovery:

1. The alarm will clear when the DCL egress connection message queue utilization falls below the clear threshold. The alarm may be caused by peers being routed more traffic than nominally expected.

2. It is recommended to contact My Oracle Support for further assistance.



22209 - Message Copy Disabled

Alarm Group:DIAM

Description:Diameter Message Copy is disabled.

Severity:Minor

Instance:N/A

HA Score:Normal


OID:eagleXgDiameterMessageCopyDisabledNotify

1. Recovery:




4. The Diameter Process may be experiencing problems.

5. If the problem persists, contact My Oracle Support.

22214 - Message Copy Queue Utilization

Alarm Group:DIAM

Description:The DraWorker's Message Copy queue utilization is approaching its maximum capacity.


Instance:N/A



HA Score:Normal


OID:eagleXgDiameterMsgCopyQueueUtilNotify

1. Recovery:

1. Reduce traffic to the MP.

2. Verify that no network issues exist between the DraWorker and the intended DAS peer(s).

3. Verify that the intended DAS peer has sufficient capacity to process the traffic load being routed to it.


22221 - Routing MPS Rate

Alarm Group:DIAM

Description:Message processing rate for this MP is approaching or exceeding its engineered traffic handling capacity. The routing MPS rate (messages per second) is approaching or exceeding the engineered traffic handling capacity for the MP.


Instance:N/A

HA Score:Normal


OID:eagleXgDiameterRoutingMpsRateNotify

1. Recovery:

1. If one or more MPs in a server site have failed, the traffic will be distributed amongst the remaining MPs in the server site.

2. The misconfiguration of Diameter peers may result in too much traffic being distributed to the MP.

Each MP in the server site should be receiving approximately the same ingress transactions per second.

3. There may be an insufficient number of MPs configured to handle the network traffic load.



If all MPs are in a congestion state, then the ingress message rate to the MP is exceeding its capacity to process the messages.


22222 - Long Timeout PTR Buffer Pool Utilization

Alarm Group:DIAM

Description:The MP's Long Timeout PTR buffer pool is approaching its maximum capacity.


Instance:N/A

HA Score:Normal


OID:eagleXgDiameterLongTimeoutPtrBufferPoolUtilNotify

1. Recovery:

1. If one or more MPs in a server site have failed, the traffic will be distributed amongst the remaining MPs in the server site.

2. The misconfiguration of Pending Answer Timer assignment may result in excessive traffic being assigned to the Long Timeout PTR buffer pool.

3. The misconfiguration of Diameter peers may result in too much traffic being distributed to the MP. Each MP in the server site should be receiving approximately the same ingress transactions per second.


5. A software defect may exist resulting in Long Timeout PTR buffers not being de-allocated to the pool. This alarm should not normally occur when no other congestion alarms are asserted. Examine the alarm log.


22223 - DraWorker Memory Utilization Threshold Crossed

Alarm Group:DIAM

Description:DraWorker memory utilization threshold crossed.




Instance:System.RAM_UtilPct, Peer CNDRA

HA Score:Normal

Auto Clear Seconds:0 (zero, no auto clear)

OID:eagleXgDiameterMpMemCongestedNotify

Cause:Following are the potential causes:

• One or more peers are generating more traffic than expected.

• Configuration requires more Physical Memory for message processing than expected.


• A DraWorker failed, causing the redistribution of traffic to the remaining DraWorkers.

Diagnostic Information:To diagnose the cause:

1. Monitor the ingress traffic rate of each MP.

• The misconfiguration of server/client routing may result in too much traffic being distributed to the MP. Each MP in the server site should be receiving approximately the same ingress transactions per second.

• There may be an insufficient number of MPs configured to handle the network traffic load. If all MPs are in congestion, then the traffic load to the server site is exceeding its capacity.

2. Examine the alarm log.

3. Examine the DraWorker status.

1. Recovery:

1. Analyze and correct routing so the traffic load is balanced between MPs.

2. If all MPs are approaching or exceeding their engineered traffic handling capacity, add more MPs to the system and configure connections and routes to distribute traffic to new DraWorkers.
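The balance check behind these steps ("each MP should be receiving approximately the same ingress transactions per second") can be sketched as follows. This is a minimal illustration, not a product tool: the function name `unbalanced_mps`, the sample MP names, and the 20% tolerance are assumptions, and the per-MP rates would have to be collected from the KPI or measurement reports available in the deployment.

```python
from statistics import median
from typing import Dict, List


def unbalanced_mps(rates: Dict[str, float], tolerance: float = 0.20) -> List[str]:
    """Return the MPs whose ingress rate deviates from the site median
    by more than `tolerance` (a hypothetical 20% by default)."""
    if not rates:
        return []
    mid = median(rates.values())
    if mid == 0:
        # No meaningful baseline; flag every MP that carries any traffic.
        return sorted(name for name, rate in rates.items() if rate > 0)
    return sorted(
        name for name, rate in rates.items()
        if abs(rate - mid) / mid > tolerance
    )


# Hypothetical ingress transactions per second for three MPs.
sample = {"mp1": 1000.0, "mp2": 1050.0, "mp3": 1900.0}
print(unbalanced_mps(sample))
```

The median is used as the baseline rather than the mean so that a single overloaded MP does not shift the reference point enough to make the healthy MPs look unbalanced.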


22224 - Average Hold Time Limit Exceeded

Alarm Group:DIAM



Description:The average transaction hold time has exceeded its configured limits.

This alarm is generated when KPI #10098 (TmAvgRspTime) exceeds Peer CNDRA-wide engineering attributes associated with average hold time, defined in the DraWorker profile assigned to the DraWorker server. KPI #10098 is defined as the average time (in milliseconds) from when the routing layer (DRL) receives a request message from a downstream peer to the time that an answer response is sent to that downstream peer. The source measurement of KPI #10098 is the TmResponseTimeDownstreamMp (10093) measurement.

This alarm indicates the average response time (TmAvgRspTime) for messages forwarded by the Relay Agent is larger than what is defined for a deployment as per DraWorker profile assignment. One of these problems could exist:

• The IP network may be experiencing problems that are adding propagation delays to the forwarded request message and the answer response.

– Verify the IP network connectivity exists between the MP server and the adjacent nodes.

– View the event history logs for additional events or alarms from this MP server.

• One or more upstream nodes may be experiencing traffic overload.

• One or more MPs are experiencing traffic overload.

– View the KPI Routing Recv Msgs/Sec.

– View the CPU utilization of MPs.


Instance:N/A

HA Score:Normal


OID:eagleXgDiameterAvgHoldTimeLimitExceededNotify

Cause:Alarm 22224 is generated when KPI #10098 (TmAvgRspTime) exceeds Peer CNDRA-wide engineering attributes associated with average hold time, defined in the DraWorker profile assigned to the DraWorker server. KPI #10098 is defined as the average time (in milliseconds) from when the routing layer (DRL) receives a request message from a downstream peer to the time that an answer response is sent to that downstream peer. The source measurement of KPI #10098 is the TmResponseTimeDownstreamMp (10093) measurement.

The alarm thresholds are configurable for:

• Average hold time minor alarm onset threshold

• Average hold time minor alarm abatement threshold

• Average hold time major alarm onset threshold



• Average hold time major alarm abatement threshold

• Average hold time critical alarm onset threshold

• Average hold time critical alarm abatement threshold

The severity of the alarm (Minor, Major, or Critical) is determined by the onset threshold and abatement threshold of each severity level. When the average hold time initially exceeds an alarm onset threshold, a minor, major, or critical alarm is triggered. When the average hold time subsequently exceeds a higher onset threshold, or drops below an abatement threshold but is still above the minor alarm abatement threshold, the alarm severity changes based on the highest onset threshold crossed by the current average hold time.
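The onset/abatement behavior described above is a form of hysteresis, and it can be sketched roughly as follows. This is only an illustration under stated assumptions: the threshold values (in milliseconds) and the function name `next_severity` are hypothetical, real thresholds come from the DraWorker profile, and the choice to hold the alarm at minor when the value sits between the minor abatement and minor onset thresholds is an interpretation of the text rather than documented behavior.

```python
from typing import Optional

LEVELS = ("minor", "major", "critical")  # ordered lowest to highest

# Hypothetical thresholds in milliseconds (not taken from this guide).
ONSET = {"minor": 500, "major": 1000, "critical": 1500}
ABATEMENT = {"minor": 400, "major": 900, "critical": 1400}


def next_severity(current: Optional[str], avg_hold_ms: float) -> Optional[str]:
    """Return the alarm severity after observing a new average hold time.

    `current` is the severity already asserted, or None if no alarm is active.
    """
    # Highest onset threshold exceeded by the current value, if any.
    crossed = None
    for level in LEVELS:
        if avg_hold_ms > ONSET[level]:
            crossed = level

    if current is None:
        return crossed  # initial onset (or still no alarm)

    if crossed is not None and LEVELS.index(crossed) > LEVELS.index(current):
        return crossed  # escalate to the higher severity

    if avg_hold_ms < ABATEMENT[current]:
        if avg_hold_ms < ABATEMENT["minor"]:
            return None  # below the minor abatement threshold: clear
        # Assumption: with no onset threshold exceeded, hold at minor.
        return crossed or "minor"

    return current  # between abatement and onset: no change
```

For example, with these sample thresholds an active major alarm stays major at 950 ms (still above the major abatement threshold of 900 ms), drops to minor at 850 ms, and clears only once the average falls below 400 ms.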

Diagnostic Information:If Alarm #22224 is raised, it indicates that the average response time (TmAvgRspTime) for messages forwarded by the Relay Agent is larger than the value defined for a deployment as per DraWorker profile assignment. One of the following problems could exist:


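• The IP network may be experiencing problems that are adding propagation delays to the forwarded request message and the answer response.

– Verify the IP network connectivity exists between the MP server and the adjacent nodes.

– View the event history logs for additional events or alarms from this MP server.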


• One or more upstream nodes may be experiencing traffic overload.

• One or more MPs are experiencing traffic overload.

– View the KPI Routing Recv Msgs/Sec.

– View the CPU utilization of MPs.

1. Recovery:

1. The average transaction hold time is exceeding its configured limits, resulting in an abnormally large number of outstanding transactions that may be leading to excessive use of resources like memory.

• Reduce the average hold time by examining the configured Pending Answer Timer values and reducing any values that are unnecessarily large or small.

• Identify the causes for the large average delay between the Peer CNDRA sending requests to the upstream peers and receiving answers for the requests.

• Confirm the peer node(s) or Peer CNDRA is in overload by viewing KPI/Measurements/CPU usage and take corrective action.

• Identify the main contributor to increased value of (T2-T1) such as a time difference between the routing layer (DRL) receiving the request to the DRL sending the answer to downstream peer.

2. The alarm thresholds are configurable for:









The severity of the alarm (Minor, Major, or Critical) is determined by the onset threshold and abatement threshold of each severity level. When the average hold time initially exceeds an alarm onset threshold, a minor, major, or critical alarm is triggered. When the average hold time subsequently exceeds a higher onset threshold, or drops below an abatement threshold but is still above the minor alarm abatement threshold, the alarm severity changes based on the highest onset threshold crossed by the current average hold time.


22225 - Average Message Size Limit Exceeded

Alarm Group:DIAM

Description:The size of the average message processed by Peer CNDRA has exceeded its configured limits.

The alarm is generated when the measurement RxAvgMsgSize reaches the Peer CNDRA-wide engineering attributes, defined in the DaMpProfileParameters corresponding to the MP profile being used. RxAvgMsgSize is defined as the size of the average message processed by Peer CNDRA.

This alarm indicates Peer CNDRA has encountered a message it can accept for processing, but might not continue processing if the message size increases more than the maximum supported message size. This increase can be due to standard diameter processing (for example, Route Record additions to requests) or due to custom processing (for example, Mediation modifying AVPs).


Instance:N/A

HA Score:Normal


OID:eagleXgDiameterAvgMsgSizeLimitExceededNotify

Cause:Alarm 22225 is raised when the measurement RxAvgMsgSize reaches the Peer CNDRA-wide engineering attributes, defined in the DaMpProfileParameters corresponding to the MP profile being used. RxAvgMsgSize is defined as the size of the average message processed by Peer CNDRA.

The alarm thresholds are configurable for:



• Average message size minor alarm onset threshold

• Average message size minor alarm abatement threshold

• Average message size major alarm onset threshold

• Average message size major alarm abatement threshold

• Average message size critical alarm onset threshold

• Average message size critical alarm abatement threshold

The severity of the alarm (Minor, Major, or Critical) is determined by the onset/abatement threshold of each severity level. When the average message size reaches the value of the respective alarm onset/abatement threshold, within 3 seconds the alarm is raised with severity Minor, Major, or Critical, based on the value reached by the average message size.

Diagnostic Information:This event indicates that Peer CNDRA has encountered a message that it can accept for processing, but might not continue processing if the message size increases more than the maximum supported message size. This increase can be due to standard diameter processing (for example, Route Record additions to requests) or due to custom processing (for example, Mediation modifying AVPs).

1. Recovery:

1. Examine the traffic coming from connected peers to see if any of them are sending abnormally large messages, and look for any special processing rules being applied by Peer CNDRA to that message.

2. The alarm thresholds are configurable for:







The severity of the alarm (Minor, Major, or Critical) is determined by the onset threshold and abatement threshold of each severity level. When the average message size initially exceeds an alarm onset threshold, a minor, major, or critical alarm is triggered. When the average message size subsequently exceeds a higher onset threshold, or drops below an abatement threshold but is still above the minor alarm abatement threshold, the alarm severity changes based on the highest onset threshold crossed by the current average message size.


22328 - Connection is processing a higher than normal ingress messaging rate

Alarm Group:DIAM



Description:The diameter connection specified in the alarm instance is processing a higher than normal ingress messaging rate.

Severity:

• Minor (if all of the following are true):

– The average ingress MPS rate the connection is processing has reached the percentage of the connection's maximum ingress MPS rate configured for the connection minor alarm threshold.

– The average ingress MPS rate the connection is processing has not yet reached the percentage of the connection's maximum ingress MPS rate configured for the connection major alarm threshold.

• Major (if the following are true):

– The average ingress MPS rate the connection is processing has reached the percentage of the connection's maximum ingress MPS rate configured for the connection major alarm threshold.

Instance:The name of the diameter connection as defined by the TransportConnection table

HA Score:Normal


OID:eagleXgDiameterIngressMpsRateNotify

Cause:Alarm #22328 is raised with one of the following severities:

Minor (if all of the following are true):

• The average ingress MPS rate that the connection is processing has reached the percentage of the connection's maximum ingress MPS rate configured for the connection minor alarm threshold.

• The average ingress MPS rate that the connection is processing has not yet reached the percentage of the connection's maximum ingress MPS rate configured for the connection major alarm threshold.

Major (if the following is true):

• The average ingress MPS rate that the connection is processing has reached the percentage of the connection's maximum ingress MPS rate configured for the connection major alarm threshold.

Diagnostic Information:To get further information regarding this issue:




2. Get the Connection ID IcRate[Connection_Id] from Alarm Details and the corresponding Connection Name from TransportConnectionTable on the active Overseer server.

3. Investigate the connection's remote Diameter peer (the source of the ingress messaging) to determine why it is sending the abnormally high traffic rate.

1. Recovery:

1. The Diameter connection specified in the Alarm Instance field is processing a higher than expected average ingress Diameter message rate. The alarm thresholds for minor and major alarms are configured in the Capacity Configuration Set used by the Diameter connection.

2. The message rate used for this alarm is an exponentially smoothed 30-second average. This smoothing limits false alarms due to short duration spikes in the ingress message rate.

3. If the alarm severity is minor, the alarm means the average ingress message rate has exceeded the minor alarm threshold percentage of the maximum ingress MPS configured for the connection.

4. If the alarm severity is major, the alarm means the average ingress message rate has exceeded the major alarm threshold percentage of the maximum ingress MPS configured for the connection.

5. This alarm is cleared when the average ingress message rate falls 5% below the minor alarm threshold, or the connection becomes disabled or disconnected. This alarm is downgraded from major to minor if the average ingress message rate falls 5% below the major alarm threshold.

6. If the average ingress message rate is determined to be unusually high, investigate the connection's remote Diameter peer (the source of the ingress messaging) to determine why it is sending the abnormally high traffic rate; otherwise, consider increasing either the connection's maximum ingress MPS rate or the connection's alarm thresholds.
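The smoothing and the 5% clear margin described in these steps can be illustrated with a short sketch. The guide does not give the smoothing formula, so several things here are assumptions: the function names `smooth` and `ingress_severity`, the smoothing factor `1 - exp(-dt / 30)`, and the reading of "5% below" as a relative margin (95% of the threshold rate) rather than five percentage points.

```python
import math
from typing import Optional


def smooth(prev_avg: float, sample_mps: float, dt_seconds: float,
           window_seconds: float = 30.0) -> float:
    """Exponentially smoothed average with an assumed 30-second time constant."""
    alpha = 1.0 - math.exp(-dt_seconds / window_seconds)
    return prev_avg + alpha * (sample_mps - prev_avg)


def ingress_severity(current: Optional[str], avg_mps: float, max_mps: float,
                     minor_pct: float, major_pct: float,
                     margin: float = 0.05) -> Optional[str]:
    """Return 'minor', 'major', or None for the smoothed ingress rate.

    minor_pct and major_pct are percentages of the connection's maximum
    ingress MPS; `current` is the severity already asserted, if any.
    """
    minor_rate = max_mps * minor_pct / 100.0
    major_rate = max_mps * major_pct / 100.0

    if avg_mps >= major_rate:
        return "major"
    # Stay major until the rate falls 5% below the major threshold.
    if current == "major" and avg_mps >= major_rate * (1.0 - margin):
        return "major"
    if avg_mps >= minor_rate:
        return "minor"
    # Stay asserted until the rate falls 5% below the minor threshold.
    if current is not None and avg_mps >= minor_rate * (1.0 - margin):
        return "minor"
    return None
```

With a hypothetical maximum of 1000 MPS and thresholds of 50% and 80%, an active minor alarm persists at 480 MPS and clears at 470 MPS. A short spike barely moves the smoothed average, which is how the smoothing limits false alarms.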


22350 - Fixed Connection Alarm Aggregation Threshold

Alarm Group:DIAM

Description:This alarm occurs when there are a critical number of fixed connection alarms for the DraWorker.

Severity:Major, Critical

Note:

The Critical threshold may be disabled by setting the Critical Threshold to zero.



Instance:<DraWorker-Hostname>

HA Score:Normal


OID:eagleXgDiameterConnUnavailableThresholdReachedNotify

Cause:Alarm #22350 is raised when there are a critical number of fixed connection alarms for the DraWorker.


1. Find all the connections with a problem for the specific MP.

2. For each connection with a problem, verify:

a. The remote host is reachable from the local MP by using ssh to the MP and pinging the remote server IP (if using IP address) or server FQDN (if using FQDN)

b. DNS availability should be tested by pinging the DNS server IP

c. FQDN resolving should be tested by using nslookup to check the FQDN resolving on the MP

3. If the above tests reveal the remote host is not reachable, then verify that there is no network problem on the remote server.

4. If the remote server is reachable, then verify the processes are running correctly.

a. Verify the local Peer CNDRA process is running by checking the ps -ef output

b. Verify the local node is listening on the correct port by using netstat -na and checking that the correct transport type and TCP/SCTP port is listening

c. Use wireshark or tcpdump to capture traffic messages, and verify the connection is established (confirm the handshake process is occurring for SCTP or TCP)

5. If the port is not listening, or the handshake procedure is not occurring, then the process or server may be in trouble.

6. If the connection/association is established, then ensure that the Diameter handshake is happening and correct, by checking the Diameter CEX message exchange, for information like server FQDN, IP address, or applications supported; mismatching information causes the connection to abort.

7. If the Diameter handshake is good, then observe the health of the Diameter connection by verifying the DWR messages are answered correctly.
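When several connections have to be checked, the name-resolution and TCP reachability parts of the steps above can be scripted. The following is a minimal sketch using only the Python standard library; the function names `resolve_fqdn` and `tcp_port_open` are illustrative, and the host name in the usage comment is a placeholder. It covers TCP only (not SCTP associations) and does not replace the ping, nslookup, netstat, or packet-capture checks listed above.

```python
import socket
from typing import List


def resolve_fqdn(fqdn: str) -> List[str]:
    """Return the unique IP addresses the name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(fqdn, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake with host:port completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (placeholder peer name; 3868 is the standard Diameter port):
#   addresses = resolve_fqdn("peer.example.com")
#   reachable = tcp_port_open("peer.example.com", 3868)
```

A successful TCP handshake only shows that something is listening on the port; the CEX and DWR exchanges still need to be verified as described in steps 6 and 7.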

1. Recovery:

1. Check Fixed Connection status.



2. Confirm the peer connection configuration (protocol, remote/local IP address, remote/local port) matches the local connection configuration.

3. Confirm the connection’s transport protocol and/or port are not being blocked by a network firewall or other ACL in the network path.

4. Verify the peers in the Route List are not under maintenance.

5. Modify the value of Alarm Threshold Options if it is set too low.


22900 - DPI DB Table Monitoring Overrun

Event Type:DIAM

Description:The COMCOL update sync log used by DB Table monitoring to synchronize Diameter Connection Status among all DraWorker RT-DBs has overrun. The DraWorker's Diameter Connection Status sharing table is automatically audited and re-synced to correct any inconsistencies.

Severity:Info

Instance:<DbTblName>

Note:

<DbTblName> refers to the name of the Diameter Connection Status Sharing Table in which the Diameter Connection status inconsistency was detected.

HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterDpiTblMonCbOnLogOverrunNotify

1. Recovery:

• It is recommended to contact My Oracle Support if this alarm is constantly being asserted and cleared.

22901 - DPI DB Table Monitoring Error

Event Type:DIAM



Description:An unexpected error occurred during DB Table Monitoring.

Severity:Info

Instance:DpiTblMonThreadName

HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterDpiSldbMonAbnormalErrorNotify

1. Recovery:

• It is recommended to contact My Oracle Support.

22950 - Connection Status Inconsistency Exists

Alarm Group:DIAM

Description:Diameter Connection status inconsistencies exist among the DraWorkers in the Peer CNDRA signaling NE.

Severity:Critical

Instance:<DbTblName> Name of the Diameter Connection Status Sharing Table where the Diameter Connection status inconsistency was detected.

HA Score:Normal


OID:eagleXgDiameterConnStatusInconsistencyExistsNotify

Cause:The data inconsistency might have been caused by one of the following:

• A network issue: the change log is not distributed to the destination MP.

• A process error (the update is disturbed) in executing the change on the destination MP.

Diagnostic Information:No specific diagnostic information is required if the alarm clears in the next audit/sync. Analyze the error log if the problem persists.



1. Recovery:


Note:

DraWorker's SLDB tables are automatically audited and re-synchronized to correct inconsistencies after a log overrun has occurred. The Automatic Data Integrity Check, which was introduced in cm6.2, periodically scans almost the entire local IDB for integrity. The initial default period is 30 minutes.

22961 - Insufficient Memory for Feature Set

Alarm Group:DIAM

Description:The available memory (in kilobytes) for the feature set is less than the required memory (in kilobytes). This alarm is raised when a DA-MP is brought into service and the DiameterMaxMessageSize value configured for the DA-MP in the DpiOption table is greater than 16KB, but the available memory on the DA-MP is less than 48GB.

Severity:Critical

Instance:N/A

HA Score:Normal


OID:eagleXgDiameterInsufficientAvailMemNotify

Cause:Alarm #22961 is raised when a DA-MP is brought into service and the DiameterMaxMessageSize value configured for the DA-MP in the DpiOption table is greater than 16KB, but the available memory on the DA-MP is less than 48GB.

Diagnostic Information:N/A.

1. Recovery:

1. Make additional memory available on the DA-MP for the configured DiameterMaxMessageSize.




25612 - Peer CNDRA ping failed

Alarm Group:DIAM

Description:Connection was rejected due to the DraWorker exceeding its connection or ingress MPS capacity

Severity:Major

Instance:pingAllLivePeers

HA Score:Normal

Auto Clear Seconds:N/A

OID:eagleXgDiameterPingAllLivePeerErrorNotify

1. Recovery:

1. Check /var/log/messages and /var/log/cron for more information.

2. Run pingAllLivePeers -v and pingAllLivePeers -h as root on the command line.

3. If the problem persists, it is recommended to contact My Oracle Support for assistance.

25613 - Peer Node Alarm Group Threshold

Event Type:DIAM

Description:Peer Node Alarm Group Threshold Reached. This alarm occurs when there are a number of minor, major, or critical Peer Node alarms for a single Peer Node Alarm Group.

Severity:Minor, Major, and Critical

Instance:<PeerNodeAlarmGroupName>

HA Score:Normal




OID:eagleXgDiameterPeerNodeAlarmGroupThresholdReachedNotify

1. Check status of Peer nodes.



4. It is recommended to contact My Oracle Support if further assistance is needed.

25614 - Connection Alarm Group Threshold

Event Type:DIAM

Description:Connection Alarm Group Threshold Reached. This alarm occurs when there are a number of minor, major, or critical Connection alarms for a single Connection Alarm Group.

Severity:Minor, Major, and Critical

Instance:<ConnectionAlarmGroupName>

HA Score:Normal


OID:eagleXgDiameterConnectionAlarmGroupThresholdReachedNotify

1. Check Connections status.


3. Verify the connection is not under maintenance.

4. It is recommended to contact My Oracle Support if further assistance is needed.

25806 - Invalid Internal Overseer Server Group Designation

Alarm Group:DIAM

Description:Invalid Internal Overseer Server Group Designation

Severity:Minor



Instance:<Route List Name>&<Route Group Name>&<TTG SG Name>&<TTG Name>

HA Score:Normal

Auto Clear Seconds:N/A

OID:eagleXgDiameterDoicInvalidInternalSoamSgDesignationNotify

1. Recovery:

• For the Route List named in the alarm instance, edit its configuration and delete the association to the Shared TTG. This will clear the alarm. The association can simply be re-added to restore integrity to the configuration.

Range Based Address Resolution (RBAR) Alarms and Events (22400-22424)

22400 - Message Decoding Failure

Event Type:RBAR

Description:A message received was rejected because of a decoding failure.

Severity:Info

Instance:<MPName>

HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterRbarMsgRejectedDecodingFailureNotify

1. Recovery:

• While parsing the message, the message content was inconsistent with the Message Length in the message header. These protocol violations can be caused by the originator of the message (identified by the Origin-Host AVP in the message) or the peer who forwarded the message to this node.


22401 - Unknown Application ID

Event Type:RBAR

Description:A message could not be routed because the Diameter Application ID is not supported.

Severity:Info

Instance:<MPName>

HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterRbarUnknownApplIdNotify

1. Recovery:

1. The Peer CNDRA Relay Agent forwarded a Request message to the address resolution application which contained an unrecognized Diameter Application ID in the header. Either a Peer CNDRA Relay Agent application routing rule is misprovisioned or the Application ID is not provisioned in the RBAR routing configuration.

2. Check the currently provisioned Diameter Application IDs.

3. Check the currently provisioned Application Routing Rules.

22402 - Unknown Command Code

Event Type:RBAR

Description:A message could not be routed because the Diameter Command Code in the ingress Request message is not supported and the Routing Exception was configured to send an Answer response.

Severity:Info

Instance:<MPName>

HA Score:Normal



Throttle Seconds:10

OID:eagleXgDiameterRbarUnknownCmdCodeNotify

1. Recovery:

1. The ordered pair (Application ID, Command Code) is not provisioned in the Address Resolutions routing configuration.

2. Check the currently provisioned Application IDs and Command Codes.

22403 - No Routing Entity Address AVPs

Event Type:RBAR

Description:A message could not be routed because no address AVPs were found in the message and the Routing Exception was configured to send an Answer response.

Severity:Info

Instance:<AddressResolution>

HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterRbarNoRoutingEntityAddrAvpNotify

1. Recovery:

1. This may be a normal event or an event associated with misprovisioned address resolution configuration. If this event is considered abnormal, validate which AVPs are configured for routing with the Application ID and Command Code.


22404 - No valid Routing Entity Addresses found

Event Type:RBAR

Description:A message could not be routed because none of the address AVPs contained a valid address and the Routing Exception was configured to send an Answer response.

Severity:Info




HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterRbarNoValidRoutingEntityAddrFoundNotify

1. Recovery:

1. This may be a normal event or an event associated with misprovisioned address resolution configuration. If this event is considered abnormal, validate which AVPs are configured for routing with the Application ID and Command Code.


22405 - Valid address received didn’t match a provisioned address or address range

Event Type:RBAR

Description:A message could not be routed because a valid address was found that did not match an individual address or address range associated with the Application ID, Command Code, and Routing Entity Type, and the Routing Exception was configured to send an Answer response.

Severity:Info


HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterRbarAddrMismatchWithProvisionedAddressNotify

1. Recovery:

1. An individual address or address range associated with the Application ID, Command Code and Routing Entity Type may be missing from the RBAR configuration. Validate which address and address range tables are associated with the Application ID, Command Code and Routing Entity Type.

2. View the currently provisioned Application IDs, Command Codes, and Routing Entity Types by selecting RBAR, and then Configuration, and then Address Resolutions.
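To make the lookup behind this event concrete, the sketch below models address resolution as a table keyed by (Application ID, Command Code, Routing Entity Type) that holds individual addresses and address ranges. It is a conceptual illustration only, not the RBAR implementation: the class and function names, the destination names, and the sample addresses are hypothetical, and checking individual addresses before ranges is an assumption. Ranges are assumed not to overlap.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, int, str]  # (Application ID, Command Code, Routing Entity Type)


class AddressTable:
    """Individual addresses plus non-overlapping inclusive address ranges."""

    def __init__(self, individual: Dict[int, str],
                 ranges: List[Tuple[int, int, str]]) -> None:
        self.individual = dict(individual)
        self.ranges = sorted(ranges)                    # sorted by range start
        self._starts = [start for start, _, _ in self.ranges]

    def resolve(self, address: int) -> Optional[str]:
        # Assumption: an exact individual match wins over a range match.
        if address in self.individual:
            return self.individual[address]
        # Find the last range whose start is <= address, then check its end.
        i = bisect_right(self._starts, address) - 1
        if i >= 0:
            start, end, destination = self.ranges[i]
            if start <= address <= end:
                return destination
        return None  # the situation reported by event 22405


def resolve(tables: Dict[Key, AddressTable], app_id: int, command_code: int,
            entity_type: str, address: int) -> Optional[str]:
    table = tables.get((app_id, command_code, entity_type))
    return table.resolve(address) if table is not None else None


# Hypothetical provisioning for one (Application ID, Command Code, type) key.
tables = {
    (16777251, 316, "IMSI"): AddressTable(
        individual={310150123456789: "HSS-1"},
        ranges=[(310150000000000, 310150999999999, "HSS-2"),
                (311000000000000, 311999999999999, "HSS-3")],
    )
}
```

An address that falls in a gap between the provisioned ranges resolves to nothing, which is why the recovery steps focus on checking the address and address range tables for the affected key.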



22406 - Routing attempt failed due to internal resource exhaustion

Event Type:RBAR

Description:A message could not be routed because the internal "Request Message Queue" to the Peer CNDRA Relay Agent was full. This should not occur unless the MP is experiencing local congestion as indicated by Alarm-ID 22200 - MP CPU Congested.

Severity:Info

Instance:<MPName>

HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterRbarRoutingAttemptFailureInternalResExhNotify

1. Recovery:

• If this problem occurs, it is recommended to contact My Oracle Support.

22407 - Routing attempt failed due to internal database inconsistency failure

Event Type:RBAR

Description:A message could not be routed because an internal address resolution run-time database inconsistency was encountered.

Severity:Info

Instance:<MPName>

HA Score:Normal

Throttle Seconds:10

OID:eagleXgDiameterRbarRoutingFailureInternalDbInconsistencyNotify



1. Recovery:

• If this problem occurs, it is recommended to contact My Oracle Support.

22411 - Address Range Lookup for Local Identifier skipped

Alarm Group:RBAR

Description:Address Range Lookup could not be performed for the Local Identifier component of the Routing Entity Type External Identifier. Address Resolution used the Destination found using Domain Identifier.

Severity:Info

Instance:xxx

HA Score:Normal


OID:xxx

1. Recovery:

• It is recommended to contact My Oracle Support for assistance if needed.

Generic Application Alarms and Events (22500-22599)

Note:

These alarms are generic across the various Peer CNDRA applications with some details varying depending on the application generating the alarm.

22500 - Peer CNDRA Application Unavailable

Alarm Group:APPL

Description:Peer CNDRA application is unable to process any messages because it is unavailable.

Severity:Critical



Note:

The value for Peer CNDRA Application Name varies depending on the Peer CNDRA application generating the alarm such as RBAR. Use the name that corresponds to the specific Peer CNDRA application in use.

HA Score:Normal


OID:eagleXgDiameterCndraApplicationUnavailableNotify

Cause:Alarm #22500 is raised:

• When the Peer CNDRA application completes initialization and determines its operational status is unavailable after changing its admin state from disabled to enabled.

• When the Peer CNDRA application is in enabled state and the following Peer CNDRA application operational status changes occur:

– Available → Unavailable

– Degraded → Unavailable

This alarm is cleared:

• When the Peer CNDRA application is in enabled state and the following Peer CNDRA application operational status changes occur:

– Unavailable → Available

– Unavailable → Degraded

• If the Diameter process is stopped.

• If the Peer CNDRA application admin state changes from Enabled to Disabled.


• A Peer CNDRA application operation status becomes unavailable when either the Admin State is set to Disable with the Forced Shutdown option, or the Admin State is set to Disable with the Graceful Shutdown option and the Graceful Shutdown timer expires.

• A Peer CNDRA application can also become unavailable when it reaches Congestion Level 3 if enabled.



Note:

This alarm is NOT raised when the Peer CNDRA application is shutting down gracefully or the application is in Disabled state. Only the Peer CNDRA Application operational status is changed to unavailable.

1. Recovery:

1. Display and monitor the Peer CNDRA application status. Verify the Admin State is set as expected.

2. A Peer CNDRA application operation status becomes unavailable when either the Admin State is set to disable with the Forced Shutdown option, or the Admin State is set to disable with the Graceful Shutdown option and the Graceful Shutdown timer expires.
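The raise and clear conditions for alarm 22500 amount to a small transition table on the operational status. The sketch below is an illustrative model of that table only; the function name `alarm_22500_action` and the use of `None` to represent "no previous status" (initialization just completed after enabling) are assumptions, and the additional clear conditions (Diameter process stopped, admin state changed to Disabled) are not modeled.

```python
from typing import Optional

# Operational status changes that raise or clear alarm 22500 while the
# application's admin state is Enabled. None as the old status stands for
# "initialization just completed" (an assumption made for this sketch).
RAISE_TRANSITIONS = {
    (None, "Unavailable"),
    ("Available", "Unavailable"),
    ("Degraded", "Unavailable"),
}
CLEAR_TRANSITIONS = {
    ("Unavailable", "Available"),
    ("Unavailable", "Degraded"),
}


def alarm_22500_action(admin_enabled: bool, old_status: Optional[str],
                       new_status: str) -> Optional[str]:
    """Return 'raise', 'clear', or None for an operational status change."""
    if not admin_enabled:
        return None  # the alarm is not raised while the application is Disabled
    if (old_status, new_status) in RAISE_TRANSITIONS:
        return "raise"
    if (old_status, new_status) in CLEAR_TRANSITIONS:
        return "clear"
    return None
```

For example, a change from Degraded to Unavailable with the admin state Enabled yields "raise", while the same change during a disable (admin state no longer Enabled) yields no action, matching the note that the alarm is not raised during shutdown.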


22501 - Peer CNDRA Application Degraded

Alarm Group:APPL

Description:Unable to forward requests to the Peer CNDRA application because it is degraded.

Severity:Major


Note:

The value for Peer CNDRA Application Name varies depending on the Peer CNDRA application generating the alarm, such as RBAR. Use the name that corresponds to the specific Peer CNDRA application in use.

HA Score: Normal

OID: eagleXgDiameterCndraApplicationDegradedNotify

Cause: Alarm #22501 is raised when the Peer CNDRA application is in enabled state and the following Peer CNDRA application operational status changes occur:

• Available → Degraded



• Unavailable → Degraded

This alarm is cleared when the Peer CNDRA application is in enabled state and the following Peer CNDRA application operational status changes occur:

• Degraded → Available

• Degraded → Unavailable


• A Peer CNDRA application becomes degraded when the Peer CNDRA application becomes congested, if enabled. This alarm is NOT raised when the Peer CNDRA application is shutting down gracefully or the application is in the disabled state.

• Verify the admin state is set as expected. Check the Event History logs for additional DIAM events or alarms from this MP server.

Recovery:

1. Check the Peer CNDRA application status. Verify the Admin State is set as expected.

2. A Peer CNDRA application becomes degraded when the Peer CNDRA application becomes congested, if enabled.

Note:

This alarm is NOT raised when the Peer CNDRA application is shutting down gracefully or the application is in the disabled state. Only the Peer CNDRA application operational status is changed to unavailable.

3. Check the Event History logs for additional DIAM events or alarms for this MP server.
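When reviewing the Event History, it can help to check each logged status change against the two transition lists given under Cause for alarm 22501. The following Python sketch writes those lists as lookup sets. The names RAISE_22501, CLEAR_22501, and alarm_22501_action are hypothetical, and the status strings are assumed to be spelled as in this guide.

```python
# (old status, new status) pairs, as listed under Cause for alarm 22501.
RAISE_22501 = {("Available", "Degraded"), ("Unavailable", "Degraded")}
CLEAR_22501 = {("Degraded", "Available"), ("Degraded", "Unavailable")}


def alarm_22501_action(admin_state, old_status, new_status):
    """Return "raise", "clear", or None for one status change (hypothetical helper)."""
    if admin_state != "Enabled":
        # Both transition lists apply only while the application is enabled.
        return None
    change = (old_status, new_status)
    if change in RAISE_22501:
        return "raise"
    if change in CLEAR_22501:
        return "clear"
    return None


print(alarm_22501_action("Enabled", "Available", "Degraded"))
```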


22502 - Peer CNDRA Application Request Message Queue Utilization

Alarm Group: APPL

Description: The Peer CNDRA Application Request Message Queue Utilization is approaching its maximum capacity.

Instance: <Metric ID>, <Peer CNDRA Application Name>



Note:

The value for Metric ID for this alarm varies (such as RxRbarRequestMsgQueue) depending on which Peer CNDRA application generates the alarm (such as RBAR). Use the ID that corresponds to the specific Peer CNDRA application in use.

Note:

The value for Peer CNDRA Application Name varies depending on the Peer CNDRA application generating the alarm (such as RBAR). Use the name that corresponds to the specific Peer CNDRA application in use.

HA Score: Normal

OID: eagleXgDiameterCndraApplicationRequestQueueUtilNotify

Cause: Alarm #22502 is raised:

• When Peer CNDRA Application Request Message Queue Utilization is approaching its maximum capacity.

• If this problem persists and the queue reaches 100% utilization, all new ingress Request messages will be discarded.


1. Examine the alarm log on the active Overseer server.

2. This alarm should not normally occur when no other congestion alarms are asserted.
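The relationship between queue utilization, the alarm, and message discard can be illustrated with a bounded queue. This Python sketch is not the CNDRA implementation: the class RequestQueue is hypothetical, and the alarm threshold is an assumed parameter because this section does not state the utilization level at which the alarm is asserted.

```python
from collections import deque


class RequestQueue:
    """Bounded queue that discards new messages once it is 100% utilized."""

    def __init__(self, capacity, alarm_threshold_pct):
        self.capacity = capacity
        # Assumed parameter: the guide does not give the alarm threshold.
        self.alarm_threshold_pct = alarm_threshold_pct
        self.messages = deque()
        self.discarded = 0

    def utilization_pct(self):
        return 100.0 * len(self.messages) / self.capacity

    def enqueue(self, message):
        """Queue a message, or discard it and return False when the queue is full."""
        if len(self.messages) >= self.capacity:
            self.discarded += 1
            return False
        self.messages.append(message)
        return True

    def alarm_asserted(self):
        return self.utilization_pct() >= self.alarm_threshold_pct


queue = RequestQueue(capacity=4, alarm_threshold_pct=75.0)
for request in ["r1", "r2", "r3", "r4", "r5"]:
    queue.enqueue(request)
print(queue.utilization_pct(), queue.discarded)
```

If the consumer task stops draining the queue, utilization stays at 100% and every further ingress Request is counted as discarded.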

Recovery:

1. Display and monitor the Peer CNDRA application status. Verify the Admin State is set as expected.

The Peer CNDRA application's Request Message Queue Utilization is approaching its maximum capacity. This alarm should not normally occur when no other congestion alarms are asserted.

2. Application Routing might be misconfigured and is sending too much traffic to the Peer CNDRA application. Verify the configuration.

3. If no additional congestion alarms are asserted, the Peer CNDRA application task might be experiencing a problem that is preventing it from processing messages from its Request Message Queue. Examine the Alarm log on the active Overseer server.




22503 - Peer CNDRA Application Answer Message Queue Utilization

Alarm Group: APPL

Description: The Peer CNDRA Application Answer Message Queue Utilization is approaching its maximum capacity.



Note:

The value for Metric ID for this alarm varies (such as RxRbarAnswerMsgQueue) depending on which Peer CNDRA application generates the alarm (such as RBAR). Use the ID that corresponds to the specific Peer CNDRA application in use.

Note:

The value for the Peer CNDRA Application Name varies depending on the Peer CNDRA application generating the alarm (such as RBAR). Use the name that corresponds to the specific Peer CNDRA application in use.

HA Score: Normal

OID: eagleXgDiameterCndraApplicationAnswerQueueUtilNotify

Cause: Alarm #22503 is raised:

• When Peer CNDRA Application Answer Message Queue Utilization is approaching its maximum capacity.

• If this problem persists and the queue reaches 100% utilization, all new ingress Answer messages will be discarded.


1. Examine the alarm log on the active Overseer server.

2. This alarm should not occur when no other congestion alarms are asserted.



Recovery:

1. Application Routing might be misconfigured and is sending too much traffic to the Peer CNDRA application. Verify the configuration.

2. If no additional congestion alarms are asserted, the Peer CNDRA application task might be experiencing a problem that is preventing it from processing messages from its Answer Message Queue. Examine the Alarm log on the active Overseer server.


22504 - Peer CNDRA Application Ingress Message Rate

Alarm Group: APPL

Description: The ingress message rate for the Peer CNDRA application is exceeding its engineered traffic handling capacity.



Note:

The value for Metric ID for this alarm varies (such as RxRbarMsgRate) depending on which Peer CNDRA application generates the alarm (such as RBAR). Use the ID that corresponds to the specific Peer CNDRA application in use.

Note:

The value for Peer CNDRA Application Name varies depending on the Peer CNDRA application generating the alarm (such as RBAR). Use the name that corresponds to the specific Peer CNDRA application in use.

HA Score: Normal

OID: eagleXgDiameterCndraApplicationIngressMsgRateNotify

Cause: Alarm #22504 is raised when the ingress message rate for the Peer CNDRA application is approaching or exceeding its engineered traffic handling capacity.



This alarm is cleared when the Diameter process stops.

Diagnostic Information: For further information regarding this alarm:

2. The average ingress message rate utilization on an MP server of the Peer CNDRA application is exceeding or approaching its engineered traffic handling capacity.

Recovery:

1. Application routing may be misconfigured and is sending too much traffic to the Peer CNDRA application. Verify the configuration.

2. There may be an insufficient number of MPs configured to handle the network load. Monitor the ingress traffic rate of each MP.

3. If MPs are in a congestion state, then the offered load to the server site is exceeding its capacity.


22520 - Peer CNDRA Application Enabled

Event Type: APPL

Description: Peer CNDRA Application Admin state was changed to ‘enabled’.

Severity: Info

HA Score: Normal

OID: eagleXgDiameterCndraApplicationEnabledNotify

Recovery:


22521 - Peer CNDRA Application Disabled

Event Type: APPL

Description: Peer CNDRA Application Admin state was changed to ‘disabled’.



Severity: Info

HA Score: Normal

OID: eagleXgDiameterCndrapplicationDisabledNotify

Recovery:




A CNE Modification for CNDRA Alerting and SNMP Integration

Modification in CNE elements for CNDRA Alerting/SNMP Integration

CNDRA Alerting requires modification to the common CNE elements such as AlertManager and SNMPNotifier.

The required changes to CNE elements are described in detail below:

Alert Manager

• The group_by setting in the alert manager config map.

– The following Alert labels are used in the group_by option, for grouping alerts towards snmp_notifier:

* namespace

* podname

* severity

* instancename

* alertname

– Change the group_by option in the Prometheus-alertmanager configmap as shown in the following example:

kubectl edit configmap occne-prometheus-alertmanager -n occne-infra

apiVersion: v1
data:
  alertmanager.yml: |
    global: { resolve_timeout: 9y }
    receivers:
    - name: default-receiver
      webhook_configs:
      - url: http://occne-snmp-notifier:9464/alerts
    route:
      group_interval: 5m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 3h
      routes:
      - receiver: default-receiver
        group_interval: 1m
        group_wait: 10s
        repeat_interval: 9y
        group_by: [namespace, podname, severity, instancename, alertname]
        match_re:
          oid: ^1.3.6.1.4.1.323.5.3.48.(.*)

Note:

The OID above is used as a match pattern to group alerts for Tekelec products. For CNDRA, the pattern is ^1.3.6.1.4.1.323.5.3.48.(.*), where 48 is the CNDRA product identifier as defined in tklc_toplevel.mib.
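To see what the group_by and match_re settings do to a set of alerts, the following Python sketch imitates the route: it keeps alerts whose oid label matches the pattern and groups them by the five labels. It is a simplified illustration, not Alertmanager code. The function names are hypothetical, and full-string matching of the pattern is assumed. Because the dots in the pattern are unescaped, each one matches any single character, not only a literal dot.

```python
import re
from collections import defaultdict

GROUP_BY = ["namespace", "podname", "severity", "instancename", "alertname"]
# Pattern copied from the match_re setting; full-string matching is assumed.
OID_PATTERN = re.compile(r"^1.3.6.1.4.1.323.5.3.48.(.*)")


def route_matches(labels):
    """Return True when the alert's oid label matches the CNDRA pattern."""
    return OID_PATTERN.fullmatch(labels.get("oid", "")) is not None


def group_alerts(alerts):
    """Group matching alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        if route_matches(labels):
            key = tuple(labels.get(name, "") for name in GROUP_BY)
            groups[key].append(labels)
    return dict(groups)


# Hypothetical label values, for illustration only.
sample = [
    {"namespace": "cndra", "podname": "pod-0", "severity": "major",
     "instancename": "q1", "alertname": "ComAgentQueueUtil",
     "oid": "1.3.6.1.4.1.323.5.3.48.1.2.1"},
    {"namespace": "cndra", "podname": "pod-1", "severity": "major",
     "instancename": "q1", "alertname": "ComAgentQueueUtil",
     "oid": "1.3.6.1.4.1.323.5.3.48.1.2.1"},
]
print(len(group_alerts(sample)))
```

The two sample alerts differ only in podname, so they fall into separate groups and would be sent to snmp_notifier as separate notifications.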

The following new MO-based alerts are introduced in the CNDRA Alert Rules file. The MO information is populated under the instancename label in the alerts.

• ComAgentQueueUtil

• ComAgentAbnormTransEndRate

• SmsQueueUtil

• BDFQueueUtil

Note:

The MO-based alerts specify the MO instance for which the alert has been raised.

SNMP Notifier

• The default severities provided by SNMPNotifier in the CNE installation are critical, warning, and info.

• CNDRA requires support for the following additional severities:

– major

– minor

– clear

• To add these additional severities, modify the snmp_notifier deployment by executing:

kubectl edit deploy <snmp_notifier_deploy_name> -n occne-infra

For example: kubectl edit deploy occne-snmp-notifier -n occne-infra

spec:
  containers:
  - args:
    - --alert.default-severity=critical
    - --alert.severities=critical,warning,info,major,minor,clear
    - --alert.severity-label=severity
    - --log.format=logger:stderr
    - --log.level=error
    - --snmp.destination=10.75.203.120:165
    - --snmp.retries=1
    - --snmp.trap-default-oid=1.3.6.1.4.1.1664.1
    - --snmp.trap-description-template=/etc/snmp_notifier/description-template.tpl
    - --snmp.trap-oid-label=oid
    - --web.listen-address=:9464
    env:
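After editing the deployment, the args list can be checked for the three additional severities. The following Python sketch is one hypothetical way to do that on a list of argument strings copied from the deployment. The function names are illustrative, and the sketch only parses text; it does not query the cluster.

```python
REQUIRED_SEVERITIES = {"major", "minor", "clear"}


def configured_severities(args):
    """Extract the value of --alert.severities from a container args list."""
    prefix = "--alert.severities="
    for arg in args:
        if arg.startswith(prefix):
            return [s.strip() for s in arg[len(prefix):].split(",") if s.strip()]
    return []


def missing_severities(args):
    """Return the severities CNDRA requires that are absent from the args."""
    return sorted(REQUIRED_SEVERITIES - set(configured_severities(args)))


# Args as shown in the example deployment above (subset).
example_args = [
    "--alert.default-severity=critical",
    "--alert.severities=critical,warning,info,major,minor,clear",
    "--alert.severity-label=severity",
]
print(missing_severities(example_args))
```

An empty result means all three additional severities are present. With the default severity list, the function would report clear, major, and minor as missing.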
