Date post: 26-Apr-2018
Hitachi IT Operations Analyzer Best Practices for Analyzing Bottlenecks Application Brief

Executive Summary

IT Operations Analyzer is designed to support the needs of datacenter administrators who have experienced challenges with managing complex IT systems with few support personnel. When performance problems occur, downtime can be particularly costly when administrators are faced with researching problem causes and their solutions, especially without the use of accurate software tools to help with that process.

This document introduces the Bottleneck Analysis (BNA) function of Hitachi IT Operations Analyzer, which identifies a performance issue and its potential causes. The BNA function covers a wide range of performance issues in diverse environments, such as IP-SAN/FC-SAN storage environments and virtual server environments (ESX). Because it uses built-in expert algorithms, there is no need for administrators to have special knowledge or training in order to identify and troubleshoot performance problems.

In order to realize the full potential of BNA when monitoring performance, it is important to understand and follow some basic guidelines for applying performance thresholds. This document describes the types of performance thresholds that can be applied using templates, and it provides use cases for how BNA is determined using different threshold options.

Contents

Background
Overview of Bottleneck Analysis
Function of Bottleneck Analysis
How performance metrics are correlated
Role of performance thresholds
Threshold setting factors
Performance metrics in Analyzer
Best practices for Bottleneck Analysis (BNA) and Resolution
Reviewing BNA results within IT Operations Analyzer
Case1: Performance problem at the FC switch port
Case2: Performance problem at the storage controller
Case3: Performance problem on an ESX server's CPU
Summary


Background

Today, the IT systems of most organizations are larger and more complex than in previous years. An IT system is no longer just business infrastructure; it is critical to the business's value and competitiveness in the marketplace, a key business partner. And, as with any business partnership, downtime has a significant impact on productivity and profit.

The guardian of the IT system, the person who monitors its status and prevents and addresses faults, is the system administrator. Without the right tools, it can be challenging for the administrator to identify core faults. The longer it takes to troubleshoot at-fault components, the greater the downtime.

Many IT systems consist of multiple storage systems, servers, and switches. These components are interrelated: when a fault occurs at one component, many other nodes are affected. Typically, the related components raise alerts that are reported to the administrator. In a complex IT environment, it takes time for the administrator to identify the source of the fault: the root cause. A root cause can have any number of origins, such as a performance problem. To resolve performance problems, the administrator must know the performance metrics that are used by the servers, storage, and network switches, understand the relationships among the failing nodes, and be very familiar with the network topology.

Hitachi IT Operations Analyzer is a software solution that helps administrators identify the root cause of a performance problem (the “performance bottleneck”). Because it monitors all servers, storage, and switches in the IT environment, it can analyze performance problems in real time and accurately identify the candidates for a performance bottleneck: the potential root causes. IT Operations Analyzer identifies and reports performance problems in the IT system, eliminating the manual, time-intensive troubleshooting work by the administrator. With IT Operations Analyzer, performance bottlenecks are identified for the administrator and can be fixed quickly and efficiently.


Overview of Bottleneck Analysis

Function of Bottleneck Analysis

IT Operations Analyzer monitors the performance of IT resources, such as computers, network devices (IP switches, FC switches), and SAN storage subsystems. When a specified performance value exceeds a designated threshold, IT Operations Analyzer reports a performance event. Because the complexity of the monitored environment makes it difficult for an administrator to identify the performance bottleneck, IT Operations Analyzer helps the administrator by identifying the root causes of the problem.

The Bottleneck Analysis ("BNA") function of IT Operations Analyzer has various event correlation rules that work to identify a bottleneck. Because these rules are predefined and embedded in the software, no separate configuration or expert knowledge is needed from the user. These rules automatically adapt to the target IT system configuration, and expand whenever the system configuration changes. So, when multiple performance problems occur simultaneously on two or more IT resources that are affected by a performance bottleneck component, IT Operations Analyzer can pinpoint the causes.

How performance metrics are correlated

IT Operations Analyzer covers performance problems relating to the slowdown of I/O access to IP-SAN/FC-SAN storage, and the slowdown of both virtual and physical servers. The root cause of a slowdown in storage device I/O is attributed to an overflow of the network bandwidth, or to an overload of the client-side machine's CPU.

Figures 1 and 2 illustrate typical performance problems that are supported by IT Operations Analyzer's BNA: a performance problem in an FC-SAN topology, and a performance problem in a virtual machine environment.

Figure 1 Disk performance error caused by an overflow of the network bandwidth on the FC switch port

[Diagram labels: FC-SW port performance error (Pfm ERR); FC-SAN; storage subsystems; HDD performance errors]

Figure 2 Virtual server performance error caused by CPU overload at the host server

Role of performance thresholds

By using Threshold Templates, you can specify warning and error thresholds for your monitored nodes. You can create templates for:

Windows: Specify the warning and error thresholds for Windows server components, such as CPU and memory utilization, and disk space.

Linux: Specify the warning and error thresholds for Linux server components, such as CPU and memory utilization, and disk space.

Solaris: Specify the warning and error thresholds for Solaris server components, such as CPU and memory utilization, and disk space.

ESX: Specify the warning and error thresholds for ESX server components, such as CPU and memory utilization.

Hitachi Storage: Specify the warning and error thresholds for Hitachi storage components (AMS, SMS, and so on), such as the minimum and maximum I/O utilization and response times.

Storage (SMI-S WBEM): For specific non-Hitachi storage, such as an EMC storage system, specify the warning and error thresholds for components, such as the Write Hit Cache Ratio and StoragePool Free Space.

FC Switches: For specific FC switches, such as a Brocade switch, specify warning and error thresholds for components, such as the minimum and maximum speed at which data is received.

IP Switches: For specific IP switches, such as a Cisco switch, specify warning and error thresholds for components, such as packet transmission times.

After creating templates, you can apply them to monitored devices (servers, storage systems, or switches) in IT Operations Analyzer's Monitoring module.
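To make the idea of a template with warning and error thresholds concrete, here is a minimal sketch in Python. It is not the product's data model: the `MetricThreshold` class, the `classify` function, and the threshold values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricThreshold:
    """Warning and error thresholds for one metric (hypothetical structure)."""
    warning: float
    error: float
    higher_is_worse: bool = True  # False for metrics such as free disk space


def classify(value: float, threshold: MetricThreshold) -> str:
    """Return 'error', 'warning', or 'normal' for a measured value."""
    if threshold.higher_is_worse:
        if value > threshold.error:
            return "error"
        if value > threshold.warning:
            return "warning"
    else:
        if value < threshold.error:
            return "error"
        if value < threshold.warning:
            return "warning"
    return "normal"


# A hypothetical template for a Windows server; the values are illustrative only.
windows_template = {
    "CPU Utilization %": MetricThreshold(warning=80.0, error=95.0),
    "Memory Utilization %": MetricThreshold(warning=85.0, error=95.0),
    "Disk Free Space": MetricThreshold(warning=2048.0, error=512.0,
                                       higher_is_worse=False),
}

print(classify(91.0, windows_template["CPU Utilization %"]))
print(classify(300.0, windows_template["Disk Free Space"]))
```

The `higher_is_worse` flag is a design choice of this sketch: it lets one structure cover both utilization-style metrics and free-space metrics, where a low value is the problem.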


Threshold setting factors

When a template is applied to a monitored device and a performance value exceeds the threshold that is defined in the template, a performance event is generated. If the threshold that is specified for a performance metric is inappropriate, BNA may not work correctly. When an appropriate threshold is applied to a customer's IT resources, the BNA result is more accurate. An “appropriate threshold” means that the thresholds that are applied to the metrics in the template suit the node to which the template will be applied. Extreme settings result in false alerts.

The following three guideline types are used to classify the metrics in IT Operations Analyzer. Consider these guidelines when specifying values for threshold metrics.

Guideline Type | Description | Details

A | Thresholds set by an absolute value | Metrics that are classified by this type are defined by using an absolute value (not a percentage) for the threshold. An appropriate threshold must be defined based on the type of node. For example, for an FC switch, the performance limit of “FC Port Sent MBytes/sec” depends on the gigabits-per-second (Gbps) capability of the FC port. For a 1 Gbps FC port, specify a threshold of 50 MBytes/sec.

B | Thresholds set by business requirements | Certain thresholds are determined based on business requirements. For example, a server's performance limit of “Avg Disk msecs/Xfer” (response time) depends on the type of application. If e-mail systems such as Microsoft Exchange or Lotus Notes use the volume, then the response time should be less than 20 msecs. But if rich media (such as streaming video and interactive applets downloaded by the user) uses a logical volume, then the response time should be less than 2 msecs. In this case, the rich media read-ahead process requires a high sequential read hit rate, which depends on good physical read I/O response time. Type A thresholds can be more difficult to determine, and the default thresholds for type P metrics are sometimes not appropriate. We recommend determining type B thresholds from business requirements first, and then correcting type A and type P thresholds in light of the BNA results.

P | Thresholds set by a percentage | Metrics that are classified by this type are defined by using a percentage for the threshold. A suggested value is specified within the default Threshold Template.
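As a worked example of a type A threshold, the sketch below derives an absolute MBytes/sec threshold from an FC port's speed. It rests on two assumptions that are not stated in this brief: that the nominal per-direction throughput of a 1, 2, 4, or 8 Gbps Fibre Channel port is roughly 100, 200, 400, or 800 MBytes/sec, and that the 50 MBytes/sec example corresponds to about half of the nominal throughput. The function name and the `fraction` parameter are hypothetical.

```python
# Approximate nominal per-direction throughput of Fibre Channel ports,
# in MBytes/sec, keyed by port speed in Gbps (assumed values).
NOMINAL_FC_THROUGHPUT_MB = {1: 100, 2: 200, 4: 400, 8: 800}


def fc_port_threshold_mb(speed_gbps: int, fraction: float = 0.5) -> float:
    """Return an absolute threshold in MBytes/sec for an FC port.

    fraction is the share of nominal throughput at which an event should
    be raised; 0.5 is an assumption that reproduces the 50 MBytes/sec
    example for a 1 Gbps port.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return NOMINAL_FC_THROUGHPUT_MB[speed_gbps] * fraction


print(fc_port_threshold_mb(1))  # 50.0
print(fc_port_threshold_mb(4))  # 200.0
```

In practice the fraction should follow from the observed BNA results rather than from a fixed rule, as the recommendation for correcting type A thresholds suggests.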


Performance metrics in Analyzer

Table 1 lists the performance metrics that IT Operations Analyzer supports, and the corresponding guideline that applies to each metric.

Table 1 Supported performance metrics

Node Type | Metric | Unit | Guideline Type | Server OS / Storage Vendor / Switch Type
Server | CPU Utilization % | % | P | Windows, Linux, Solaris, ESX
Server | Management CPU % | % | P | Windows
Server | Memory Utilization % | % | P | Windows, Linux, Solaris, ESX
Server | Disk Read MBytes/sec | MBytes/sec | A | Windows, Linux, ESX
Server | Disk Write MBytes/sec | MBytes/sec | A | (as above)
Server | Avg Disk msecs/Xfer | msec/transfer | B | Windows, Solaris
Server | Avg Disk msecs/sector | msec/sector | B | Linux
Server | IP Pkts Sent/sec | Packets/sec | A | Windows, Linux, Solaris, ESX
Server | IP Pkts Rcvd/sec | Packets/sec | A | (as above)
Server | IP Sent MBytes/sec | MBytes/sec | A | Solaris, ESX
Server | IP Rcvd MBytes/sec | MBytes/sec | A | (as above)
Server | FC Frames Sent/sec | Frame/sec | A | Windows, Linux
Server | FC Frames Rcvd/sec | Frame/sec | A | (as above)
Server | Disk Free Space | MBytes | A | Windows, Linux, Solaris, ESX
Storage | Cntl-0 I/O Response Time | msec/transfer | A | Hitachi
Storage | Cntl-1 I/O Response Time | msec/transfer | A | Hitachi
Storage | Cache Write Pending % | % | P | Hitachi
Storage | Processor Utilization % | % | P | Hitachi
Storage | Cntl-0 I/O Utilization % | % | P | Hitachi
Storage | Cntl-1 I/O Utilization % | % | P | Hitachi
Storage | Cntl-0 Write Cache Hit % | % | P | Hitachi
Storage | Cntl-1 Write Cache Hit % | % | P | Hitachi
Storage | StoragePool Free Space | GByte | A | Hitachi, Non-Hitachi
Storage | Disk Free Space | MBytes | A | Non-Hitachi
Storage | Write Cache Hit % | % | P | Non-Hitachi
Storage | FC Ports IO/sec | transfer/sec | A | Non-Hitachi
Storage | FC Ports MB/sec | MB/sec | A | Non-Hitachi
Storage | I/O Rate of LUN(s) | transfer/sec | A | Non-Hitachi
Storage | I/O Rate of LUN(s) | MB/sec | A | Non-Hitachi
Storage | I/O Rate on Controller | transfer/sec | A | Non-Hitachi
Storage | I/O Rate on Controller | MB/sec | A | Non-Hitachi
Storage | iSCSI Ports IO/sec | transfer/sec | A | Non-Hitachi
Storage | iSCSI Ports MB/sec | MB/sec | A | Non-Hitachi
Switch | FC Port Sent MBytes/sec | MBytes/sec | A | FC
Switch | FC Port Received MBytes/sec | MBytes/sec | A | FC
Switch | FC Port Error Frames/sec | Frame/sec | A | FC
Switch | IP Port Pkts Sent/sec | Packets/sec | A | IP
Switch | IP Port Pkts Rcvd/sec | Packets/sec | A | IP
Switch | IP Port Pkts Error/sec | Packets/sec | A | IP


Table 2 lists performance problems on other IT resources that are caused by a performance bottleneck. The BNA analysis rules are created from the relationships between the metrics described in this table.

In Table 2, the threshold calculation is classified based on the following three patterns:

Bottleneck Metric: P, Influenced Metric: B
Bottleneck Metric: A, Influenced Metric: B
Bottleneck Metric: A, Influenced Metric: A

Troubleshooting checks that administrators can complete are provided in the case examples later in this document, beginning with Case1.

Table 2 Performance issues of other IT resources that are caused by performance metric problems

Node Type | Bottleneck Metric Name | Guideline Type | Influenced Metric Name | Guideline Type
Storage | Cache Write Pending % | P | Avg Disk msecs/Xfer, Avg Disk msecs/sector (on Connected Server) | B
Storage | Processor Utilization % | P | (as above) | (as above)
Storage | Cntl-0 I/O Utilization %, Cntl-1 I/O Utilization % | P | (as above) | (as above)
Storage | Cntl-0 Write Cache Hit %, Cntl-1 Write Cache Hit % | P | (as above) | (as above)
Storage | Write Cache Hit % | P | Avg Disk msecs/Xfer, Avg Disk msecs/sector (on Connected Server) | B
Switch | FC Port Sent MBytes/sec | A | Avg Disk msecs/Xfer, Avg Disk msecs/sector (on Connected Server) | B
Switch | FC Port Received MBytes/sec | A | (as above) | (as above)
Switch | FC Port Error Frames/sec | A | (as above) | (as above)
Switch | IP Port Pkts Error/sec | A | Avg Disk msecs/Xfer, Avg Disk msecs/sector (on Connected Server) | B
Server | CPU Utilization % (on Host OS) | P | CPU Utilization % (on Guest OS) | P
Server | Avg Disk msecs/Xfer (on Host OS) | A | Avg Disk msecs/Xfer (on Guest OS) | B
Server | Avg Disk msecs/sector (on Host OS) | A | Avg Disk msecs/sector (on Guest OS) | B
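The relationship in Table 2 between a bottleneck metric and the metrics it influences can be pictured as a lookup from one to the other. The sketch below is an illustration only and is not the product's BNA algorithm, which this brief does not describe. The `RULES` mapping copies three rows of Table 2; the function name, the event and topology structures, and the node names are hypothetical.

```python
# (node type, bottleneck metric) -> influenced metrics, after Table 2 (subset).
RULES = {
    ("Switch", "FC Port Sent MBytes/sec"): {
        "Avg Disk msecs/Xfer", "Avg Disk msecs/sector"},
    ("Storage", "Cache Write Pending %"): {
        "Avg Disk msecs/Xfer", "Avg Disk msecs/sector"},
    ("Server", "CPU Utilization % (on Host OS)"): {
        "CPU Utilization % (on Guest OS)"},
}


def bottleneck_candidates(events, topology):
    """List bottleneck candidates that could explain the observed events.

    events:   set of (node, metric) pairs that exceeded a threshold.
    topology: dict of node -> (node type, set of nodes that depend on it).
    """
    candidates = []
    for node, (node_type, dependents) in topology.items():
        for (rule_type, metric), influenced in RULES.items():
            if rule_type != node_type:
                continue
            # Dependent nodes that show an event on an influenced metric.
            affected = sorted(
                dep for dep in dependents
                if any((dep, m) in events for m in influenced)
            )
            if affected:
                candidates.append({
                    "node": node,
                    "metric": metric,
                    "bottleneck_event": (node, metric) in events,
                    "affected": affected,
                })
    # Candidates that explain more affected nodes come first.
    candidates.sort(key=lambda c: len(c["affected"]), reverse=True)
    return candidates


events = {
    ("fc-switch-1", "FC Port Sent MBytes/sec"),
    ("server-a", "Avg Disk msecs/Xfer"),
    ("server-b", "Avg Disk msecs/Xfer"),
}
topology = {"fc-switch-1": ("Switch", {"server-a", "server-b", "server-c"})}

for candidate in bottleneck_candidates(events, topology):
    print(candidate)
```

Recording `bottleneck_event` separately from `affected` mirrors the distinction made in the case examples between a bottleneck that raised its own event and one that did not.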


Best practices for Bottleneck Analysis (BNA) and Resolution

In the following sections, the process of identifying and resolving bottlenecks within IT Operations Analyzer is outlined. The solution to the issue might be a change to the system configuration, or a change to the threshold that is used for a particular performance metric. BNA follows the same backend methodology as Root Cause Analysis. To learn more about Root Cause Analysis, please refer to the “Root Cause Analysis White Paper”.

Reviewing BNA results within IT Operations Analyzer

The Monitoring module contains BNA results. From the Monitoring Menu, located on the left panel of the module, the RCA Performance snapshots section lists all past RCA results that were reported, which are indicated by a time stamp, and an Analyzing entry that processes recent BNA items. This is illustrated in Figure 3.

Figure 3 RCA Snapshots

After reviewing RCA Snapshots, administrators should check the topology of the related IT components and their status to confirm the certainty of the result: View the details of a selected snapshot, then click Show Topology, as shown in the following example.

An RCA snapshot includes a list of the top possible contributors for the detected event. Each root cause is ranked by certainty: A root cause that has 5 stars is the most likely to have been the source of the problem, and will be at the top of the list. Entries that have fewer stars are less likely to have contributed to the problem.

This ranking system helps you to focus your troubleshooting efforts on those causes that are most relevant.


Figure 4 Show Topology button

The RCA snapshot topology provides an at-a-glance view of the detected events and the current performance status of the related components. In the Monitoring module, detected events are displayed at the right side of the pane and the statuses of impacted nodes are shown in the topology area, as shown in Figure 5.

Figure 5 Topology View

Details about the node in the Topology View are displayed in the lower, tabbed pane. As shown in Figure 6, the Events tab indicates the number of times that the performance event has occurred. Within the Performance tab, performance trending for each metric can be reviewed.


Figure 6 Event details

Figure 7 Detected Faults panel, undocked

The Detected Faults panel provides information about the node on which the event occurred and all events that were identified. You can move the panel freely by clicking Float. For an analysis of the event and its contributors, click Popup.

When you click Popup, you open a window that provides a detailed view of the event, as shown in the following example. The Overview tab provides a summary of the root cause, describes event impacts on other nodes, and lists event symptoms.


About the Overview and Performance Data Tabs

The Overview tab, shown below, provides details about the root causes of the failure.

Figure 8 Overview tab

Information is grouped as follows:

Root Cause: This area indicates the date and time of the Snapshot occurrence and provides information about the root cause node, root cause component, and the recommended action to take to resolve the issue. If the snapshot relates to a performance item, then the corresponding metrics that were specified for warning and error thresholds are listed. The root cause is ranked using a star indicator, from a half-star to five stars: A total of five yellow (warning event) or red (critical event) stars indicates the greatest certainty that the root cause is the direct source of the issue. As the number of highlighted stars decreases, so does the certainty.

Affected Resource Groups: This area lists nodes by Resource Group and the impact of the failure on those nodes. Visual cues indicate the severity of the impact: Unreachable (red icon with white bolt), Critical (red icon with x), Warning (yellow icon), or Normal status (green icon). Note that if there are no associated Resource Groups, then this portion of the panel is closed.

Symptoms: This group lists each event that occurred on the node, the date and time of the occurrence, and the associated resource group. It also provides a link so that administrators can acknowledge the event.

If the event is performance-based (for example, a threshold was reached, and an error occurred), then an additional Performance Data tab is available, as shown in the following example.


Figure 9 Performance Data tab

The tab is split into two views:

The upper Performance Metrics Comparison panel displays a graphical representation of the comparison of performance trends over a period of time, leading to the error event. If you select an instance that is either Ignored or for which data is Not Collected, then no critical line displays. If the instance is Not Collected, then only the data that was obtained when information was collected is displayed.

The lower Symptoms panel lists all alert messages relating to the event, and the details about each one:

State: This is an indication of whether the event has been reviewed and acknowledged (Ack) or is not acknowledged (Not Ack).

Description: This is the message that IT Operations Analyzer generated when the event occurred.

Date/Time: This indicates the date and time at which the event occurred.

Category: Indicates that the event is related to performance.

Source: This is the node where the event occurred.

Groups: This indicates the Resource Group that is associated with the node where the event occurred.

Device: Indicates the type of device; for example, a server.

Case1: Performance problem at the FC switch port

Figure 10 illustrates a case in which IT Operations Analyzer has detected a performance problem at the FC switch port, which has slowed the response time of the server disk drives within the FC SAN environment.


Figure 10 Disk performance error caused by an overflow of the network bandwidth at the FC switch port

If “FC Switch Port Bottleneck” is indicated in the Description of an RCA result, then click Show Topology to view the network topology. From the Detected Faults panel, click Popup to review the performance details. Table 3 outlines the troubleshooting checks that should be completed for this example performance problem.

Table 3 Recommendations to resolve the “FC Switch Port Bottleneck” with a high load at the switch port

Server Disk Error | Troubleshooting Checks
Appears at almost all affected nodes | The threshold for the switch port is appropriate. Check the following to resolve the performance problem: (1) Check whether the application that is using the affected server's logical disk has an issue. (2) If there is a problem with the application, consider distributing some of the I/O load to another switch port by changing the path settings of the server, switch, and storage.
Does not appear, or appears at a few nodes | The threshold for the switch port may not be appropriate: the threshold setting for the component on which the event was detected is too severe. Apply an appropriate setting. If the issue recurs, use a higher threshold setting for the switch port by updating the threshold template.

When no performance problem event is detected for the switch port, administrators should check the recommendations that are outlined in Table 4.

Table 4 Recommendations to resolve the “FC Switch Port Bottleneck” without a high load at the switch port

Server Disk Error | Troubleshooting Checks
Appears at almost all affected nodes | The threshold for the switch port may not be appropriate. The threshold value that is used to detect the performance problem and generate an event may be too lenient. If this issue recurs, then lower the threshold of the switch port by modifying the threshold template.
Appears at a few nodes | The threshold for the switch port is appropriate.
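Tables 3 and 4 together form a small decision table, which can be sketched as a function. This is an illustration under stated assumptions and is not part of the product: the function name and its parameters are hypothetical, and the `almost_all` cutoff of 0.8 is an arbitrary stand-in, because the brief does not define where "a few nodes" ends and "almost all nodes" begins.

```python
def fc_switch_port_advice(port_event_detected: bool,
                          servers_with_disk_error: int,
                          servers_connected: int,
                          almost_all: float = 0.8) -> str:
    """Map the observations of Tables 3 and 4 to a recommended check.

    almost_all is an assumed cutoff for "almost all affected nodes".
    """
    if servers_connected <= 0:
        raise ValueError("servers_connected must be positive")
    widespread = servers_with_disk_error / servers_connected >= almost_all

    if port_event_detected:
        # Table 3: a performance event was detected at the switch port.
        if widespread:
            return "check application; consider redistributing I/O load"
        return "raise switch port threshold"
    # Table 4: no performance event was detected at the switch port.
    if widespread:
        return "lower switch port threshold"
    return "threshold appropriate"


print(fc_switch_port_advice(True, 9, 10))
print(fc_switch_port_advice(True, 1, 10))
print(fc_switch_port_advice(False, 9, 10))
print(fc_switch_port_advice(False, 1, 10))
```

The same two-by-two shape applies to the storage controller and ESX CPU cases, with the recommended actions taken from their own tables.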


Case2: Performance problem at the storage controller

Figure 11 illustrates a case in which IT Operations Analyzer has detected a performance problem with the storage controller, which has slowed the response time of the server disk drives within the FC/IP SAN environment.

Figure 11 Disk performance error caused by an overflow of the network bandwidth on the storage controller

If “Storage Controller Bottleneck” is indicated in the Description of an RCA result, then click Show Topology to view the network topology. From the Detected Faults panel, click Popup to review the performance details. Table 5 outlines the troubleshooting checks that should be completed for this example performance problem.

Table 5 Recommendations to resolve the “Storage Controller Bottleneck” with a high load at the storage controller

Server Disk Error | Troubleshooting Checks
Appears at almost all affected nodes | The threshold for the controller is appropriate. Check the following to resolve the performance problem: (1) Check whether the application that is using the affected server's logical disk has an issue. (2) If there is a problem with the application, consider distributing some of the I/O load to the other controller by changing the path setting of the storage.
Does not appear, or appears at a few nodes | The threshold for the controller may not be appropriate. The threshold value of the controller may be too strict. If this issue recurs, then increase the threshold of the controller for the storage by updating the threshold template.

When no performance problem event is detected for the controller, administrators should check the recommendations that are outlined in Table 6.

Table 6 Recommendations to resolve the “Storage Controller Bottleneck” without a high load at the storage controller

Server Disk Error | Troubleshooting Checks
Appears at almost all affected nodes | The threshold for the controller may not be appropriate. The threshold value of the controller may be too lenient. If this issue recurs, then lower the threshold of the controller by updating the threshold template.
Appears at a few nodes | The threshold for the controller is appropriate.


Case3: Performance problem on an ESX server's CPU

Figure 12 illustrates a case in which IT Operations Analyzer has detected a performance problem with the ESX server's CPU, which has caused a high load on the guest OS CPU within the virtual server environment.

Figure 12 Virtual server performance error caused by CPU overload at the host server

If “Server CPU Bottleneck (causes slowdown of VM)” is indicated in the Description of an RCA result, then click Show Topology to view the network topology. From the Detected Faults panel, click Popup to review the performance details. Table 7 outlines the troubleshooting checks that should be completed for this example performance problem.

Table 7 Recommendations to resolve “Server CPU Bottleneck” with a high load on the ESX server's CPU

Virtual Server CPU Error | Troubleshooting Checks
Appears at almost all affected nodes | The threshold for the ESX server CPU is appropriate. Check the following to resolve the performance problem: (1) Check whether the application that is using the affected virtual server has an issue. (2) If there is a problem with the application, consider distributing some of the virtual machines to another physical server by changing the virtual machine settings.
Does not appear, or appears at a few nodes | The threshold for the ESX server CPU may be appropriate. The CPU load of the ESX server is high. If the CPU error persists, consider distributing some of the virtual machines to another physical server by changing the virtual machine settings.


When no performance problem event is detected for the ESX server's CPU, administrators should check the recommendations that are outlined in Table 8.

Table 8 Recommendations to resolve “Server CPU Bottleneck” without high load on the ESX server's CPU

Virtual Server CPU Error | Troubleshooting Checks
Appears at almost all affected nodes | The threshold for the ESX server CPU may be appropriate. Check the following to resolve the performance problem: (1) Check whether the application that is using the affected virtual server has an issue. (2) If there is a problem with the application, consider allocating more CPU and memory resources to the virtual machines by changing the virtual machine settings.
Appears at a few nodes | The threshold for the ESX server CPU is appropriate.

Summary

By using the BNA function of Hitachi IT Operations Analyzer, administrators who work in a complex IT environment can quickly identify performance problems with IP-SAN/FC-SAN storage access, the cause of a slowdown in a server's responses, and other performance degradation issues.

