
Complex problem determination cases in real world, Hiroki ...

Date post: 23-Jun-2015
© 2006 IBM Corporation Complex problem cases in real world Hiroki Nakamura DRO, IBM Japan
Transcript
Page 1: Complex problem determination cases in real world, Hiroki ...

© 2006 IBM Corporation

Complex problem cases in real world

Hiroki Nakamura, DRO, IBM Japan

Page 2: Complex problem determination cases in real world, Hiroki ...

DRO IBM Japan

Complex problem cases in real world Unclassified © 2006 IBM Corporation2

Contents

1. Summary
2. Customer Requirements and Impacts (TAT, workload, cost, etc.)
3. Case 1 : Resolved complex case: DB2 calculation error
4. Case 2 : Resolved complex case: MQ connection error
5. Case 3 : Resolved complex case: MQ connection timeout
6. Case 4 : Un-resolved case: WPS/LDAP CPU 100% utilization
7. Case 5 : Un-resolved case: Web application timeout
8. Cause code analysis
9. Cost and product lifetime

Page 3: Complex problem determination cases in real world, Hiroki ...


Summary

The account team's resolution turnaround time (TAT) and workload need to be reduced to improve customer satisfaction.

More effective problem determination (PD) schemes are required:

• Built-in trace/diagnostic code that does not cause a large performance degradation
• Detailed descriptions in the problem fix database, so that it can be searched more easily
• Performance monitoring with a problem detection function
• Easy-to-use core dump inspection for crash and hang cases
• PD enablement (on/off) without restarting the product

Page 4: Complex problem determination cases in real world, Hiroki ...


Customer Impacts and Requirements

Impacts

• A large amount of time is wasted on PD and recreation tests.
• Even if a fix is provided, the account team needs a very long time for regression testing.

Requirements

• Root cause analysis, even for a problem that occurred only once in several years
• Source code investigation, even when few materials are available (for example, no trace)
• A logical scenario of the problem (root cause), so that the fix can be trusted
• Detailed information for customer management
– In mission-critical cases, Japanese companies tend to be very sensitive about quality.
– This applies especially to financial companies, because of strict guidance from the Financial Services Agency.
– Frequent progress reports on problem resolution are required: every three hours, daily, and so on.
– A fix might not be applied if low occurrence and easy recovery can be guaranteed.
• A recurrence test to confirm that the provided solution fixes the problem
• A special build is preferable to a fix pack, because it contains a single fix and needs no long-term regression test.
• A direct communication channel to the laboratory change team that produces the solution

Page 5: Complex problem determination cases in real world, Hiroki ...


Resolved Case 1 : DB2 calculation error

What is the problem? A DB2 problem?

Unexpected results were produced by an SQL calculation.

– No hardware error was detected.
– The problem is 100% reproducible, but the symptoms are different each time.
– The SQL is very large and the data volume is huge.
– The calculation takes about 30 hours.
– A run with application debug code takes about 4 days.
– Not reproducible at IBM.
– Not reproducible with small data.

A double-bit parity error in H/W can lead to a problem

• Frequent parity errors on an FC adapter can lead to a two-bit error.
– A parity error (single bit) can be recovered by retrying the access.
– A two-bit error is neither detectable nor recoverable, and can cause inconsistent behavior.
– The result is data corruption or a wrong calculation.
• Some customers suffered from this problem.
– A communication company
– An insurance company
• The temporary error was not recognized as a severe H/W error, because it is recoverable.
– This caused long-term problem determination of DB2 instead of H/W.
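The detectability argument above can be illustrated with a minimal sketch. It assumes a plain single even-parity bit per word; the slide does not describe the adapter's actual protection scheme, so the word width, the bit positions, and the helper names `parity_bit` and `parity_check_passes` are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1-bits, else 0."""
    return bin(word).count("1") % 2


def parity_check_passes(word: int, stored_parity: int) -> bool:
    """A receiver accepts the word when the recomputed parity matches."""
    return parity_bit(word) == stored_parity


# Hypothetical 8-bit data word and the parity stored alongside it.
original = 0b10110010
stored = parity_bit(original)

# One flipped bit changes the parity, so the check fails and a retry can recover.
one_bit_flip = original ^ 0b00000100
# Two flipped bits restore the parity, so the corrupted word is accepted silently.
two_bit_flip = original ^ 0b00010100

single_detected = not parity_check_passes(one_bit_flip, stored)
double_detected = not parity_check_passes(two_bit_flip, stored)

print(f"single-bit error detected: {single_detected}")
print(f"double-bit error detected: {double_detected}")
```

Under this assumption the corrupted word passes the check unchanged, which is consistent with the slide's point that no hardware error was reported while the results were wrong.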

Page 6: Complex problem determination cases in real world, Hiroki ...


Resolved Case 1 : DB2 calculation error

Situation Chronology

[Figure: timeline (timeframe axis). Events marked on it, listed in extraction order: Incident; Replace FC Adaptor; Add CPUs and memory; No FC Error Found; Service-in; Application Errors Found; HW Temporary Error; Problem Support Request; Critical Situation Process; No problem found in application; FA (76 days); Report to the Customer / Situation Close; Report to the Customer; Confirmed no H/W error.]

TAT/WL: 93 days / 381 person days

Product: Fiber Channel adapter in pSeries

Problems: Inconsistent results were found every time the same SQL was executed with the same data. There are two systems with the same configuration. One system always produces proper results, but the other system produces wrong results.

Frequency: SQL application error happened. Reproducible.

Reasons for long term: Reproduction takes about 30 hours, and 4 days with application debug code, because of the considerably large quantity of SQL and data.

Page 7: Complex problem determination cases in real world, Hiroki ...


Resolved Case 2 : MQ connection error

Who has ownership?

[Figure: the MQ connection path. Components: Host; Routers/Switches; Broad band Ethernet; Routers/Switches; MQ Server. Ownership labels on the figure: IBM, NonIBM, IBM, NonIBM, NonIBM. Callout: "MQ get connection error !!"]

• The initial problem is an MQ get connection error.
• No similar problem was found.
• The MQ connection runs over a long network path.

Page 8: Complex problem determination cases in real world, Hiroki ...


Resolved Case 2 : MQ connection error

Packet capture to analyze the network

[Figure: network diagram with data capture points; the red line is the connection path. Components: Host; Router; L3 F/W, router; Firewall x 2; Intranet; Broad Band Ethernet; Broad Band Ether Router #1 and #2; Back F/W #1 and #2; External F/W #1 and #2; L2SW #1 to #6; L3SW #1 and #2; Media Converter 1A, 1B, 2A, 2B; MQ Server #1 and #2. Link types: Fiber, UTP.]

Callouts on the figure: "Sent reset packet"; "Bug in SYN Defender function".

Page 9: Complex problem determination cases in real world, Hiroki ...


Resolved Case 3 : Connection Timeout

Symptom

[Figure: Web Servers and Gateway Servers connected to the MQ Manager on the APL Server through Send Channel Processes, Receive Channel Processes, and Client Channel Threads. Annotations: a Gateway Server issues an MQGet request every 3 sec; a client channel thread is created according to a request from the Gateway Server; only the channels from the Application Server were disconnected.]

1. A connection timeout happened between the Web server and the Application server.
2. Only the MQ channels from the application server were disconnected.
3. Investigated the MQ log and trace: no error found in MQ.
4. Investigated the AIX trace: MQ threads were waiting for a lock.
5. The lock owner thread was not dispatched for a long time.
6. The connection timeout then occurred.

Page 10: Complex problem determination cases in real world, Hiroki ...


Resolved Case 3 : Connection Timeout

How to determine configuration?

[Figure: two diagrams of how client channel threads P1_T1, P1_T2, P1_T3, P1_T4, ... P1_T99 (the lock owner) are assigned to CPU 1 and CPU 2: "Processor wide CPU assignment" (with K_T and VP layers) and "System wide CPU assignment" (with a VP layer). In both diagrams most threads are in "Lock Wait : Check process" and one thread is in "Sleep". Callouts: "Bottleneck"; "Wait for CPU dispatch for a long time"; "Change Configuration".]

Legend: K_T : Kernel Thread, VP : Virtual Processor, P#_T# : Process ID and Thread ID (Client Channel Thread)

Page 11: Complex problem determination cases in real world, Hiroki ...


Un-resolved Case 4 : CPU utilization 100%

Problem sequence

[Figure: login flow. Components: Web Browser; Integrated Authentication (TAM/WebSEAL on AIX); Portal Server (WAS, WPS, Interceptor on AIX); LDAP (Directory Server, UDB, User Registry on AIX; HACMP Active/Standby). Multiplicity labels: x 3, x 2. Flow labels: Login Request (UID/PW); Authentication; Create Cookie; Transfer Request; Re-Authentication; Retrieve Group Info; Portal Screen.]

[Chart: CPU utilization of Portal #1, Portal #2, LDAP Master, and LDAP Backup, with 100% markers. Time labels: 9:00, 9:08, 10:07-9, 10:37-54, 11:20-13:05.]

Page 12: Complex problem determination cases in real world, Hiroki ...


Un-resolved Case 4 : CPU utilization 100%

Supposed Scenario

[Figure: request/response sequence between the Portal Server (AIX, WAS, WPS, Interceptor: Re-Authentication, Retrieve Group Info, Create Portal) and LDAP (AIX, Directory Server, UDB) on 2006/1/17. Time labels: 9:00, 9:08, 10:07-9, 10:37-54, 11:20-13:05.]

• Normal process: smooth communication (Request/Response).
• Delayed process in the Portal Server: it takes longer to receive a response, which leads to more degradation.
• LDAP replies promptly, and the Portal Server receives the response before it is prepared. The response from LDAP is discarded because of no preparation, and inconsistent requests are left in the Portal Server. (Logic flaw?)
• The Portal Server re-sent the inconsistent requests many times, so the LDAP server was overdriven (CPU 100%). (Logic flaw?)
• Recovered by reboot.

• Based on the solution assurance review.
• No occurrence in the reproduction test on a customer test machine.

Page 13: Complex problem determination cases in real world, Hiroki ...


Un-resolved Case 5 : Application timeout

Configuration

[Figure: system configuration, dated 2006.02.23. Components: PC terminal (IE); Web Server (Other Unix, Servlet Engine, Servlet, Application, CLI Driver, TCP/IP); Hub Server; Appl Server (Application, CLI Driver, TCP/IP); DB Server (AIX, TCP/IP, db2tcpcm, Agents, UDB Engine). Sequence labels: Call; Connect; SQL; Terminate; Return; Timeout. Timeout threshold: 120 sec.]

A connection and termination are performed for each SQL execution, because the application is a host-based legacy application.

• The problem is an application timeout.
• The application is legacy.
• Some components are from other companies.
• It is very difficult to gather data for analysis.

Page 14: Complex problem determination cases in real world, Hiroki ...


Un-resolved Case 5 : Application timeout

Rapid increase of connections

[Chart: DB Server connections over time.]

Symptom
1) The number of connections increased from 5 to 32; after a 16 sec wait it became 46.
2) The number of connections became 0; after a 28 sec wait it became 45.
3) A connection timeout happened in some cases, even without a rapid connection increase.

Time      Connection    Time      Connection
10:21:40  10            12:58:23  5
10:21:41  12            12:58:24  3
10:21:42  12            12:58:25  3
10:21:44  11            12:58:26  3
10:21:45  5             12:58:27  3
10:21:46  8             12:58:29  3
10:21:47  5             12:58:30  3
10:21:48  2             12:58:31  2
10:21:49  5             12:58:32  1
10:21:50  5             12:58:33  0
10:21:51  32            (28 sec wait)
(16 sec wait)           12:59:01  45
10:22:07  46            12:59:02  32
10:22:09  38            12:59:03  26
10:22:10  38            12:59:04  20
10:22:11  38            12:59:06  16
10:22:12  34            12:59:07  11
10:22:13  33            12:59:08  3
10:22:14  30            12:59:09  4
10:22:15  32            12:59:10  2
10:22:16  33            12:59:11  3

DB2 trace and AIX trace were taken, but the problem is not resolved

• The investigation started from DB2, because the other components are owned by other companies.
• Nobody can explain the long connection wait time and the rapid increase of connections.
• DB2 trace caused very heavy CPU utilization, which caused many timeouts.
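The two waits called out in the symptom list can be located mechanically from the connection samples. The following is a minimal sketch, using only a few of the (time, connection count) pairs from the slide; the 10-second threshold and the function name `find_gaps` are assumptions for illustration, not part of the original investigation.

```python
from datetime import datetime

# A subset of the samples from the slide, one list per observation window.
WINDOW_1 = [("10:21:49", 5), ("10:21:50", 5), ("10:21:51", 32),
            ("10:22:07", 46), ("10:22:09", 38)]
WINDOW_2 = [("12:58:32", 1), ("12:58:33", 0), ("12:59:01", 45),
            ("12:59:02", 32)]


def find_gaps(samples, min_gap_sec=10):
    """Return (start, end, gap_sec, count_before, count_after) for each
    pair of consecutive samples at least min_gap_sec apart."""
    gaps = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        start = datetime.strptime(t0, "%H:%M:%S")
        end = datetime.strptime(t1, "%H:%M:%S")
        gap = int((end - start).total_seconds())
        if gap >= min_gap_sec:
            gaps.append((t0, t1, gap, c0, c1))
    return gaps


gaps_1 = find_gaps(WINDOW_1)
gaps_2 = find_gaps(WINDOW_2)

for t0, t1, gap, c0, c1 in gaps_1 + gaps_2:
    print(f"{t0} -> {t1}: {gap} sec without a sample, connections {c0} -> {c1}")
```

A gap in the samples only shows that no sample was recorded during that interval; it does not by itself say whether the database server or another component was stalled.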

Page 15: Complex problem determination cases in real world, Hiroki ...


Severe Problems Causal Analysis (long aged)

N=15, Code=46

Cause code                              Count   Share
Others                                  12      27%
Long PD time by Lab                     8       17%
Difficult reproduction                  7       15%
Side effect by a fix                    3       7%
High error rate                         3       7%
No guidance of important info           3       7%
Low quality                             2       4%
Improper communication with lab         2       4%
Improper communication with customer    2       4%
Need to accelerate resolution           2       4%
Improper problem management             2       4%
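The shares on the chart can be checked against the counts with a short calculation. This is a sketch that takes the counts exactly as they appear on the slide; the variable names are illustrative.

```python
# Cause-code counts as shown on the slide.
CAUSE_CODES = {
    "Others": 12,
    "Long PD time by Lab": 8,
    "Difficult reproduction": 7,
    "Side effect by a fix": 3,
    "High error rate": 3,
    "No guidance of important info": 3,
    "Low quality": 2,
    "Improper communication with lab": 2,
    "Improper communication with customer": 2,
    "Need to accelerate resolution": 2,
    "Improper problem management": 2,
}

total_codes = sum(CAUSE_CODES.values())
shares = {name: 100 * count / total_codes for name, count in CAUSE_CODES.items()}

# Combined share of the two largest named causes (excluding "Others").
named_top_two = shares["Long PD time by Lab"] + shares["Difficult reproduction"]

for name, share in shares.items():
    print(f"{name:40s} {CAUSE_CODES[name]:3d} {share:5.1f}%")
print(f"total codes: {total_codes}")
print(f"long PD time + difficult reproduction: {named_top_two:.1f}%")
```

The counts add up to the 46 codes stated on the slide. Straight rounding gives 26% for "Others", one point below the 27% on the chart; the chart value may have been adjusted so that the labels total 100%, but the slide does not say.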

Page 16: Complex problem determination cases in real world, Hiroki ...


Cost of Poor Quality (may be common understanding)

[Chart: percentage of bugs by phase (Coding, Unit Test, Funct Test, Field Test, Post Release), with one curve for "% defects introduced in this phase" (85% marked) and one for "% defects found in this phase". Labels for "$ cost to repair defect in this phase": $25, $130, $250, $1,000, $14,000.]

Source: Applied Software Measurement, Capers Jones, 1996
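The size of the cost escalation can be made concrete with a small calculation. This sketch assumes that the five dollar figures on the chart map to the five phases in ascending order; the extracted text does not preserve which figure belongs to which phase, so the mapping below is an assumption.

```python
# Assumed mapping of the chart's cost labels to phases (ascending order).
REPAIR_COST_USD = {
    "Coding": 25,
    "Unit Test": 130,
    "Funct Test": 250,
    "Field Test": 1000,
    "Post Release": 14000,
}

baseline = REPAIR_COST_USD["Coding"]
# Cost of repairing a defect in each phase, as a multiple of the coding-phase cost.
multiples = {phase: cost / baseline for phase, cost in REPAIR_COST_USD.items()}

for phase, multiple in multiples.items():
    print(f"{phase:12s} ${REPAIR_COST_USD[phase]:>6,d}  {multiple:6.1f}x")
```

Under that assumption, a defect repaired after release costs several hundred times as much as one repaired during coding, which is the motivation for the built-in diagnostics requested in the summary.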

