incident analysis - procedure and approach

Page 1: incident analysis - procedure and approach

How to walk away from your Outage looking like a HERO

Teresa Dietrich, Vice President Technology
Derek Chang, Director Site Reliability Engineering

Page 2: incident analysis - procedure and approach

Who we are and Why we are here….

Teresa Dietrich – VP of Technical Operations @ WebMD, previously with AOL, @teresadg (Twitter), www.teresadietrich.net

Derek Chang – Director of Site Reliability Engineering aka SRE @WebMD, experience in Development, WebOps and CMS www.derekchang.me

We are passionate about Outages, Process & Procedures, and always making new mistakes!!

Page 3: incident analysis - procedure and approach

About WebMD

• Most Recognized & Trusted Brand of Health Information

• Serves consumers, physicians, other healthcare professionals, employers and health plans.

• 107 million visitors/month on both desktop and mobile platforms

• 2.5 billion page views/month

Page 4: incident analysis - procedure and approach

What is An Outage?

• Service is unavailable to users or to a subset of users
• Service is unable to function as designed and implemented
• Degradation of service to the point the resource is unusable (defined SLAs)

Page 5: incident analysis - procedure and approach

Why do Outages happen?

• Bugs in OS, middleware, and application
• Hardware failure
• Infrastructure failure (Network, SAN)
• Environment failures (Power, Cooling)
• Human Error
• Demand exceeds capacity
• Malicious attacks

Page 6: incident analysis - procedure and approach

How are Outages exacerbated?

• Too long for monitoring to catch the issue
• Monitoring does not catch the issue; humans eventually do
• Too long to alert the appropriate people of the issue
• Too long for people to respond to alerts
• Too long to find the cause or source of the issue
• Too long to resolve the issue
• Lack of communication to internal and external customers
• Multiple failure scenarios
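
None of these factors can be improved until they are measured. Purely as an illustration (the timestamps and field names below are hypothetical, not taken from the incidents later in this deck), an outage timeline can be reduced to the intervals worth tracking:

```python
# Minimal sketch, not from the deck: the timestamps and field names are
# hypothetical.  Given the key moments of an outage timeline, compute how
# long each phase took so "too long to detect/alert/respond/resolve" can
# be answered with numbers instead of impressions.
from datetime import datetime

timeline = {
    "impact_start":  datetime(2012, 1, 25, 14, 2),   # service starts failing
    "detected":      datetime(2012, 1, 25, 14, 41),  # monitoring (or a human) notices
    "responders_on": datetime(2012, 1, 25, 15, 5),   # the right people are engaged
    "cause_found":   datetime(2012, 1, 25, 16, 30),  # source of the issue identified
    "resolved":      datetime(2012, 1, 25, 17, 10),  # service restored
}

def minutes(start, end):
    return (timeline[end] - timeline[start]).total_seconds() / 60

print(f"time to detect:   {minutes('impact_start', 'detected'):.0f} min")
print(f"time to respond:  {minutes('detected', 'responders_on'):.0f} min")
print(f"time to diagnose: {minutes('responders_on', 'cause_found'):.0f} min")
print(f"total duration:   {minutes('impact_start', 'resolved'):.0f} min")
```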

Page 7: incident analysis - procedure and approach

A different way to do a Post Mortem

• Focus on improving processes and systems for the future, not assigning responsibility for the outage.
• Structure, structure, structure!
• Discover, Analyze and Review
• Analysis done by a third-party engineer with DevOps experience @ WebMD.
• Data collected in a prescribed and orderly fashion, using a template.
• Recommendations for improvement owned, assigned and tracked through resolution.
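
The actual template is the one available from www.teresadietrich.net (next two slides). Purely as an illustration of the structure it imposes (the field names below are assumptions, not the template's own), the collected data and the tracked recommendations could be modeled like this:

```python
# Sketch only: these field names are illustrative, not the actual WebMD
# template (download that from www.teresadietrich.net).  The point is that
# every incident is captured in the same prescribed shape and every
# recommendation has a single owner and a tracked status.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Recommendation:
    ref: str            # e.g. "RR01"
    kind: str           # Process, Monitoring, Development request, Ops request...
    review: str         # what the analysis found
    action: str         # what should change
    owner: str          # single person responsible for seeing it through
    status: str = "open"

@dataclass
class IncidentAnalysis:
    title: str
    background: str                                            # scope of impact, products affected, interviews
    timeline: List[str] = field(default_factory=list)
    change_history: List[str] = field(default_factory=list)    # recent builds, changes, maintenance
    log_findings: List[str] = field(default_factory=list)
    monitoring_correlation: List[str] = field(default_factory=list)
    root_cause: str = ""
    recommendations: List[Recommendation] = field(default_factory=list)

    def open_items(self):
        """Recommendations still waiting on their owner."""
        return [r for r in self.recommendations if r.status != "closed"]
```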

Page 8: incident analysis - procedure and approach

Incident Analysis Template 1

You can download the template @ www.teresadietrich.net

Page 9: incident analysis - procedure and approach

Incident Analysis Template 2

You can download the template @ www.teresadietrich.net

Page 10: incident analysis - procedure and approach

Incident 1 – background info

Page 11: incident analysis - procedure and approach

Incident 1 – outage resolution

Page 12: incident analysis - procedure and approach

Incident 1 – timeline analysis

Page 13: incident analysis - procedure and approach

Incident 1 – timeline analysis

Page 14: incident analysis - procedure and approach

Incident 1 – recent application builds, changes and maintenance

Page 15: incident analysis - procedure and approach

Incident 1 – log analysis

Page 16: incident analysis - procedure and approach

Incident 1 – log analysis

Page 17: incident analysis - procedure and approach

Incident 1 – monitoring correlation

Page 18: incident analysis - procedure and approach

Incident 1 – monitoring correlation

Page 19: incident analysis - procedure and approach

Incident 1 – root cause analysis

Page 20: incident analysis - procedure and approach

Incident 1 – root cause analysis

Page 21: incident analysis - procedure and approach

Incident 1 – root cause analysis

The outage was caused by a known Oracle bug (5181800), specific to Oracle version 10.2.0.2.

About LNS: the LNS (log-write network-server) and ARCH (archiver) processes running on the primary database (PHX1) select archived redo logs and send them to the standby database (IAD1), where the RFS (remote file server) background process within the Oracle instance receives the archived redo logs originating from the primary.
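
Not from the original deck: as one illustration of the kind of check the monitoring recommendation on the next slide (RR03) calls for, a small script on the standby could confirm that RFS processes are actually receiving the redo shipped by LNS/ARCH. This sketch assumes the cx_Oracle driver and a read-only monitoring account with access to the V$ views; the connection details are placeholders.

```python
# Illustrative sketch, not WebMD's actual home-grown monitor: poll the
# standby (IAD1) and report whether any RFS process is receiving redo
# shipped from the primary (PHX1).  Assumes the cx_Oracle driver; the DSN
# and credentials below are placeholders.
import cx_Oracle

STANDBY_DSN = "standby-scan.example.com:1521/IAD1"  # placeholder

def rfs_status():
    with cx_Oracle.connect("monitor", "secret", STANDBY_DSN) as conn:
        cur = conn.cursor()
        cur.execute(
            "SELECT process, status, thread#, sequence# "
            "FROM v$managed_standby WHERE process LIKE 'RFS%'"
        )
        return cur.fetchall()

if __name__ == "__main__":
    rows = rfs_status()
    if not rows:
        print("ALERT: no RFS process is receiving redo on the standby")
    for process, status, thread, seq in rows:
        print(f"{process}: {status} (thread {thread}, sequence {seq})")
```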

Page 22: incident analysis - procedure and approach

Incident 1 – review and recommendation

RR01 – Process
Review: No "ON clear" was sent after the outage was cleared; outage update 4 was the last communication.
Recommendation: 1. Better process for outage communication. 2. firstaid NMS (notification management system).

RR03 – Monitoring detection
Review: Inadequate monitoring of the Oracle infrastructure. Oracle currently relies on a home-grown script to monitor the Oracle event queue and send email on errors. The fact that the IAD1 RAC problem (the origin of the control file lock in PHX1) didn't catch our attention made the troubleshooting a more difficult and longer process.
Recommendation: Look to the third-party monitoring tool at hand (e.g. Zenoss) to monitor Oracle components, and implement Oracle Grid Control to provide additional monitoring.

RR04 – Monitoring alert
Review: Inadequate monitoring of the user experience; no alert was sent before or during the outage from Gomez or TrueSight.
Recommendation: Set up alerts from Gomez and TrueSight.

RR05 – Development request
Review: Excessive errors in the application log make it extremely difficult to troubleshoot from the logs, which in turn impacts recovery time: 15,000 errors on 1/25, 28,000 errors on 1/26 and 10,000 errors on 1/27 on a single Tomcat server.
Recommendation: 1. Review the current logging implementation. 2. Clean up the logs. 3. Operations should review the logs and provide a report to engineering regularly (bi-weekly or monthly); a sketch of such a report follows this table.

RR06 – Ops request
Review: Potential log rotation problem on the Tomcat servers (Medscape www backend farm); several logs are only 1 kilobyte in size.
Recommendation: Review/correct the log settings and rotation script.
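
RR05 and RR06 both come down to log hygiene. As a sketch of the recurring error report RR05 asks operations to produce, combined with the RR06 check for suspiciously small rotated logs (the path, file pattern and the "ERROR" marker are assumptions, not the actual Medscape farm layout):

```python
# Sketch of the recurring log report suggested in RR05, plus the RR06
# sanity check for suspiciously small rotated logs.  The path, file names
# and the "ERROR" marker are assumptions, not the actual farm layout.
import glob
import os
from collections import Counter

LOG_GLOB = "/var/log/tomcat/catalina.*.log"   # placeholder path
MIN_ROTATED_SIZE = 4 * 1024                   # flag logs smaller than 4 KB

errors_per_file = Counter()
tiny_logs = []

for path in sorted(glob.glob(LOG_GLOB)):
    size = os.path.getsize(path)
    if size < MIN_ROTATED_SIZE:
        tiny_logs.append((path, size))        # likely a rotation problem (RR06)
    with open(path, errors="replace") as fh:
        errors_per_file[path] = sum(1 for line in fh if "ERROR" in line)

print("Errors per log file (review with engineering bi-weekly or monthly):")
for path, count in errors_per_file.most_common():
    print(f"  {count:>7}  {os.path.basename(path)}")

if tiny_logs:
    print("\nSuspiciously small logs - check rotation settings:")
    for path, size in tiny_logs:
        print(f"  {size:>5} bytes  {os.path.basename(path)}")
```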

Page 23: incident analysis - procedure and approach

Investigation Procedures

Page 24: incident analysis - procedure and approach

Investigation Procedures

Page 25: incident analysis - procedure and approach

Investigation Procedures

Page 26: incident analysis - procedure and approach

Incident 2 – background information

Page 27: incident analysis - procedure and approach

Incident 2 – Timeline analysis and application profiling

Page 28: incident analysis - procedure and approach

Incident 2 – root cause

Page 29: incident analysis - procedure and approach

Incident 2 - resolution

Page 30: incident analysis - procedure and approach

Incident 2 – Resolution rollout

• Research: Further research revealed that JSP compilation metadata is only kept in the JVM when the Tomcat Jasper engine runs in development mode (a sketch of a configuration check follows these bullets).
• Potential business impact: The teams agreed to the solution of turning off development mode under the assumption that there is no business impact – PJSP updates will still function properly.
• POC: A brief POC test showed that non-development mode does reduce the memory footprint (memory usage dropped from 196.2 MB to 61.3 MB and total objects in memory dropped from 2.6 million to 876k), and all PJSP updates were recompiled and ready to serve within moments.
• Deployment: The Zenoss JMX chart showed memory dropping back close to initial consumption (0.2-0.3 GB) after each GC cycle, whereas with development mode the memory inflated to 1 GB within a couple of days, GC could not reclaim the space, and Tomcat needed to be restarted.
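
The deck does not show the configuration change itself. As a sketch of where the setting lives (assuming a stock Tomcat conf/web.xml, where the Jasper JspServlet is registered under the servlet name "jsp" and accepts a "development" init-param that defaults to true), a quick verification script might look like this:

```python
# Illustrative sketch, not the actual rollout script: report whether the
# Jasper engine of a Tomcat instance is still running in development mode
# by reading conf/web.xml.  The default path is an assumption; Jasper
# treats a missing "development" init-param as enabled.
import sys
import xml.etree.ElementTree as ET

def local(tag):
    """Strip any XML namespace so the check works across Tomcat versions."""
    return tag.rsplit("}", 1)[-1]

def jasper_development_mode(web_xml_path):
    root = ET.parse(web_xml_path).getroot()
    for servlet in root.iter():
        if local(servlet.tag) != "servlet":
            continue
        names = [c for c in servlet if local(c.tag) == "servlet-name"]
        if not names or (names[0].text or "").strip() != "jsp":
            continue
        for param in servlet:
            if local(param.tag) != "init-param":
                continue
            fields = {local(c.tag): (c.text or "").strip() for c in param}
            if fields.get("param-name") == "development":
                return fields.get("param-value", "").lower() == "true"
    return True  # param absent: Jasper defaults to development mode

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/usr/local/tomcat/conf/web.xml"
    if jasper_development_mode(path):
        print("WARNING: Jasper development mode is ON; JSP compilation metadata stays in the JVM")
    else:
        print("OK: Jasper development mode is off")
```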

Page 31: incident analysis - procedure and approach

Incident 2 – Resolution rollout

• Fix verification: The fix was applied to the whole farm in production. Since then, the results have been good – no more restarts due to running out of memory, and view-article performance is more than 30% better in TrueSight (avg. 109.5 ms compared to 155.9 ms before).

Page 32: incident analysis - procedure and approach

Incident 2 – review and recommendation

Page 33: incident analysis - procedure and approach

Change people’s reaction to “Post Mortem”

Removing the emotion and blame from the Post Mortem process helps minimize the dread and lack of participation. Standard procedures and templates shape people's expectations and perceptions of the Post Mortem process. With the lead engineer of the investigation having no day-to-day responsibility for the product in question, we can greatly reduce the defensiveness and political stances of those involved.

Page 34: incident analysis - procedure and approach

Ensure the lessons are learned

Publishing the results first to the teams involved, and then to the entire technology organization, helps with education, openness about the process, and accountability for the recommended changes. Take the recommendations, once agreed and approved, and turn them into actionable items: Dev Change Requests, Ops Tickets, Process Updates and Communications, Monitoring Changes. A single person should own turning the recommendations into action items and take responsibility for seeing them through to completion. Don't let them fall by the wayside. During the next outage, try to highlight how the previous lessons improved the response – do your own PR for your process.

Page 35: incident analysis - procedure and approach

Questions

Time permitting

OR

Office hours

Tuesday June 26 @ 1pm

Page 36: incident analysis - procedure and approach

Appendix - Investigation Procedures

1. Collect background information
– Scope of impact
– Information about the product(s) impacted
– Interview personnel involved

2. Initial interpretation
– Type of incident – outage, service degradation
– Expectation from senior management
– Depth and scope of investigation
– Resource planning

Page 37: incident analysis - procedure and approach

Appendix - Investigation Procedures

3. In-depth analysis
– Timeline analysis
– Change analysis
– Log analysis
– Monitoring data correlation

4. Research
– Vendor documentation and white paper
– Architecture review
– Code review and application profiling
– Infrastructure review

5. Resolution and recommendation

