©2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
Andreas Gutzwiller Presales Consultant, Hewlett-Packard (Schweiz)
Closed Loop Incident Process From fault detection to closure
HP Software and Solutions
Closed Loop Incident Process Solution
The CLIP solution is:
– A highly automated fault detection-to-recovery solution
– Focused on end-to-end service availability and performance
– Reducing mean time to recovery and improving mean time between system failures
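The two metrics named above can be made concrete with a small calculation. The following is a minimal sketch, assuming MTTR is total downtime divided by the number of failures and MTBF is total uptime divided by the number of failures; the function name and the sample outage data are hypothetical and not part of the CLIP solution.

```python
from datetime import datetime, timedelta


def mttr_and_mtbf(outages, period_start, period_end):
    """Return (MTTR, MTBF) as timedeltas for a list of (start, end) outages.

    Assumed definitions:
      MTTR = total downtime / number of failures
      MTBF = total uptime   / number of failures
    """
    if not outages:
        raise ValueError("need at least one outage to compute MTTR/MTBF")
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = (period_end - period_start) - downtime
    failures = len(outages)
    return downtime / failures, uptime / failures


# Hypothetical sample data: two outages within a 30-day reporting window.
outages = [
    (datetime(2010, 1, 5, 9, 0), datetime(2010, 1, 5, 11, 0)),     # 2 hours
    (datetime(2010, 1, 20, 14, 0), datetime(2010, 1, 20, 15, 0)),  # 1 hour
]
mttr, mtbf = mttr_and_mtbf(outages, datetime(2010, 1, 1), datetime(2010, 1, 31))
print("MTTR:", mttr)
print("MTBF:", mtbf)
```

Shorter recovery lowers MTTR directly, while fewer failures in the same window raises MTBF.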
ITILv3 Linkage of Event & Incident Management
Neither process can stand alone in today’s IT environments
Event – A change of state or alert that has significance for the management of a Configuration Item (CI) or IT Service.
Incident – Unplanned interruption, or reduction of quality, of an IT service
IT Service – People, processes & technology deliverable that supports a customer’s business processes
Event Management
• Responsible for managing events throughout their lifecycle. Main activity of IT Operations.
• Event -> Filtered/Correlated -> Resolve or forward to Incident -> Close
Incident Management
• Includes any event which disrupts, or could disrupt, a service, whether raised by users or IT staff
• Incident -> Categorize/Prioritize -> Diagnose -> Resolve -> Close
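The two flows above can be sketched as a small routing function plus an ordered list of incident states. This is an illustrative sketch only: the `Event` and `Incident` classes, the `significant` and `fixed_by_ops` flags, and the status names are assumptions made for the example, not ITIL-defined data structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Event:
    ci: str
    message: str
    significant: bool = True     # assumed flag set by filter/correlation rules
    fixed_by_ops: bool = False   # assumed flag: operations resolved it directly


@dataclass
class Incident:
    ci: str
    summary: str
    status: str = "new"


# Incident lifecycle from the slide, in order (status names are assumed).
INCIDENT_STEPS = ["new", "categorized", "diagnosed", "resolved", "closed"]


def handle_event(event: Event) -> Tuple[str, Optional[Incident]]:
    """Event Management: filter, then resolve in operations or forward."""
    if not event.significant:
        return "filtered", None
    if event.fixed_by_ops:
        return "resolved", None
    return "forwarded", Incident(ci=event.ci, summary=event.message)


def advance(incident: Incident) -> Incident:
    """Incident Management: move the incident to its next lifecycle step."""
    position = INCIDENT_STEPS.index(incident.status)
    if position < len(INCIDENT_STEPS) - 1:
        incident.status = INCIDENT_STEPS[position + 1]
    return incident


outcome, incident = handle_event(Event("oracle_db", "SQL slow query performance alert"))
while incident.status != "closed":
    advance(incident)
print(outcome, incident.status)
```

Only events that are significant and cannot be resolved in operations cross over into Incident Management, which is the linkage the slide describes.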
ITIL Areas Involved in CLIP
– Operations Bridge (aka NOC)
• Central coordination point
• Manages various classes of events
• Detects incidents
• Manages routine operational activities
• Reports on the status and performance
• May provide first-level support for those events which generate an incident
“The Service Desk is not typically involved in Event Management … unless the Service Desk and Operations Bridge have been combined”
– Service Desk
• Single central point of contact for all users of IT
• Logs and manages all incidents, service requests and access requests
• Provides interface to all other Service Operation processes and activities
Traditional Incident Management From diagnosis to resolution
Multiple un-integrated systems and data stores, manually coordinated hand-offs → inconsistent troubleshooting, high MTTR
Process flow: (1) Identify service performance degradation → (2) Troubleshoot problem to isolate root cause → (3) Identify actionable condition / changes to be implemented → (4) Create TT/RFC to implement change → (5) Implement and automate change to close RFC → (6) Update CMS (Federated CMDB)
Diagram labels: End User, Help Desk, CMDB, “Fire Storms”
1. Service performance notification
2. Gather data to assign SME
3. Bouncing the incident
4. Ticket is finally assigned to the correct SME
5. Impact analysis and change management
6. Update CMDB - timely & correctly?
SME: Subject Matter Expert
Closed Loop Incident Process solution for ITIL Event and Incident Management
From Fault Detection To Recovery & Closure
ITIL Process: Event Management, Incident Management
Event Generation & Detection
Event Correlation & Business Impact
Incident Submission
Investigation & Diagnosis
Resolution
Recovery & Closure
Event Generation & Detection
Operations bridge console collects events & alerts from servers, networks, apps & 3rd party
Challenge
– Bottom-up alert and event overload
– Lack of qualitative cross-domain “actionable” and causal event data
Solution
– All events come to one place, correlated and enriched against an auto-updated service model
User Example – Events to single console
– End user experience slow
– SQL slow query performance alert
– J2EE DB connection pool issue
Event Correlation & Business Impact
Business services, business impact relationship, and SLAs determined
Challenge
– Struggle to link causal events to top-down end-user experience and business impact
Solution
– Proactive end-user experience monitoring linked to business process and business transaction flow to identify impact on high-revenue-generating services
User Example – Cause from symptoms and impact
– Oracle database is the cause, found by topology-based correlation
– Critical funds transfer business service impacted
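The idea of topology-based correlation in the user example can be illustrated with a toy dependency graph. This is a hedged sketch, not the algorithm used by any HP product: it assumes that an alerting CI is a likely cause when none of the CIs it depends on are also alerting. The CI names and the `depends_on` mapping are hypothetical.

```python
def depends_transitively(depends_on, ci, target, seen=None):
    """Return True if `ci` depends on `target`, directly or indirectly."""
    seen = set() if seen is None else seen
    for dep in depends_on.get(ci, ()):
        if dep == target:
            return True
        if dep not in seen:          # guard against cycles in the topology
            seen.add(dep)
            if depends_transitively(depends_on, dep, target, seen):
                return True
    return False


def find_root_causes(depends_on, alerting):
    """Alerting CIs that do not depend on any other alerting CI."""
    return sorted(
        ci for ci in alerting
        if not any(
            depends_transitively(depends_on, ci, other)
            for other in alerting if other != ci
        )
    )


def impacted_services(depends_on, cause, services):
    """Business services that depend on the cause CI."""
    return sorted(s for s in services if depends_transitively(depends_on, s, cause))


# Hypothetical service model: service -> application -> database.
depends_on = {
    "funds_transfer": ["j2ee_app"],
    "j2ee_app": ["oracle_db"],
    "hr_portal": ["hr_app"],
    "hr_app": ["hr_db"],
}
alerting = {"funds_transfer", "j2ee_app", "oracle_db"}

causes = find_root_causes(depends_on, alerting)
impact = impacted_services(depends_on, causes[0], ["funds_transfer", "hr_portal"])
print(causes, impact)
```

The slow end-user experience and the connection pool alert are treated as symptoms because both sit above the alerting database in the topology.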
Incident Submission
Automatic submission to service desk with annotations and cause area
Challenge
– Quality and enrichment of data
– Siloed, broken service lifecycle
– Duplication of effort wasting time
Solution
– Better collaboration
– Automation and integration of the event-to-incident process lifecycle
User Example – Automatic incident ticket creation
– Ticket visible to ops bridge
– Assignment to subject expert
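Automatic submission with annotations can be sketched with an in-memory stand-in for a service desk. Everything here is an assumption for illustration: the `ServiceDesk` class, the ticket ID format, and the routing table are hypothetical and do not represent the API of HP Service Manager or any other product. The sketch shows two points from the slide: symptoms for the same cause are added to one ticket instead of creating duplicates, and the ticket is routed to a subject expert group.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    ticket_id: str
    cause_ci: str
    impacted_service: str
    assignment_group: str
    annotations: list = field(default_factory=list)


class ServiceDesk:
    """In-memory stand-in for a service desk; not a real product API."""

    def __init__(self, routing):
        self.routing = routing      # assumed mapping: CI type -> expert group
        self.open_by_cause = {}     # cause CI -> open Ticket

    def submit(self, cause_ci, ci_type, impacted_service, symptom):
        """Create a ticket for a new cause, or annotate the existing one."""
        ticket = self.open_by_cause.get(cause_ci)
        if ticket is None:
            ticket = Ticket(
                ticket_id=f"IM{len(self.open_by_cause) + 1:05d}",
                cause_ci=cause_ci,
                impacted_service=impacted_service,
                assignment_group=self.routing.get(ci_type, "service-desk"),
            )
            self.open_by_cause[cause_ci] = ticket
        ticket.annotations.append(symptom)
        return ticket


desk = ServiceDesk(routing={"database": "dba-team"})
first = desk.submit("oracle_db", "database", "funds_transfer",
                    "SQL slow query performance alert")
second = desk.submit("oracle_db", "database", "funds_transfer",
                     "J2EE DB connection pool issue")
print(first.ticket_id, first.assignment_group, first.annotations)
```

Keying open tickets by cause CI is one simple way to avoid the duplication of effort listed under Challenge.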
Investigation & Diagnosis
Problem isolation, SME tools, and KM used to determine root cause
Challenge
– Significant problem resolution time spent on pinpointing the problem in a dynamic, heterogeneous IT universe
– Incident assigned and reassigned to multiple silos
Solution
– Cross-domain data visualization and analysis
User Example – Diving deeper to find root cause
– Expert sees corrupt DB tables
– Finds runbook automation fix in knowledge base
Resolution
Change request with attached runbook automation to repair CIs
Challenge
– Little or no automation leads to increased manual effort, impacting quality and efficiency
Solution
– Expert-created/authorized runbook automation to empower lower-level teams
– Manage change, configuration, and release process
User Example – Processing the change
– Get change request approval
– Use runbook to reindex database tables
Recovery & Closure
Automatically close incident & related incidents, acknowledging related events
Challenge
– Struggle to improve speed of restoration, recovery and closure of incidents and to verify SLA/OLA compliance afterwards
Solution
– Automate all notifications & updates
– Continuously monitor SLA/OLA compliance
User Example – Verify the change worked
– User, DB and connection pool OK
– Ticket and events closed
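The closure step can be sketched as a check followed by a cascade. This is a minimal illustration under stated assumptions: the dictionaries, field names, and status values are hypothetical, and a real integration would call the service desk and event console rather than mutate local data.

```python
def close_if_recovered(incident_id, checks, incidents, events):
    """Close the incident and its related incidents, and acknowledge their
    events, but only once every verification check passes."""
    if not all(checks.values()):
        return False
    to_close = [incident_id, *incidents[incident_id]["related"]]
    for inc_id in to_close:
        incidents[inc_id]["status"] = "closed"
    for event in events:
        if event["incident"] in to_close:
            event["state"] = "acknowledged"
    return True


# Hypothetical data: IM1 is the main ticket, IM2 is related, IM3 is unrelated.
incidents = {
    "IM1": {"status": "open", "related": ["IM2"]},
    "IM2": {"status": "open", "related": []},
    "IM3": {"status": "open", "related": []},
}
events = [
    {"incident": "IM1", "state": "open"},
    {"incident": "IM2", "state": "open"},
    {"incident": "IM3", "state": "open"},
]
checks = {"end_user_response": True, "database": True, "connection_pool": True}

closed = close_if_recovered("IM1", checks, incidents, events)
print(closed, incidents, events)
```

Gating closure on the verification checks mirrors the user example: the ticket and events are closed only after user experience, database, and connection pool are confirmed OK.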
Closed Loop Incident Process Integration Points
Integrated ITIL event and incident management process optimizing MTTR and MTBF
Diagram: Service Desk, Integrated CMDB, Automation and Monitoring, linked by integration points 1 to 5
1. Sharing CIs, topology and state information
2. For creating and updating incidents
3. For updating events
4. Incident-, Problem- and Change-Mgmt
5. Runbook automation to remediate
HP’s Closed Loop Incident Process Solution
Integrated ITIL event and incident management process optimizing MTTR and MTBF
Diagram: Service Manager, UCMDB, Operations Orchestration (with SA, CA, NA, SE, Other) and BSM, linked by integration points 1 to 7. CIs, topology, events and status flow into BSM from Net, Ops, App and Other.
1. CIs, topology, events, status measurements flowing into BSM
2. Sharing events and topology
3. For creating and updating incidents
4. To access Business Impact View for a CI
5. Runbook automation to enrich, diagnose and remediate
6. Sharing CIs and state information
7. Runbook automation to remediate
Closed-Loop Incident Mgmt Process
Incident management from diagnosis to automated resolution
• Key processes (incident, change and configuration) need to be tightly linked
• Seamless process linkage requires tools to be consistently service-oriented
Diagram layers: IT service management, Business service automation, Business service management, Configuration Management System (Federated CMDB)
1. Identify service performance issue
2. Gather data to identify root cause
3. Create RFC to make change
4a. Initiate change
4b. Review, assess, plan and govern change
5a. Implement change
5b. Close change request?
6. Update Configuration Management System
Process flow: (1) Identify service performance degradation → (2) Troubleshoot problem to isolate root cause → (3) Identify changes to be implemented → (4) Create TT/RFC to implement change → (5) Implement and automate change to close RFC → (6) Update CMS (Federated CMDB)
Closed Loop Incident Process Key Benefits
Drive innovation value of IT
Cost
• Drive efficiency through automation
• Optimize service lifecycle process efficiency
72% lower maintenance cost
Quality
• Eliminate error-prone manual tasks
• Predict and prevent negative business impact
2.5x increased availability and performance
Transparency
• The cost/value ratio of delivered services is understood by the business
• Any service from everywhere
99.5% availability via integrated delivery
Agility
• Saved labor can be spent on innovation
• Measure and optimize time to develop and successfully deploy new services
30% faster time to market for new apps
Business risk
• Reduce risk of failure when deploying changes
• Enable compliance
70% fewer bad changes