Date post: | 29-Aug-2014 |
Category: |
Technology |
Upload: | andrew-white |
Follow Us: #ITSMSummit!
Reason #114 For Learning Math: Using Analytics to Improve Service Assurance
Mr. White has fifteen years of experience designing and managing the deployment of Systems Monitoring and Event Management software. Prior to joining IBM, Mr. White held various positions, including leading the Monitoring and Event Management organization of a Fortune 100 company and developing solutions as a consultant for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US Navy Facilities and Engineering Command.
Andrew White Cloud and Smarter Infrastructure Solution Specialist IBM Corporation
http://weheartit.com/entry/12433848
Ground rules for this session…
• If you can’t tell if I am trying to be funny… GO AHEAD AND LAUGH!
• Feel free to text, tweet, yammer, or whatever to share with the rest of the attendees
• If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.
I am here today to share some of what I have learned about
CIOs turn to innovative technologies to deliver better outcomes
Cloud & Optimized Workloads
§ Agile provisioning
§ Elastic compute power
§ Scalable storage resources
§ Intelligent services
Mobile Enterprise
§ Hybrid mobile app development
§ Multi-channel integration
§ Device management
§ Workloads on the move
Security Intelligence
§ People & identity
§ Data & information
§ Application security
§ Security analytics
Big Data Analytics
§ Analyze an enormous variety of information sources
§ Real-time insights & actions on streaming data
IBM CIO Study (2012)
Why is problem solving hard?
Non-transparency (lack of clarity of the situation)
• commencement opacity
• continuation opacity
Polytely (multiple goals)
• inexpressiveness
• opposition
• transience
Complexity (large numbers of items, interrelations, and decisions)
• enumerability
• connectivity (hierarchy relation, communication relation, allocation relation)
• heterogeneity
Dynamics (time considerations)
• temporal constraints
• temporal sensitivity
• phase effects
• dynamic unpredictability
Problem Cycle
Evaluation
Recognition
Observation
Analysis
Solution
Validation
Control
Predictive Modeling Timeline
Point of Observation
Past Behavior
• The observation period used to feed the forecasting models
Future Behavior
• The performance period the model is trying to predict
Predictive models harness the information lost in past data so you can discretely identify situations and react to them quickly.
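The timeline above can be sketched in a few lines of Python. This is only an illustration: the function name, the window lengths, and the hourly sample data are all hypothetical and do not come from the presentation.

```python
from datetime import datetime, timedelta

def split_at_observation_point(samples, point, observation, performance):
    """Split (timestamp, value) samples into the two windows on the timeline.

    past   = observation period, used to feed the forecasting model
    future = performance period, the span the model tries to predict
    """
    past = [(t, v) for t, v in samples if point - observation <= t < point]
    future = [(t, v) for t, v in samples if point <= t < point + performance]
    return past, future

# Hypothetical metric sampled once an hour for two days.
start = datetime(2014, 8, 1)
samples = [(start + timedelta(hours=h), float(h)) for h in range(48)]

# Point of observation: hour 36. Look back 24 hours, predict the next 6.
point = start + timedelta(hours=36)
past, future = split_at_observation_point(
    samples, point, timedelta(hours=24), timedelta(hours=6)
)
print(len(past), len(future))
```

Keeping the two windows strictly on either side of the point of observation is what stops the model from being trained on the period it is supposed to predict.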
Analytics 1.0
In the early days, we were just happy to know if the network was up or down.
We suffered from event floods and the perpetually red event console.
Analytics 2.0
Eventually the technology allowed us to correlate based on topology and filter unnecessary events.
Dashboards were all the rage and were measured in data per square inch.
Evolution of Analytics
(Figure axes: Difficulty, Value)
Descriptive Analytics: What Happened?
Diagnostic Analytics: Why Did It Happen?
Predictive Analytics: What Will Happen?
Prescriptive Analytics: How Do We Make It Happen?
Adapted from Gartner
First…
… we need to talk a little bit about your brain
The Triune Brain
Reptilian Brain (basal ganglia)
Mammalian Brain (limbic system)
Cognitive Brain (neocortex)
Our Thought Process
Stimulus
Perception (via the senses)***
Limbic Center (hippocampus and amygdala)
Cortex (hippocampus and amygdala)
Pre-Frontal Cortex (hippocampus and amygdala)
Cognition
Conscious Choice (via motor centers)
Most primitive, seat of unconscious
Long-term memory
Conscious, meaning, choice
*** not very reliable
Short Term Memory
Your Brain Working Memory Understanding Judgement Relationship
Short-term memory is where the real work of sense-making takes place
Short-term memory has a limited amount of space (The estimate is 7 ± 2)
(Figure axes: Time, Quantity. Label: Information the brain can consume)
Information is cheap. Understanding is expensive. -Karl Fast, Professor of UX Design, Kent State University
From Data to Wisdom
Data
• Symbols • Metrics • Facts
Information
• Patterns • Comparisons • Organization
Knowledge
• Trends • Generalizations • Beliefs
Intelligence
• Decisions • Skill • Adaptation
Wisdom
• Accountability • Foresight • Synthesis
(Figure labels: Correlation, Analysis, Application, Understanding, Complexity, Context, Communication, Repetition)
(Figure: plot of y against x)
$y_i = \alpha_0 + \alpha_i x_i + \varepsilon_i$
Data
Information
Knowledge
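To make the step from raw points to a fitted relationship concrete, here is a minimal sketch of an ordinary least-squares line fit in plain Python. It assumes a single constant slope and intercept for all observations (a simplification, since the subscripts on the slide are ambiguous), and the sample data and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns the intercept, the slope, and the residuals (the epsilon terms).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return intercept, slope, residuals

# Hypothetical observations (the "data").
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope, residuals = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The points are the data, the fitted line summarizes the pattern in them, and the equation is what lets you say something about an x you have not yet observed.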
Past Future
Abstract Tangible
Information Intelligence Knowledge Wisdom Data
Knowledge is the point of transition
Why Knowledge?
All You Need
Love
Models of Reasoning
• Inductive
– Starts with Data Available
– Concludes with Possible Hypotheses
– Bottom Up “Data Driven Approach”
• Deductive
– Starts with Theoretical Framework
– Concludes with Logical Deductions
– Theory Driven Approach
(Figure labels: Data, Interpretation, Theory Development, Hypothesis Testing, Hypothesis, Theory)
Two Types of Decision Making
Programmed Decisions
– Routine
– Repetitive
– Well-Structured
– Predetermined Decision Rules
Non-Programmed Decisions
– Unique
– Presence of Risk
– Presence of Uncertainty
– Black Swans
How To Improve Decision Making
• Programmed Decision Making
– Collect evidence
– Identify the problem
– Select a solution
– Implement and evaluate the outcome
• Non-Programmed Decision Making
– Narrow evidence down to the ideal level
– Apply heuristics to limit the impact of cognitive bias
– Present options to a human for a decision
Four Sources of Bad Decisions
• Failure to frame the problem correctly
• Poor use of evidence
• Faulty decision making process
• No feedback for improvement
Common Logical Fallacies
• Appeals to Authority – where you rely on an expert source to form the basis of your argument
• False Inductions – where you infer a causal relationship where none is evident
• Reification – when you take a hypothesis or potential theory and present it as a known truth
• The Slippery Slope – when you base an argument on the thinking that once one action is taken, it will trigger a sequence of events that will result in the direst of consequences
• The Band Wagon – when you present an argument as true on the basis of its popularity
• The False Dichotomy – when you provide only two options and force a choice to be made
• The Straw Man – when you create a false argument and refute it, implying that the counter argument is true
• Observational Selection – when you draw attention to the positive aspects of an idea and ignore the negatives
• Statistics of Small Numbers – when you take one (or a very small sample) and use it to draw a general conclusion
The problem is not that there are no silver bullets… the problem is that there are no werewolves. - Jim Tussing, CTO, Nationwide Insurance
Global Warming and Inflation
Inflation
Global warming
Hidden Factors
Hidden Factor
Smoking Lung Cancer
Boyd’s Loop
Observe, Orient, Decide, Act
Observe: Observation (Outside Information, Unfolding Circumstances, Unfolding Interaction With Environment)
Orient: Cultural Norms, Cognitive Abilities, Knowledge Life Cycle, Prior Wisdom, New Information
Decide: Decision (Hypothesis)
Act: Action (Test)
Links: Implicit Guidance & Control, Feed Forward, Feedback
• Note how orientation shapes observation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.
• Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.
From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
Where the Breakdown Occurs
Observe, Orient, Decide, Act
Situational Awareness
– Level 1: Perception of Elements in Current Situation
– Level 2: Comprehension of Current Situation
– Level 3: Projection of Future Status
Decision
Performance of Actions
Current State
Feedback
Systemic Influences
• System Capability • Interface Design • Stress & Workload • Complexity • Automation
Individual Influences
• Goals & Objectives • Preconceptions • Expectations
• Abilities • Experience • Training
Long Term Memory, Automaticity, Cognitive Processes
Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
Sometimes We Miss What is Going On
Say… what’s a mountain goat doing all the way up here in these clouds?
Rare Events
“one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. - Sir Ronald A. Fisher
© Aquire Inc. 2012
The Gaussian Bell Curve
Mean ± 1σ: about 68%
Mean ± 2σ: about 95%
Mean ± 3σ: about 99.7%
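The share of a normal distribution that falls inside each band can be checked with the error function in the Python standard library. This is a small sketch; the function name `coverage_within` is just a label chosen here.

```python
import math

def coverage_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    inside = coverage_within(k)
    # Whatever is left over lies in the two tails combined.
    print(f"within {k} sigma: {inside:.2%}   outside: {1 - inside:.2%}")
```

The figure to watch is the one outside the band: if a metric really were Gaussian, only about 0.27% of samples would land beyond three standard deviations. Real operational data often has fatter tails than that, which is the point of the Taleb quote.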
The trick is not to spend our time trying to get better at predicting this world, or making it more predictable, for both of these strategies are bound to fail. - Nassim Nicholas Taleb, Author and Philosopher
Normative Decision Making Model
• Limited Information Collection
– 7 +/- 2
– Tendency to acquire manageable rather than optimal amounts of information
– Difficulty identifying all possible options
• Judgmental Heuristics
– Judgmental heuristics - rules of thumb or shortcuts that people use to reduce information processing demands
– Availability heuristic - tendency to base decisions on information readily available in memory
– Representativeness heuristic - tendency to assess the likelihood of an event occurring based on impressions about similar occurrences
• Satisficing
– Choosing a solution that meets a minimum standard of acceptance
The Analytics Focus…
In addition to handling monitoring and performance alerts, it helps drive improved availability.
The Formula:
1. Continually collect, categorize, and analyze all events from as many sources as possible
2. Correlate events and analyze them using previous outages as patterns to identify situations worth investigating
3. Notify a support team so the situation can be mitigated before becoming an outage
4. Automate responses that have well-established situational fingerprints and proven resolution steps
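As a rough illustration of how the four steps fit together, here is a minimal Python sketch. Everything in it is hypothetical: the `Event` and `Fingerprint` types, the event kinds, and the runbook name are invented for the example and do not describe any particular product.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Event:
    source: str   # where the event was collected from (step 1)
    kind: str     # category assigned to the event (step 1)

@dataclass(frozen=True)
class Fingerprint:
    name: str
    kinds: frozenset               # event kinds seen together in a previous outage
    runbook: Optional[str] = None  # proven resolution step, if one exists

def evaluate(events, fingerprints):
    """Match collected events against known fingerprints and pick a response."""
    seen = {e.kind for e in events}
    actions = []
    for fp in fingerprints:
        if fp.kinds <= seen:                      # step 2: situation recognized
            if fp.runbook:
                actions.append(("automate", fp.name, fp.runbook))  # step 4
            else:
                actions.append(("notify", fp.name, None))          # step 3
    return actions

fingerprints = [
    Fingerprint("back-end stall", frozenset({"abend", "flow_interrupted"})),
    Fingerprint("log volume full", frozenset({"disk_full"}), runbook="rotate_logs"),
]
events = [
    Event("cics", "abend"),
    Event("mq", "flow_interrupted"),
    Event("web", "synthetic_failure"),
]
print(evaluate(events, fingerprints))
```

Only fingerprints with a proven resolution step are automated; everything else goes to a human, which keeps the automation limited to situations that are already well understood.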
Most Common Modeling Tasks
• Classification: predicting an item class, “decision tree”
• Clustering: finding natural groups or clusters in data
• Association: finding things that occur together
• Deviation: finding changes or outliers
• Estimation: predicting values
• Linkage: finding relationships among actors
• Mining: extracting information from data
Types of Analytical Algorithms
Algorithm | Description
Decision Tree | Calculating the odds of an outcome
Association Rules | Identifying the relationships between elements
Naïve Bayes | Clearly showing the differences in a particular variable
Sequence Clustering | Grouping data based on a sequence of events
Time Series | Analyze and forecast time-based data
Neural Networks | Seek to uncover non-intuitive relationships in data
Text Mining | Analyze unstructured text data looking for context and meaning
Linear Regression | Determine the relationship between columns to predict an outcome
Logistic Regression | Evaluate the relationship between columns in order to evaluate the probability that a column will contain a specific state
Questions Answered by Analytics
Business Question | Method
What is the best that can happen? | Optimization
What will happen next? | Predictive
What if this trend continues? | Predictive/Forecasting
Why is this happening? | Variance analysis/Root Cause
Is some action needed? | Alerts
Where is the problem? | Query/Drill Down
How many, how often, when? | Ad hoc reports
What happened? | Standard reports
(Figure axis: Value)
Understanding what is already known but has not been shown.
Incident Life Cycle
Down Time: Detection Time, Response Time, Repair Time, Recovery Time
Milestones: Outage, Detection, Diagnosis, Repair, Recover, Restore
Observe, Orient, Decide, Act
Anatomy of an Outage
Components: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, zOS CICS, WAS, Database, WAS Database, zOS MQ, DB2
1. 5:45-ish pm: CICS ABENDs start flooding the console but not high enough to ticket
2. 6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
3. 6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident
4. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5. 10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
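The timestamps in this outage can be turned into elapsed times with a few lines of Python. The times are the ones given on the slide (the first two are approximate, "-ish"); the short labels and the helper `minutes_between` are shorthand introduced here.

```python
from datetime import datetime

def minutes_between(start, end):
    """Whole minutes between two same-day 24-hour clock times such as '17:45'."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timeline from the slide, converted to 24-hour time.
timeline = [
    ("CICS ABENDs start", "17:45"),
    ("MQ flows interrupted", "18:00"),
    ("Synthetic transactions fail", "18:04"),
    ("P0 incident created", "18:14"),
    ("Back-end problem identified", "18:54"),
    ("Decision to reset CICS", "22:29"),
]

# Elapsed minutes between each pair of consecutive milestones.
gaps = [
    (a[0], b[0], minutes_between(a[1], b[1]))
    for a, b in zip(timeline, timeline[1:])
]
for start_label, end_label, minutes in gaps:
    print(f"{start_label} -> {end_label}: {minutes} min")

print("First symptom to incident:", minutes_between("17:45", "18:14"), "min")
print("First symptom to reset decision:", minutes_between("17:45", "22:29"), "min")
```

Roughly half an hour passed between the first symptom and a declared incident, and more than four hours between the incident and the decision to reset the component that produced that first symptom.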
http://www.ithakabound.com/wp-content/uploads/2010/02/DC-Snow-men-pushing-car.jpg
Why did this happen?
The Problem
If there is no ‘early detection’ before the outage, operations teams can only react while the outage is already in effect and already costing money...
Why aren’t operations teams preventative today?
§ Too much data to analyze manually
§ Existing analytic techniques, such as standard thresholds, are not up to the task
§ They cannot detect problems while they are emerging (before business impact)
§ Set threshold too high, insufficient warning before total failure.
§ Set threshold too low, too much noise, everything is ignored
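The threshold dilemma is easy to see on a small example. The sketch below compares a static threshold with a simple rolling baseline (mean and standard deviation over the previous samples). The rolling baseline is just one common alternative chosen for illustration, not a technique named in the presentation, and the metric values, window size, and multiplier are all made up.

```python
from statistics import mean, stdev

def static_alerts(values, threshold):
    """Indices where the metric exceeds a fixed threshold."""
    return [i for i, v in enumerate(values) if v > threshold]

def baseline_alerts(values, window, k):
    """Indices where a value deviates more than k standard deviations
    from the mean of the preceding `window` samples."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts

# Hypothetical metric: steady around 50, then climbing toward failure.
values = [50, 52, 49, 51, 50, 48, 51, 50, 52, 49, 62, 66, 71, 78, 95]

high = static_alerts(values, 90)       # threshold set too high
low = static_alerts(values, 51)        # threshold set too low
dynamic = baseline_alerts(values, window=8, k=3)

print("static > 90:", high)
print("static > 51:", low)
print("rolling baseline:", dynamic)
```

With this data the high threshold fires only at the last sample, the low threshold also fires on normal jitter, and the rolling baseline first fires at index 10, when the climb begins. Note that the baseline absorbs the climb as it goes, so a production detector would need more care than this sketch shows.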
Processing Streams
Situational Awareness Engine
Real-Time Event Streams
Patterns from Historical Data
Causal Relationship from Past RCAs
Detected and Predicted Situations
Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795
Complex Event Processing
Event Pipeline
Event Queries
Time Window
Data Events
Control Event
Other Events
Event Filter
Scenarios
A
B
C
Feedback Loop
Event Intelligence
Action Events
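A toy version of the filter, time window, and scenario pieces of this pipeline might look like the following. It is a sketch only: the class name, the event shape, and the "three ABENDs in 60 seconds" rule are invented for illustration and are not taken from any CEP product.

```python
from collections import deque

class SlidingWindowScenario:
    """Emit an action event when `count` matching events arrive
    within `window_seconds` of each other."""

    def __init__(self, predicate, window_seconds, count, action):
        self.predicate = predicate          # event filter
        self.window_seconds = window_seconds
        self.count = count
        self.action = action
        self._times = deque()               # timestamps inside the window

    def feed(self, timestamp, event):
        if not self.predicate(event):
            return None                     # filtered out
        self._times.append(timestamp)
        # Drop timestamps that have fallen out of the time window.
        while self._times and timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.count:
            self._times.clear()             # do not re-fire on the same burst
            return {"type": "action", "name": self.action, "at": timestamp}
        return None

scenario = SlidingWindowScenario(
    predicate=lambda e: e["kind"] == "abend",
    window_seconds=60,
    count=3,
    action="open_ticket",
)

# Hypothetical stream of (seconds, event) pairs.
stream = [
    (0, {"kind": "abend"}),
    (10, {"kind": "login"}),
    (20, {"kind": "abend"}),
    (400, {"kind": "abend"}),
    (410, {"kind": "abend"}),
    (420, {"kind": "abend"}),
]
actions = [a for t, e in stream if (a := scenario.feed(t, e)) is not None]
print(actions)
```

The two early ABENDs never trigger anything because they age out of the window; only the burst of three within 60 seconds produces an action event.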
One Integrated Environment
Distributed Database Mainframe Network Middleware Storage
Event Pool
Operational Data Warehouse
Predictive
Enrichment & Correlation
Service Desk Paging
CMDB
Knowledge
Asset Mgmt
Event Catalog
Event API
Business Telemetry
3rd Party Providers
Presentation Framework
Integrate Your Processes
Presentation Framework
Asset Management & Topology Database
Aggregation and Analysis
Security Management
Availability Management
Configuration Management
Change Management
Performance Management
Enterprise Data Sources
Business Telemetry
Information
Configuration Discrepancies
Enrichment Data Business Activity Data
Historical Data
“Enriched” Events
Change Activity
Topology Snapshots
Trend-Related Faults
Discovered Problems
Status Indications
Incidents
Audit Information and Suspicious Activity
Enrichment Data Business Activity Data
Automated Discovery
Automated Action
Notification and Escalation
Business Impact Analysis
Root Cause Analysis
Correlation and Event Suppression
Enrichment
Distributed Collectors
Distributed Collectors
LOB Managed Monitoring System
Service Provider Managed Monitoring System
Vendor Managed Monitoring System
Element Manager
Element Manager
Element Manager
Service Center, Yammer, CMDB, CVOL, APM, KM Entries, Triage, xMatters
Visualization Framework
Common Event Format
Topology And Relationship Database
Automated Action Tools
Distributed Collectors
Automated Provisioning System
Predictive Analysis
Automated Change Reconciliation
Security Management
Archive and Report
Business Telemetry Data
Service Center and Enterprise Notification Tool
Meta-Data Integration Bus
Predictive Outage Avoidance (Predict)
Ensure availability of applications and services
• Use learning tools to augment custom best practices
• Leverage statistical methods to maximize predictive warning
• Improve problem detection across IT silos
Faster Problem Resolution (Resolve)
Find & correct problems faster with tools that determine actions required to resolve issues
• Identify problems quicker with insight to large unstructured repositories
• Isolate problems quicker by bringing relevant unstructured data into problem investigations
• Repair problems quicker with the right details quickly to hand.
Optimized Performance (Perform)
Track, Optimize, and Predict capacity and performance needs over time
• Track capacity and performance of applications and services in classic and cloud environments
• Optimize resource deployment with what-if and best fit planning tools
• Escalate capacity and performance problems before they cause critical failures
Improved Insight (Know)
Enhance visibility into systems resource relationships while increasing customer satisfaction
• Determine what resources are interdependent to assess impact of failures
• Gain insight into what is important to your customer
• Decrease customer churn and acquisition costs while increasing customer retention and satisfaction
Automated Analytics helps lower IT Administration Costs:
• Performance and Capacity planning tools monitor appropriately and escalate, reducing time consuming report browsing
• Learning tools reduce customization and best practices investment on initial deployment
• Log Analysis helps speed problem resolution to be able to do more with less
Let’s keep the conversation going…
ReverendDrew
SystemsManagementZen.Wordpress.com
systemsmanagementzen.wordpress.com/feed/
@SystemsMgmtZen
ReverendDrew
614-306-3434