Post on 03-Apr-2018
7/28/2019 Occam's Razor - An Introduction to Holistic Troubleshooting
2011 IBM Corporation
ID902 Occam's Razor: An Introduction to Holistic Troubleshooting
Wes Morgan, Senior Software Engineer
Agenda
Why are we here?
Increasingly Complex Architectures
Specialization within IT/IS
Command and Control Issues
Consequences of "Fix it NOW!"
The Holistic Approach
Occam's Razor
Preparation
Understanding Your Deployment
Knowing Your Routine
Knowing Your Limits
Execution
Ask Your Neighbors
Identify/Refine Your Target
Problem vs. Routine
Client, Server or Both?
Recent Changes
Lather, Rinse, Repeat...
Questions & Answers
Why Are We Here? Complex Architectures
Fault Tolerance/Redundancy
Load Balancers
Firewalls
Intranet/Extranet
Virtualization
Why Are We Here? IT/IS Specialization
"We don't handle that"
Different team
Communication often rare and/or difficult
Simple questions answered slowly
No one really sees the big picture
Why Are We Here? Command and Control
"We can't do that until the next window"
Change Control != everyone informed
Software integration demands team integration as well
Multiple vendors/contractors may be involved
Why Are We Here? "Fix It NOW" Consequences
Panic mode
Time-to-resolution sometimes faces arbitrary limits
All hands on deck
Overall technical guidance lacking
Troubleshooting becomes scattershot
The Holistic Approach: Occam's Razor
Pluralitas non est ponenda sine necessitate.
Plurality should not be posited without necessity.
William of Ockham, c. 1285-1349
Close relatives:
When two theories explain the same phenomenon, choose the simpler
"...admit no more causes... than such as are both true and sufficient..." (Newton)
KISS: Keep It Simple, Stupid
Why Use Occam's Razor?
Multiple simultaneous failures are highly unlikely
Far more likely that one root failure triggered additional problems
Playing "it could be..." introduces complexity and (probably) politics
Don't chase rabbits!
Preparation: Understand Your Deployment
It's far more than just your stuff
Hardware (or lack thereof!)
Operating System
Network (within the data center)
Network (long haul/extranet/VPN)
Dependencies (directory, SAN)
Special-purpose devices (firewalls/proxies/reverse-proxies)
Network appliances
KNOW YOUR DATA PATH!
Preparation: Know Your Routine
Profile your systems!
perfpmr (AIX), perfmon (Windows), iostat/vmstat (Linux)
Understand what normal looks like
Be sure to profile peak time too!
Logins/sessions per day
User patterns (e.g. Accounting end-of-month)
Domino platform statistics can be VERY useful
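A minimal sketch of what a "routine" profile could look like once samples have been collected. The sample values and the `summarize` helper are hypothetical; in practice the numbers would come from perfpmr, perfmon, iostat/vmstat, or Domino platform statistics.

```python
from statistics import mean, pstdev


def summarize(samples):
    """Return a simple baseline profile for a list of numeric samples."""
    return {
        "mean": mean(samples),
        "stdev": pstdev(samples),
        "peak": max(samples),
    }


# Hypothetical CPU utilization samples (percent), one per collection interval.
routine_cpu = [20, 22, 18, 20]
peak_cpu = [60, 70, 65, 65]

# Keep separate profiles for ordinary hours and for peak time.
baseline = {
    "routine": summarize(routine_cpu),
    "peak": summarize(peak_cpu),
}
```

Keeping the peak-time profile separate makes it harder to mistake an ordinary end-of-month spike for a problem.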
Preparation: Knowing Your Limits
Compare your routine use to:
Vendor benchmarks
Third party testing/whitepapers
Software specifications
Know how much wiggle room you have
CPU utilization
RAM consumption
ESPECIALLY important in virtual environments
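One way to put a number on "wiggle room" is the fraction of a limit still unused at observed peak. This is only a sketch: the `headroom` helper, the resource names, and the figures are assumptions for illustration, and real limits would come from vendor benchmarks, whitepapers, or software specifications.

```python
def headroom(observed_peak, limit):
    """Fraction of the limit still unused at observed peak (0.0 to 1.0)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(0.0, (limit - observed_peak) / limit)


# Hypothetical (observed peak, documented limit) pairs.
resources = {
    "cpu_percent": (75, 100),
    "ram_gb": (12, 16),
}

wiggle_room = {
    name: headroom(peak, limit) for name, (peak, limit) in resources.items()
}
```

In a virtual environment the limit may be the share actually granted to the guest rather than the capacity of the physical host, so it is worth confirming which figure is being used.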
Execution: Ask Your Neighbors
Many deployments in your environment share potential points of failure
Load Balancers
SAN
Quick check with peers may identify common problem quickly
Formalize this process if you can (weekly outage reports?)
May also be indicative of general network issues
Allows you to handle some issues without vendor involvement
Execution: Identify/Refine the Target
Most missed aspect of troubleshooting
Identify scope/range of affected users
Identify scope/range of affected servers
LOOK FOR COMMON FACTORS!
Third-party applications
Same location
Same release
Time of day
Check for customizations
Follow the data flow!
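Looking for common factors can be sketched as counting which attribute values the affected users share. The `common_factors` helper, the record fields, and the values below are all hypothetical.

```python
from collections import Counter


def common_factors(affected, threshold=1.0):
    """Return (attribute, value) pairs shared by at least `threshold` of the records."""
    if not affected:
        return []
    counts = Counter(
        (key, value) for record in affected for key, value in record.items()
    )
    needed = threshold * len(affected)
    return sorted(pair for pair, n in counts.items() if n >= needed)


# Hypothetical records describing the users who reported the problem.
affected_users = [
    {"location": "Raleigh", "release": "8.5.2", "server": "mail01"},
    {"location": "Raleigh", "release": "8.5.1", "server": "mail02"},
    {"location": "Raleigh", "release": "8.5.2", "server": "mail03"},
]

shared_by_all = common_factors(affected_users)
shared_by_most = common_factors(affected_users, threshold=0.6)
```

A factor shared by every affected user is only interesting if unaffected users do not share it too, so the same count is worth running against a sample of users who are not reporting problems.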
Execution: Problem vs. Routine
Take a snapshot of the problem
Compare it to routine data
May identify particular areas of concern
May allow vendor to focus their efforts better/faster
Examples:
Domino NSD NAMElookup activity
Perfmon/perfpmr/iostat disk queuing
Pay particular attention to period just BEFORE problem (last 10 minutes)
Be prepared to be pointed in a different direction!
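The comparison of a problem snapshot against routine data can be sketched as follows. The metric names, values, tolerance, and both helper functions are assumptions for illustration, not output from any particular tool.

```python
from datetime import datetime, timedelta


def deviations(routine, problem, tolerance=0.5):
    """Return metrics whose relative change from routine exceeds `tolerance`."""
    flagged = {}
    for name, base in routine.items():
        if name not in problem or base == 0:
            continue
        change = (problem[name] - base) / base
        if abs(change) > tolerance:
            flagged[name] = round(change, 2)
    return flagged


def window_before(samples, problem_start, minutes=10):
    """Return (timestamp, value) samples from the period just before the problem."""
    start = problem_start - timedelta(minutes=minutes)
    return [(t, v) for t, v in samples if start <= t < problem_start]


# Hypothetical routine profile and problem snapshot.
routine = {"disk_queue": 2.0, "cpu_percent": 40.0, "sessions": 500.0}
problem = {"disk_queue": 9.0, "cpu_percent": 44.0, "sessions": 480.0}
flagged = deviations(routine, problem)

# Hypothetical disk queue samples around a problem that began at 10:00.
samples = [
    (datetime(2011, 1, 15, 9, 45), 2.1),
    (datetime(2011, 1, 15, 9, 52), 4.0),
    (datetime(2011, 1, 15, 9, 58), 8.5),
    (datetime(2011, 1, 15, 10, 1), 9.0),
]
lead_up = window_before(samples, datetime(2011, 1, 15, 10, 0))
```

Here only disk queuing stands out, which narrows the next round of data gathering and gives a vendor something specific to look at.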
Execution: Client, Server or Both?
DON'T GO AFTER A FLY WITH A SLEDGEHAMMER!
Resist the urge to turn on all the debug
Overly ambitious debug can present its own performance cost
DEBUG_TCP_ALL in IBM Lotus Domino
VP_TRACE_ALL in IBM Lotus Sametime
debug=FINEST in Java
It's worth a round of data gathering to target server debug more specifically
High-level client-side debug correlates well with trace logs
Live HTTP Headers (Firefox add-on)
Firebug (Firefox add-on)
Fiddler (MSIE proxy)
Again, gather twice - routine and problem - when possible
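The same idea of targeted debug, shown by analogy with Python's standard `logging` module rather than the Domino, Sametime, or Java settings named above. The logger names `app.directory` and `app.mail` are hypothetical components.

```python
import logging

# Leave everything at WARNING by default instead of enabling all debug.
logging.getLogger().setLevel(logging.WARNING)

# Hypothetical component that earlier data gathering pointed to.
suspect = logging.getLogger("app.directory")
suspect.setLevel(logging.DEBUG)

# Hypothetical unrelated component; it inherits the quieter root level.
other = logging.getLogger("app.mail")

suspect_is_verbose = suspect.isEnabledFor(logging.DEBUG)
other_is_verbose = other.isEnabledFor(logging.DEBUG)
```

Only the suspect component pays the cost of verbose output, and the resulting log is small enough to correlate with client-side traces.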
Execution: Recent Changes
Back to Change Control
Look for ANY changes close to start of problem
Don't forget to check for OS patches/updates
Look for new stuff too...
Check all along the data flow
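Checking for changes close to the start of the problem can be sketched as a time-window filter over change records gathered from every point along the data flow. The record layout, the 48-hour lookback, and the entries themselves are hypothetical.

```python
from datetime import datetime, timedelta


def changes_near(changes, problem_start, lookback_hours=48):
    """Return changes made within `lookback_hours` before the problem, newest first."""
    earliest = problem_start - timedelta(hours=lookback_hours)
    recent = [c for c in changes if earliest <= c["when"] <= problem_start]
    return sorted(recent, key=lambda c: c["when"], reverse=True)


# Hypothetical change records collected from several teams.
change_log = [
    {"when": datetime(2011, 1, 10, 2, 0), "where": "firewall", "what": "rule update"},
    {"when": datetime(2011, 1, 14, 23, 0), "where": "OS", "what": "patch applied"},
    {"when": datetime(2011, 1, 15, 6, 30), "where": "load balancer", "what": "new pool member"},
]

problem_start = datetime(2011, 1, 15, 9, 0)
suspects = changes_near(change_log, problem_start)
```

Sorting newest first puts the changes closest to the start of the problem at the top of the list.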
Lather, Rinse, Repeat...
Be prepared to cycle through this process several times
Apply same principles to each area of troubleshooting
Example:
Identify/Refine shows only particular users suffering
Logs show directory issues
Now, users not experiencing problems are "routine"
Troubleshoot directory by comparing problem users against routine users
e.g. get LDIF dumps for both
Only go where the evidence takes you!
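Comparing a problem user against a routine user can be sketched as an attribute-by-attribute diff. This assumes the LDIF dumps have already been parsed into dictionaries of attribute name to value list; the parsing step is not shown, and the attribute names and values are hypothetical.

```python
def attribute_diff(problem_entry, routine_entry):
    """Return attributes whose values differ between two directory entries."""
    diff = {}
    for attr in sorted(set(problem_entry) | set(routine_entry)):
        problem_value = problem_entry.get(attr)
        routine_value = routine_entry.get(attr)
        if problem_value != routine_value:
            diff[attr] = {"problem": problem_value, "routine": routine_value}
    return diff


# Hypothetical entries, already parsed from LDIF dumps.
problem_user = {
    "objectclass": ["person"],
    "mailserver": ["mail01"],
    "memberof": ["staff"],
}
routine_user = {
    "objectclass": ["person"],
    "mailserver": ["mail02"],
    "memberof": ["staff"],
    "mailfile": ["mail/ruser"],
}

differences = attribute_diff(problem_user, routine_user)
```

Identity attributes such as names will always differ between two users, so the differences worth chasing are the ones that repeat across several problem/routine pairs.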
QUESTIONS & ANSWERS
Please complete a session evaluation!
More questions? Find me in the Lotus Solutions Development Lab!
THANKS FOR BEING HERE!