Date post: | 08-Feb-2017 |
Category: |
Software |
Upload: | intellimagic |
1
Transform RMF/SMF into Availability Intelligence
Brent Phillips, Managing Director, Americas
Jerry Street, Senior Performance Consultant
www.intellimagic.com
2
Today’s Agenda
1. z/OS infrastructure availability
2. The “Availability Intelligence” concept
3. Sample use cases
4. How to create intelligence about threats to availability
5. Availability Intelligence as a Service
3
Session Abstract: Using Availability Intelligence to Better Protect the Production Site
• It is time for a new, more intelligent approach to interpreting the RMF & SMF data, one that provides a dramatically different result that you can easily verify on your own data.
• RMF & SMF produce the world’s richest source of machine-generated data about enterprise infrastructure performance and configuration. But even the best-run shops are not able to use this data to avoid incidents causing unavailability.
• To outsmart unavailability, you have to automatically “crawl” through all the workload data every day at a very granular level. This data needs to be enriched and constantly evaluated against detailed expert knowledge about the infrastructure.
• Using the automatic application of expert domain knowledge to mine and interpret the data, you can see the risk in your infrastructure’s ability to handle your peak workloads, and how that risk is changing over time. This new visibility gives you warning before your online monitors can even detect any disruption to service levels.
4
We are inspired by creating intelligence that illuminates the risks hiding inside your IT infrastructure.
“Any sufficiently advanced technology is indistinguishable from magic”
Arthur C. Clarke, 1962
5
1. z/OS Infrastructure Availability
6
Availability on z/OS Systems
• What does the “z” stand for?
‒ “zero downtime”
• What is your availability?
• z/OS vs. end-user experience
7
z/OS Infrastructure Areas
• Many components required:
‒ Processor, Memory, WLM Goals, etc.
‒ Channels
‒ Coupling Facility
‒ XCF
‒ FICON
‒ Disk Storage
‒ Replication / DR / GDPS
‒ Tape / Virtual Tape Storage
8
Infrastructure Availability Today: Either Good or Bad
[Figure labels: Status: Available. Stress level: Little / Panic. Brain: Fully Engaged / Hard to focus.]
9
The Missing Stage: About to Be Bad
[Figure labels: Status: Available. Stress level: Little / Healthy / Panic. Brain: Engaged / Hard to focus.]
10
Seeing Threats to Continuous Availability
• Question: Which has better intelligence to avoid outages:
‒ A 20 thousand dollar automobile; or
‒ A SAN storage infrastructure costing millions of dollars?
11
Incidents Leading to Application Unavailability
[Figure: incidents split into Predictable and Unpredictable portions]
Response for Unpredictable:
• Find the problem quicker
• Accelerate the problem fix
Response for Predictable:
• Avoid incident with proactive action
12
Increasing the Predictable Portion
[Figure: incidents split into Predictable and Unpredictable portions]
What would be the impact on:
1. Your IT staff?
2. Your Employees?
3. Your Customers?
13
2. Availability Intelligence
14
© IntelliMagic 2014
IT Infrastructure Availability Monitoring Today
[Figure: SLA performance over time, showing Response Time, Detection and End-user impact]
Your existing monitors look at symptoms here, only after users experience problems
Response Time: easy metric to get, but it is an effect, not a cause
15
Monitoring with Availability Intelligence
[Figure: SLA performance over time, showing Sub-component Saturation, Response Time, Detection and End-user impact]
Availability Intelligence identifies risk here, before response time suffers
Requires evaluating every data point with expert domain knowledge about every component
Response Time: easy metric to get, but it is an effect, not a cause
16
Changing the Outcome - Avoiding Disruptions
[Figure: SLA performance over time, showing Sub-component Saturation and Response Time]
Most infrastructure “fires” can be prevented by intervening here
No end user impact
17
What is Availability Intelligence?
What: Foreknowledge about hidden threats to availability
Why: To better protect continuous availability at primary site by
1. Avoiding incidents (make more of them predictable)
2. Accelerating the resolution (reduce MTTR)
How: Use built-in expert domain knowledge in automatic analysis of the performance and configuration data
18
What Availability Intelligence Requires
• It is not enough to only have:
‒ Easier, nicer graphs, visualizations
‒ Statistical analysis (as common w/ ITOA - IT Operations Analytics)
• Rather, understanding what the data means for risk requires:
‒ HW component knowledge (as gained from performance modeling)
‒ Judging whether a value is good or bad, and rating the risk of unavailability
‒ How to derive new, meaningful metrics out of the raw data
‒ Best practices to configure and manage infrastructure
‒ How to visualize the risk and problems in the infrastructure
19
Illuminating Threats Inside the Storage Arrays
Lag Measure:
• Storage Array Response Times
Lead Measures:
• Imbalance? (Within Array, Between Arrays)
• Application Workloads
• Config or Failure Changes?
• Disk Device Loads
• FW Bypass, etc.
• Back-end, Cache
• Adapter Utilization
• Fibre Switch Errors
• Front-end
20
3. Sample Use Cases
21
Data Center Rollups of KRIs - Key Risk Indicators
Disk Storage Systems
Performance Metrics
Key Risk Indicators
Highest Rating for this Dashboard
Consolidate individual ratings on infrastructure resources into data center views to see risk across the enterprise at a glance
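The rollup idea on this slide can be sketched in a few lines of Python. This is only an illustration, assuming that a higher rating means more risk and that a data center view shows the worst rating found among its resources; the `ResourceRating` type, the field names and the sample values are hypothetical and not taken from the product.

```python
from dataclasses import dataclass


@dataclass
class ResourceRating:
    data_center: str   # hypothetical grouping key
    resource: str      # e.g. a disk storage system name
    rating: float      # assumed scale: higher means more risk


def rollup_by_data_center(ratings):
    """Return, per data center, the resource with the highest rating."""
    worst = {}
    for r in ratings:
        current = worst.get(r.data_center)
        if current is None or r.rating > current.rating:
            worst[r.data_center] = r
    return worst


# Made-up sample ratings for three storage systems in two data centers.
ratings = [
    ResourceRating("DC1", "DSS-A", 0.49),
    ResourceRating("DC1", "DSS-B", 0.00),
    ResourceRating("DC2", "DSS-C", 0.10),
]
rollup = rollup_by_data_center(ratings)
for dc, r in sorted(rollup.items()):
    print(dc, r.resource, r.rating)
```

Taking the maximum rather than an average keeps a single at-risk resource from being hidden by many healthy ones.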
22
Visualizing Risk to Continuous Availability
What does the data mean for your infrastructure availability?
Automatic rating of key metrics according to built-in expert knowledge, to obtain intelligence about threats you can use to protect availability
• No Border: No Rating
• Green Border: Good
• Yellow Border: Early Warning
• Red Border: Performance Exceptions
23
Rating the Risk using Expert Domain Knowledge
• Based on straight thresholds where appropriate (like hardware limits)
• Based on dynamic thresholds where the limits also depend on workload characteristics
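To make the distinction concrete, here is a minimal Python sketch of the two kinds of threshold. Everything numeric in it is an assumption for illustration: the 80% warning fraction, the cache-hit and cache-miss service times and the multipliers are invented, and the real expert rules are not described in these slides.

```python
def static_thresholds(hardware_limit, warning_fraction=0.8):
    """Fixed thresholds derived only from a hardware limit (assumed 80% warning)."""
    return {
        "warning": hardware_limit * warning_fraction,
        "exception": hardware_limit,
    }


def dynamic_response_time_thresholds(read_hit_ratio,
                                     hit_ms=0.3, miss_ms=6.0,
                                     warning_factor=1.5, exception_factor=2.5):
    """Thresholds that move with the workload.

    The expected response time is a weighted mix of hypothetical cache-hit
    and cache-miss times, so a cache-unfriendly workload is allowed a higher
    response time before it is flagged.
    """
    if not 0.0 <= read_hit_ratio <= 1.0:
        raise ValueError("read_hit_ratio must be between 0 and 1")
    expected_ms = read_hit_ratio * hit_ms + (1.0 - read_hit_ratio) * miss_ms
    return {
        "warning": expected_ms * warning_factor,
        "exception": expected_ms * exception_factor,
    }


port_limits = static_thresholds(hardware_limit=100.0)
cache_friendly = dynamic_response_time_thresholds(read_hit_ratio=0.9)
cache_unfriendly = dynamic_response_time_thresholds(read_hit_ratio=0.5)
print(port_limits, cache_friendly, cache_unfriendly)
```

The same measured response time can therefore be acceptable for one workload and an early warning for another.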
24
Disk Infrastructure Use Case: Avoiding disruption to production service levels
25
Disk Storage System Dashboard [rating: 0.49]
Rating based on DSS data using DSS Thresholds
Response Time on first storage array is rated green – no discernible problem to end-users yet.
But a threat to availability exists in an underlying metric (back-end disk drive read response rate)
26
Response Time (ms) [rating: 0.00]
Rating based on DSS data using DSS Thresholds
Response time is a lag measure
But seeing it plotted against the dynamic thresholds (grey backgrounds) is useful to have an idea of what can be expected for that type of workload on that particular array configuration
27
Breakdown of Response Time Components (ms)
Breakdown of response time into its components allows identification of the largest contributors
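As a small illustration of such a breakdown, the Python sketch below sums the four response time components named later in this deck (IOSQ, Pending, Connect, Disconnect) and reports the largest contributor. The sample values are made up for the example and are not taken from the charts.

```python
COMPONENTS = ("iosq", "pending", "connect", "disconnect")


def breakdown(sample_ms):
    """Return total response time, each component's share, and the largest one."""
    total = sum(sample_ms[c] for c in COMPONENTS)
    if total > 0:
        shares = {c: sample_ms[c] / total for c in COMPONENTS}
    else:
        shares = {c: 0.0 for c in COMPONENTS}
    largest = max(COMPONENTS, key=lambda c: sample_ms[c])
    return total, shares, largest


# Hypothetical interval averages in milliseconds.
sample = {"iosq": 0.1, "pending": 0.2, "connect": 0.7, "disconnect": 1.0}
total, shares, largest = breakdown(sample)
print(f"total {total:.2f} ms, largest contributor: {largest} "
      f"({shares[largest]:.0%})")
```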
28
Disconnect (ms) [rating: 0.00]
Rating based on DSS data using DSS Thresholds
Overall, Disconnect Time is not yet out of range for this array
29
Disconnect Time Components (ms)
Built-in knowledge enables a further breakdown of disconnect time into its components
30
Drive Read Response (ms) [rating: 0.49]
Rating based on DSS data using Thresholds Specific to this DSS Configuration
What was identified on the exception report is a deeper issue: back-end drives are starting to become saturated.
With minimal workload growth, this will soon show up in response time and impact production users
31
Cost comparison use case: Holistic Evaluation (CPU vs. IO)
32
Using and Delay components per Service Class (%) (top 20) for all Service Classes by Service Class
Faster job execution is required.
Question: For the selected service class(es), is it cheaper to obtain the needed performance win with upgraded CPU or storage?
33
Is the time spent waiting on DASD already best in class, or is there room for improvement?
[Figure: Average Response Time Components for Entire Subsystem. Components IOSQ, Pending, Connect and Disconnect in ms (axis 0 to 4), plotted over time from 0:30 to 2:30]
Approx 65% of Time is Using/Waiting on DASD
34
Comparing Options for Run Time Improvement
Results of Modeling: 1. upgrading CPU to best available vs. 2. upgrading storage to next generation

                     CPU Using   CPU Delay   DASD Using & Delay   Total Seconds   Run Time Savings
Before                    1196        1523                 3915            6634   n/a
1. CPU Upgrade             416         265                 3915            4596   15%
2. Storage Upgrade        1196        1523                 1027            3746   44%
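The arithmetic behind this comparison can be checked with a short Python sketch. It assumes that run time savings means the reduction in total seconds relative to the "Before" total; that definition is an assumption, and the dictionary keys are hypothetical names for the table columns. On that basis the storage upgrade comes out at about 44%, matching the table, while the CPU upgrade comes out at about 31%, which differs from the 15% shown, so that figure may rest on a basis not visible on the slide.

```python
# Modeled seconds per component, copied from the comparison table.
before = {"cpu_using": 1196, "cpu_delay": 1523, "dasd_using_delay": 3915}
options = {
    "CPU Upgrade": {"cpu_using": 416, "cpu_delay": 265, "dasd_using_delay": 3915},
    "Storage Upgrade": {"cpu_using": 1196, "cpu_delay": 1523, "dasd_using_delay": 1027},
}


def total_seconds(components):
    """Total run time is the sum of the using and delay components."""
    return sum(components.values())


def savings_pct(before_total, after_total):
    """Reduction in total seconds as a percentage of the 'before' total."""
    return 100.0 * (before_total - after_total) / before_total


before_total = total_seconds(before)
savings = {
    name: savings_pct(before_total, total_seconds(components))
    for name, components in options.items()
}
for name, pct in savings.items():
    print(f"{name}: {total_seconds(options[name])} s, saving {pct:.1f}%")
```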
35
4. How to Create Availability Intelligence out of the RMF/SMF data
36
IntelliMagic Vision for z/OS
1. z/OS Systems
‒ Processors, WLM, Coupling Facility, XCF, Jobs/Datasets
2. z/OS Disk & Replication
‒ Supports every Disk vendor and configuration
‒ FICON, Replication, Jobs, Datasets, Storage groups, GDPS…
3. z/OS Tape/Virtual Tape
‒ IBM TS7700, Oracle StorageTek VSM
‒ 2016: EMC DLm
37
How to generate Availability Intelligence: 7 areas to apply expert domain knowledge
Built-In Domain Knowledge and Expertise is Required for Availability Intelligence
[Figure: Machine generated data + Domain knowledge, expertise, combined through Automation into Availability Intelligence. Step shown: Collect]
38
How to generate Availability Intelligence
#1 - Collect data from different sources
• z/OS RMF/CMF Data
• Specific SMF Data
‒ IBM GDPS
‒ IBM XRC
• SAN Collector
• Vendor APIs
39
How to generate Availability Intelligence
#2 – Normalize
Validate, normalize and properly categorize collected data
• Consolidate same data from different sources
• Enable different summarization (e.g. by storage pool)
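A minimal Python sketch of the normalize step follows, assuming records arrive as dictionaries with hypothetical field names (`source`, `device`, `pool`, `interval`, `io_rate`). It drops duplicate measurements of the same device and interval reported by more than one source, then sums by storage pool. Keeping the first source seen is an arbitrary choice made only for this example.

```python
from collections import defaultdict

# Made-up sample records; D001 is reported by two different sources.
records = [
    {"source": "RMF", "device": "D001", "pool": "POOL1", "interval": "01:00", "io_rate": 120.0},
    {"source": "vendor_api", "device": "D001", "pool": "POOL1", "interval": "01:00", "io_rate": 120.0},
    {"source": "RMF", "device": "D002", "pool": "POOL1", "interval": "01:00", "io_rate": 80.0},
    {"source": "RMF", "device": "D003", "pool": "POOL2", "interval": "01:00", "io_rate": 50.0},
]


def consolidate(recs):
    """Keep one record per (device, interval); the first source seen wins."""
    unique = {}
    for rec in recs:
        unique.setdefault((rec["device"], rec["interval"]), rec)
    return list(unique.values())


def summarize_by(recs, field):
    """Sum the I/O rate over any categorical field, e.g. the storage pool."""
    totals = defaultdict(float)
    for rec in recs:
        totals[rec[field]] += rec["io_rate"]
    return dict(totals)


pool_totals = summarize_by(consolidate(records), "pool")
print(pool_totals)
```

Without the consolidation step, D001 would be counted twice and POOL1 would appear busier than it is.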
40
How to generate Availability Intelligence
#3 – Enrich
Fill the gaps
• Calculate component utilization from workload data
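One generic way to derive a utilization that is not measured directly is the utilization law from operational analysis: utilization equals throughput multiplied by service time per operation. The Python sketch below applies it with invented numbers. Whether the product uses this exact relation for any given component is not stated in the slides, so treat it as an illustration of the idea only.

```python
def utilization(ops_per_second, service_time_ms):
    """Utilization law: busy fraction = throughput x service time per operation."""
    if ops_per_second < 0 or service_time_ms < 0:
        raise ValueError("inputs must be non-negative")
    return ops_per_second * service_time_ms / 1000.0


# Hypothetical back-end drive: 150 operations/s, 4 ms of service per operation.
drive_util = utilization(ops_per_second=150, service_time_ms=4.0)
print(f"derived drive utilization: {drive_util:.0%}")
```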
41
How to generate Availability Intelligence
#4 – Assess
Is it good or bad?
• Apply hardware and workload knowledge
• Are the metrics as expected based on the used hardware and workload profile?
42
How to generate Availability Intelligence
#5 – Rate
How significant and risky is it?
• Rating is always based on two thresholds (warning, exception)
• Rating is based on
‒ Knowledge of HW components
‒ Best practices
• Focused on lead measures
• Avoid false positives
• Avoid false negatives
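A two-threshold rating can be sketched in Python as below. The colour names follow the border legend used earlier in the deck (green for good, yellow for early warning, red for exceptions). The numeric scale, with 0 below the warning threshold and a linear ramp up to 1 at the exception threshold, is purely an assumption for illustration and is not the product's actual rating formula.

```python
def rate(value, warning, exception):
    """Map a metric value to a (rating, colour) pair using two thresholds."""
    if exception <= warning:
        raise ValueError("exception threshold must be above warning threshold")
    if value < warning:
        return 0.0, "green"
    if value < exception:
        # Assumed linear ramp between the two thresholds.
        return (value - warning) / (exception - warning), "yellow"
    return 1.0, "red"


# Hypothetical thresholds: warning at 10 ms, exception at 20 ms.
for value in (5.0, 15.0, 25.0):
    print(value, rate(value, warning=10.0, exception=20.0))
```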
43
How to generate Availability Intelligence
#6 – Recommendations
What to do next?
• For the rated exceptions in the entire environment, include recommendations about what is likely going on and what to do…
44
How to generate Availability Intelligence
#7 - Visualization
• Optimized presentation of results
• Rating always visible
‒ By coloured frames and bubbles
• Web Interface
• Automated reporting based on rating result
45
7 Key Areas to Apply Expert Knowledge to SMF/RMF
[Figure: the seven areas: Collect, Normalize, Enrich, Assess, Rate, Recommend, Visualization]
Benefits
1. Neutralize Threats
2. Accelerate fixes
Sample actions:
• Rebalance work
• Fix lost redundancy
• Isolate change
• Correct error
• Hardware upgrade
46
Automation & the Power of Always Knowing
• Identify risk for every interval, on every device, in every data center
• A “thousand pairs of eyes” is the only way to continually execute the ITIL v3 definition of the capacity management process:
– ensuring…the IT Infrastructure is able to deliver agreed Service Level Targets in a cost effective and timely manner…considers all Resources required to deliver the IT Service...
47
5. Availability Intelligence as a Service from IntelliMagic
48
IntelliMagic
• Creating the world’s best intelligence about performance and availability risk in your infrastructure
• 20+ year history of delivering solutions for deep infrastructure analysis
• Privately held, financially independent
• Customer centric, responsive
• Solutions used daily in some of the world’s largest data centers
49
Availability Intelligence as a Service
• Stays up to date on frequently updated hardware knowledge
• Very quick time to results (~24 hours)
• Okay for security - no PII in infrastructure measurement data
• Easy dissemination of intelligence reports
• Easy access to expert consultants
50
Example US Mainframe SaaS Customers
• Insurance
‒ One of the largest in the US
• Banking/Financial
‒ One of the largest in the US
• Shipping
‒ One of the largest in the US
51
Conclusion
Outsmart Unavailability with the world’s best intelligence about the current levels of risk hiding in your z/OS infrastructure.
To see the difference, just send us the historical RMF/SMF data prior to your last service disruption.
For questions/more details, contact: [email protected]
“Any sufficiently advanced technology is indistinguishable from magic”
Arthur C. Clarke, 1962