Transform RMF and SMF into Availability Intelligence

1

Transform RMF/SMF into Availability Intelligence

Brent Phillips, Managing Director, Americas
Jerry Street, Senior Performance Consultant

www.intellimagic.com

2

Today’s Agenda

1. z/OS infrastructure availability
2. The “Availability Intelligence” concept
3. Sample use cases
4. How to create intelligence about threats to availability
5. Availability Intelligence as a Service

3

Session Abstract: Using Availability Intelligence to Better Protect the Production Site

• It is time for a new, more intelligent approach to interpreting the RMF & SMF data. One that provides a dramatically different result that you can easily verify on your own data.

• RMF & SMF produce the world’s richest source of machine-generated data about enterprise infrastructure performance and configuration. But even the best run shops are not able to use this data to avoid incidents causing unavailability.

• To outsmart unavailability, you have to automatically “crawl” through all the workload data every day at a very granular level. This data needs to be enriched and constantly evaluated against detailed expert knowledge about the infrastructure.

• Using the automatic application of expert domain knowledge to mine and interpret the data, you can see the risk to your infrastructure’s ability to handle your peak workloads, and how that risk is changing over time. This new visibility gives you warning before your online monitors can even detect any disruption to service levels.

4

We are inspired by creating intelligence

that illuminates the risks hiding inside your IT infrastructure.

“Any sufficiently advanced technology is indistinguishable from magic”

Arthur C. Clarke, 1962

5

1. z/OS Infrastructure Availability

6

Availability on z/OS Systems

• What does the “z” stand for?

“zero downtime”

• What is your availability?

• z/OS vs. end-user experience

7

z/OS Infrastructure Areas

• Many components required:

‒ Processor, Memory, WLM Goals, etc.
‒ Channels
‒ Coupling Facility
‒ XCF
‒ FICON
‒ Disk Storage
‒ Replication / DR / GDPS
‒ Tape / Virtual Tape Storage

8

Infrastructure Availability Today: Either Good or Bad

[Figure: chart relating stress level to available brain capacity. Surviving labels: Little, Full, Engaged, Panic, Hard to focus.]

9

The Missing Stage: About to Be Bad

[Figure: chart relating stress level to available brain capacity. Surviving labels: Little, Healthy, Engaged, Panic, Hard to focus.]

10

Seeing Threats to Continuous Availability

• Question: Which has better intelligence to avoid outages:

‒ A 20-thousand-dollar automobile; or
‒ A SAN storage infrastructure costing millions of dollars?

11

Incidents Leading to Application Unavailability

[Figure: incidents split into a Predictable portion and an Unpredictable portion]

Response for Unpredictable:

• Find the problem quicker
• Accelerate the problem fix

Response for Predictable:

• Avoid incident with proactive action

12

Increasing the Predictable Portion

[Figure: incidents split into a larger Predictable portion and a smaller Unpredictable portion]

What would be the impact on:

1. Your IT staff?
2. Your Employees?
3. Your Customers?

13

2. Availability Intelligence

14

IT Infrastructure Availability Monitoring Today

[Figure: response time over time against the SLA performance level, with the points of detection and end-user impact marked]

Your existing monitors look at symptoms here, only after users experience problems.

Response time: easy metric to get, but is an effect, not a cause.

© IntelliMagic 2014

15

Monitoring with Availability Intelligence

[Figure: response time and sub-component saturation over time against the SLA performance level, with the points of detection and end-user impact marked]

Availability Intelligence identifies risk here, before response time suffers.

Requires evaluating every data point with expert domain knowledge about every component.

Response time: easy metric to get, but is an effect, not a cause.


Changing the Outcome - Avoiding Disruptions

[Figure: response time and sub-component saturation over time against the SLA performance level]

Most infrastructure “fires” can be prevented by intervening here.

No end-user impact.

17

What is Availability Intelligence?

What: Foreknowledge about hidden threats to availability

Why: To better protect continuous availability at primary site by
1. Avoiding incidents (make more of them predictable)
2. Accelerating the resolution (reduce MTTR)

How: Use built-in expert domain knowledge in automatic analysis of the performance and configuration data

18

What Availability Intelligence Requires

• It is not enough to only have:
‒ Easier, nicer graphs and visualizations
‒ Statistical analysis (as common with ITOA - IT Operations Analytics)

• Rather, understanding what the data means for risk requires:
‒ HW component knowledge (as gained from performance modeling)
‒ Judging whether a value is good or bad, and rating the risk of unavailability
‒ How to derive new, meaningful metrics out of the raw data
‒ Best practices to configure and manage infrastructure
‒ How to visualize the risk and problems in the infrastructure

19

Illuminating Threats Inside the Storage Arrays

[Figure: lag and lead measures inside the storage arrays]

Lag Measure: Storage Array Response Times

Lead Measures:
‒ Imbalance? (Within Array, Between Arrays)
‒ Application Workloads
‒ Config or Failure Changes?
‒ Disk Device Loads
‒ FW Bypass, etc.
‒ Front-end
‒ Back-end, Cache, Adapter Utilization
‒ Fibre Switch Errors

20

3. Sample Use Cases

21

Data Center Rollups of KRIs - Key Risk Indicators

[Figure: data center dashboard for Disk Storage Systems, showing Performance Metrics, Key Risk Indicators, and the Highest Rating for this Dashboard]

Consolidate individual ratings on infrastructure resources into data center views to see risk across the enterprise at a glance
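
A rollup of this kind can be sketched as taking the highest rating among the resources in each data center. This is only an illustration under stated assumptions: the data center names, resource names, rating values, and the `rollup_by_data_center` helper are hypothetical, and using the maximum mirrors the “Highest Rating for this Dashboard” label rather than any documented product logic.

```python
from collections import defaultdict


def rollup_by_data_center(ratings):
    """Return the highest resource rating seen in each data center.

    `ratings` is an iterable of (data_center, resource, rating) tuples,
    where a higher rating is assumed to mean more risk.
    """
    highest = defaultdict(float)  # missing data centers start at 0.0
    for data_center, _resource, rating in ratings:
        highest[data_center] = max(highest[data_center], rating)
    return dict(highest)


# Hypothetical sample ratings for three disk storage systems.
sample = [
    ("DC-East", "DSS-01", 0.49),
    ("DC-East", "DSS-02", 0.00),
    ("DC-West", "DSS-03", 0.10),
]

for data_center, rating in rollup_by_data_center(sample).items():
    print(f"{data_center}: highest rating {rating:.2f}")
```

Taking the maximum rather than an average keeps a single risky resource from being hidden by many healthy ones.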

22

Visualizing Risk to Continuous Availability

What does the data mean for your infrastructure availability?

Automatic rating of key metrics according to built-in expert knowledge, to obtain intelligence about threats you can use to protect availability

‒ No Border: No Rating
‒ Green Border: Good
‒ Yellow Border: Early Warning
‒ Red Border: Performance Exceptions

23

Rating the Risk using Expert Domain Knowledge

• Based on straight thresholds where appropriate (like hardware limits)

• Based on dynamic thresholds where the limits also depend on workload characteristics
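
The difference between the two kinds of threshold can be sketched as follows. Everything here is an assumption made for illustration: the `Thresholds` type, the port utilization limits, and the rule that ties response time limits to the read hit ratio are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warning: float
    exception: float


# Straight threshold: a fixed limit that does not depend on the workload.
# The numbers are hypothetical utilization percentages.
STATIC_PORT_UTILIZATION = Thresholds(warning=50.0, exception=70.0)


def dynamic_response_time_thresholds(read_hit_ratio):
    """Hypothetical dynamic rule for response time limits in ms.

    A workload with more cache hits is expected to respond faster,
    so its limits are tighter than those of a cache-unfriendly one.
    """
    if not 0.0 <= read_hit_ratio <= 1.0:
        raise ValueError("read_hit_ratio must be between 0 and 1")
    miss_ratio = 1.0 - read_hit_ratio
    warning = 0.5 + 4.0 * miss_ratio
    return Thresholds(warning=warning, exception=2.0 * warning)


cache_friendly = dynamic_response_time_thresholds(1.0)
cache_unfriendly = dynamic_response_time_thresholds(0.5)
print(STATIC_PORT_UTILIZATION, cache_friendly, cache_unfriendly)
```

The same measured response time could therefore be acceptable for one workload and a warning for another, which is the point of making the limits depend on workload characteristics.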

24

Disk Infrastructure Use Case: Avoiding disruption to production service levels

25

Disk Storage System Dashboard [rating: 0.49]
Rating based on DSS data using DSS Thresholds

Response Time on the first storage array is rated green: no discernible problem to end-users yet.

But a threat to availability exists in an underlying metric (back-end disk drive read response time).

26

Response Time (ms) [rating: 0.00]
Rating based on DSS data using DSS Thresholds

Response time is a lag measure.

But seeing it plotted against the dynamic thresholds (grey backgrounds) is useful to have an idea of what can be expected for that type of workload on that particular array configuration.

27

Breakdown of Response Time Components (ms)

Breakdown of response time into its components allows identification of the largest contributors
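
As a small sketch of this breakdown, the snippet below ranks the four z/OS I/O response time components (IOSQ, Pending, Connect, Disconnect) by their share of the total. The millisecond values and the `largest_contributors` helper are hypothetical and only illustrate the idea.

```python
def largest_contributors(components_ms):
    """Rank response time components by their share of the total.

    Returns (name, milliseconds, share) tuples, largest first.
    """
    total = sum(components_ms.values())
    if total <= 0:
        raise ValueError("total response time must be positive")
    ranked = sorted(components_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / total) for name, ms in ranked]


# Hypothetical interval averages in milliseconds.
interval = {"IOSQ": 0.25, "Pending": 0.25, "Connect": 0.5, "Disconnect": 1.0}

for name, ms, share in largest_contributors(interval):
    print(f"{name:<10} {ms:5.2f} ms  {share:6.1%}")
```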

28

Disconnect (ms) [rating: 0.00]
Rating based on DSS data using DSS Thresholds

Overall, Disconnect Time is not yet out of range for this array

29

Disconnect time components (ms)

Built-in knowledge enables a further breakdown of disconnect time into its components

30

Drive Read Response (ms) [rating: 0.49]
Rating based on DSS data using Thresholds Specific to this DSS Configuration

What was identified on the exception report is a deeper issue: back-end drives are starting to become saturated.

With minimal workload growth, this will soon show up in response time and impact production users.

31

Cost comparison use case: Holistic Evaluation (CPU vs. IO)

32

Using and Delay components per Service Class (%) (top 20) for all Service Classes by Service Class

Faster job execution is required.

Question: For the selected service class(es), is it cheaper to obtain the needed performance win with upgraded CPU or storage?

33

Is the time spent waiting on DASD already best in class, or is there room for improvement?

[Figure: Average Response Time Components for Entire Subsystem, in ms (0 to 4), from 0:30 to 2:30. Series: IOSQ, Pending, Connect, Disconnect.]

Approx. 65% of time is Using/Waiting on DASD

34

Comparing Options for Run Time Improvement

Results of modeling: (1) upgrading CPU to best available vs. (2) upgrading storage to next generation

Option              CPU Using  CPU Delay  DASD Using & Delay  Total Seconds  Run Time Savings
Before                   1196       1523                3915           6634  n/a
1. CPU Upgrade            416        265                3915           4596  15%
2. Storage Upgrade       1196       1523                1027           3746  44%
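
The comparison can be re-derived from the component seconds shown on this slide. The sketch below assumes that run time savings means the reduction in Total Seconds relative to Before; the `run_time_savings` helper is hypothetical. Computed this way, option 2 comes out at about 44%, matching the slide, while option 1 comes out at about 31% rather than the 15% listed, so the slide may be using a different basis for that row.

```python
def run_time_savings(before_total, after_total):
    """Fraction of run time saved relative to the baseline."""
    return (before_total - after_total) / before_total


# Seconds taken from the slide: (CPU Using, CPU Delay, DASD Using & Delay).
options = {
    "Before": (1196, 1523, 3915),
    "1. CPU Upgrade": (416, 265, 3915),
    "2. Storage Upgrade": (1196, 1523, 1027),
}

totals = {name: sum(parts) for name, parts in options.items()}
baseline = totals["Before"]

for name, total in totals.items():
    saving = run_time_savings(baseline, total)
    print(f"{name:<20} total {total:5d} s  savings {saving:5.1%}")
```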

35

4. How to Create Availability Intelligence out of the RMF/SMF data

36

IntelliMagic Vision for z/OS

1. z/OS Systems
‒ Processors, WLM, Coupling Facility, XCF, Jobs/Datasets

2. z/OS Disk & Replication
‒ Supports every disk vendor and configuration
‒ FICON, Replication, Jobs, Datasets, Storage groups, GDPS…

3. z/OS Tape/Virtual Tape
‒ IBM TS7700, Oracle StorageTek VSM
‒ 2016: EMC DLm

37

How to generate Availability Intelligence: 7 areas to apply expert domain knowledge

Built-In Domain Knowledge and Expertise is Required for Availability Intelligence

[Figure: machine-generated data + domain knowledge and expertise + automation, combined to produce Availability Intelligence]

38

How to generate Availability Intelligence



#1 - Collect data from different sources

• z/OS RMF/CMF Data
• Specific SMF Data
‒ IBM GDPS
‒ IBM XRC
• SAN Collector
• Vendor APIs

39




#2 – Normalize

Validate, normalize and properly categorize collected data
• Consolidate same data from different sources
• Enable different summarization (e.g. by storage pool)
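
A minimal sketch of the normalization step, assuming two sources that describe volumes with different field names. The source names, field names, and sample records are hypothetical; real RMF/SMF records and vendor API payloads look different.

```python
from collections import defaultdict

# Hypothetical mapping from each source's field names to one common schema.
FIELD_ALIASES = {
    "rmf": {"volser": "volume", "io_per_sec": "io_rate", "pool": "storage_pool"},
    "vendor_api": {"vol_id": "volume", "iops": "io_rate", "pool_name": "storage_pool"},
}


def normalize(source, record):
    """Rename a record's known fields to the common schema and drop the rest."""
    aliases = FIELD_ALIASES[source]
    return {aliases[key]: value for key, value in record.items() if key in aliases}


def io_rate_by_pool(records):
    """Summarize normalized records by storage pool."""
    totals = defaultdict(float)
    for record in records:
        totals[record["storage_pool"]] += record["io_rate"]
    return dict(totals)


collected = [
    ("rmf", {"volser": "PRD001", "io_per_sec": 120.0, "pool": "POOL_A"}),
    ("vendor_api", {"vol_id": "PRD002", "iops": 80.0, "pool_name": "POOL_A"}),
    ("vendor_api", {"vol_id": "TST001", "iops": 15.0, "pool_name": "POOL_B"}),
]

normalized = [normalize(source, record) for source, record in collected]
print(io_rate_by_pool(normalized))
```

Once every record uses the same field names, any other summarization (by array, by system, by data center) becomes a matter of changing the grouping key.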

40




#3 – Enrich

Fill the gaps
• Calculate component utilization from workload data
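
One common way to estimate a utilization that is not measured directly is the utilization law from queueing theory: busy fraction equals throughput times service time. The sketch below applies it to a single component; it is an assumption about how such a gap could be filled, not a description of the product, and the sample numbers are hypothetical.

```python
def utilization(io_per_second, service_time_ms):
    """Utilization law: busy fraction = throughput x service time.

    `io_per_second` is the I/O rate handled by one component and
    `service_time_ms` is the average time it is busy per I/O.
    """
    if io_per_second < 0 or service_time_ms < 0:
        raise ValueError("inputs must be non-negative")
    return io_per_second * (service_time_ms / 1000.0)


# Hypothetical back-end drive: 150 I/Os per second at 4 ms each.
busy = utilization(150.0, 4.0)
print(f"Estimated utilization: {busy:.0%}")
```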

41




#4 – Assess

Is it good or bad?
• Apply hardware and workload knowledge
• Are the metrics as expected based on the used hardware and workload profile?

42




#5 – Rate

How significant and risky is it?

• Rating is always based on two thresholds (warning, exception)
• Rating is based on
‒ Knowledge of HW components
‒ Best practices
• Focused on lead measures
• Avoid false positives
• Avoid false negatives
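
A two-threshold rating can be sketched as a simple classification into the green, yellow, and red borders used on the dashboards. The `rate` function, the assumption that higher values are worse, and the sample thresholds are hypothetical; how the product derives numeric ratings such as 0.49 is not described in these slides.

```python
def rate(value, warning, exception):
    """Classify a metric against a warning and an exception threshold.

    Assumes higher values are worse and warning < exception.
    """
    if warning >= exception:
        raise ValueError("warning threshold must be below exception threshold")
    if value >= exception:
        return "red"      # performance exception
    if value >= warning:
        return "yellow"   # early warning
    return "green"        # good


# Hypothetical drive read response times in ms, with limits of 10 and 20 ms.
for response_ms in (4.0, 12.0, 25.0):
    print(response_ms, rate(response_ms, warning=10.0, exception=20.0))
```

Keeping the warning level separate from the exception level is what leaves room for an early warning before a metric reaches the point where users are affected.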

43




#6 – Recommendations

What to do next?
• For the rated exceptions in the entire environment, include recommendations about what is likely going on and what to do…

44




#7 - Visualization

• Optimized presentation of results
• Rating always visible
‒ By coloured frames and bubbles
• Web interface
• Automated reporting based on rating result

45

7 Key Areas to Apply Expert Knowledge to SMF/RMF

[Figure: cycle of the seven areas: Collect, Normalize, Enrich, Assess, Rate, Recommend, Visualize]

Benefits
1. Neutralize threats
2. Accelerate fixes

Sample actions:
• Rebalance work
• Fix lost redundancy
• Isolate change
• Correct error
• Hardware upgrade

46

Automation & the Power of Always Knowing

• Identify risk for every interval, on every device, in every data center

• A “thousand pairs of eyes” is the only way to continually execute the ITIL v3 definition of the capacity management process:
– ensuring…the IT Infrastructure is able to deliver agreed Service Level Targets in a cost effective and timely manner…considers all Resources required to deliver the IT Service...

47

5. Availability Intelligence as a Service from IntelliMagic

48

IntelliMagic

• Creating the world’s best intelligence about performance and availability risk in your infrastructure

• 20+ year history of delivering solutions for deep infrastructure analysis

• Privately held, financially independent
• Customer centric, responsive
• Solutions used daily in some of the world’s largest data centers

49

Availability Intelligence as a Service

• Stays up to date on frequently updated hardware knowledge
• Very quick time to results (~24 hours)
• Okay for security - no PII in infrastructure measurement data
• Easy dissemination of intelligence reports
• Easy access to expert consultants

50

Example US Mainframe SaaS Customers

• Insurance
‒ One of the largest in the US
• Banking/Financial
‒ One of the largest in the US
• Shipping
‒ One of the largest in the US

51

Conclusion

Outsmart Unavailability with the world’s best intelligence about the current levels of risk hiding in your z/OS infrastructure.

To see the difference, just send us the historical RMF/SMF data prior to your last service disruption.

For questions/more details, contact: [email protected]

“Any sufficiently advanced technology

is indistinguishable from magic”

Arthur C. Clarke, 1962


Date post:	08-Feb-2017
Category:	Software
Upload:	intellimagic