8/11/2019 RAC Operational Best Practices
RAC & ASM Best Practices: You Probably Need More than just RAC
Kirk McGowan
Technical Director, RAC Pack, Oracle Server Technologies
Cluster and Parallel Storage Development
Agenda: Operational Best Practices (IT MGMT 101)
Background
Requirements
Why RAC Implementations Fail
Case Study
Criticality of IT Service Management (ITIL) process
Best Practices
People, Process, AND Technology
Why do people buy RAC?
Low cost scalability
Cost reduction, consolidation, infrastructure that
can grow with the business
High Availability
Growing expectations for uninterrupted service.
Why do RAC Implementations fail?
RAC (scale-out clustering) is new technology
- Insufficient budget and effort are put towards filling the knowledge gap
HA is difficult to do, and cannot be done with technology alone
- Operational processes and discipline are critical success factors, but are not addressed sufficiently
Case Study
Based on true stories. Any resemblance, in
full or in part, to your own experiences is
intentional and expected.
Names have been changed to protect the
innocent
Case Study
Background
8-12 months spent implementing 2 systems: somewhat different architectures, very different workloads, identical tech stacks
Oracle expertise (Development) engaged to help flatten tech learning curve
Non-mission critical systems, but important elements of a larger enterprise re-architecture effort.
Many technology issues encountered across the stack, and resolved over the 8-12 month implementation
- Hw, OS, storage, network, rdbms, cluster, and application
Case Study
Situation
New mission critical deployment using same technology stack
Distinct architecture, applications development teams, and operations teams
Large staff turnover
Major escalation, post production
CIO: "Oracle products do not meet our business requirements"
- RAC is unstable
- DG doesn't handle the workload
- JDBC connections don't fail over
Case Study
Operational Issues
Requirements, aka SLOs, were not defined
- e.g. Claim of 20s failover time; application logic included 80s failover time; cluster failure detection time alone set to 120s.
Inadequate test environments
- Problems encountered first in production, including the fact that SLOs could not be met
Inadequate change control
- Lessons learned in previous deployments were not applied to new deployment: rediscovery of same problems
- Some changes implemented in test, but never rolled into production: recurring problems (outages) in production
- No process for confirming a change actually fixes the problem prior to implementing in production
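The failover figures in the first issue can be checked mechanically before go-live. The following is a minimal sketch, not part of the original deck: the class and function names are hypothetical, and the only numbers used are the 20s, 80s, and 120s quoted in the case study.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailoverBudget:
    """Timing figures in seconds (names are illustrative assumptions)."""
    claimed_slo_s: float      # failover time promised to the business
    app_timeout_s: float      # failover time built into application logic
    cluster_detect_s: float   # cluster failure detection time


def slo_violations(budget: FailoverBudget) -> List[str]:
    """List every component whose timeout alone already exceeds the claimed SLO."""
    problems = []
    if budget.app_timeout_s > budget.claimed_slo_s:
        problems.append("application failover timeout exceeds claimed SLO")
    if budget.cluster_detect_s > budget.claimed_slo_s:
        problems.append("cluster failure detection time exceeds claimed SLO")
    return problems


# Figures from the case study: 20s claimed, 80s in the app, 120s detection.
case_study = FailoverBudget(claimed_slo_s=20, app_timeout_s=80, cluster_detect_s=120)
for problem in slo_violations(case_study):
    print(problem)
```

With the case study numbers both checks fail, which is the point of the slide: the 20s claim could never have been met, whatever the technology did.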
Case Study
More Operational Issues
Poor knowledge transfer between internal teams
- Configuration recommendations, patches, fixes identified in previous deployments were not communicated.
- Evictions are a symptom, not the problem.
Inadequate system monitoring
- OS level statistics (CPU, IO, memory) were not being captured.
- Impossible to RCA on many problems without ability to correlate cluster / database symptoms with system level activity.
Inadequate Support procedures
- Inconsistent data capture
- No on-site vendor support consistent with criticality of system
No operations manual
- Managing and responding to outages
- Responding and restoring service after outages
Overview of Operational Process Requirements
What are ITIL Guidelines?
"ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems."
IT Service Management
IT Service Management = Service Delivery + Service Support
Service Delivery: partially concerned with setting up agreements and monitoring the targets within these agreements.
Service Support: processes can be viewed as delivering services as laid down in these agreements.
Provisioning of IT Service Mgmt
In all organizations, must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively. To meet this objective, three areas need to be considered:
- People with the right skills, appropriate training and the right service culture
- Effective and efficient Service Management processes
- Good IT Infrastructure in terms of tools and technology.
Unless People, Processes and Technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.
Service Delivery
Financial Management
Service Level Management
- Severity/priority definitions, e.g. Sev1, Sev2, Sev3, Sev4
- Response time guidelines
- SLAs
Capacity Management
IT Service Continuity Management
Availability Management
Service Support
Incident Management
- Incident documentation & reporting, incident handling, escalation procedures
Problem Management
- RCAs, QA & process improvement
Configuration Management
- Standard configs, gold images, CEMLIs
Change Management
- Risk assessment, backout, sw maintenance, decommission
Release Management
- New deployments, upgrades, emergency release, component release
BP: Set & Manage Expectations
Why is this important?
- Expectations with RAC are different at the outset
- HA is as much (if not more so) about the processes and procedures as it is about the technology
- No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs
Must communicate what the technology can AND can't do
Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met.
HA isn't cheap!
BP: Clearly define SLOs
Sufficiently granular
- Cannot architect, design, OR manage a system without clearly understanding the SLOs
- 24x7 is NOT an SLO
Define HA/recovery time objectives, throughput, response time, data loss, etc.
- Need to be established with an understanding of the cost of downtime for the system.
- RTO and RPO are key availability metrics
- Response time and throughput are key performance metrics
Must address different failure conditions
- Planned vs unplanned
- Localized vs site-wide
Must be linked to the business requirements
- Response time and resolution time
Must be realistic
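One way to make "sufficiently granular" concrete is to write each SLO down as a record per failure condition, with explicit RTO and RPO values, and check outcomes against it. This is a hedged sketch only: the class name, field names, and every number below are illustrative assumptions, not figures from the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySlo:
    """One availability objective for one failure condition (hypothetical layout)."""
    failure_condition: str   # e.g. "unplanned, localized"
    rto_minutes: float       # recovery time objective
    rpo_minutes: float       # recovery point objective (tolerable data loss)


def meets_slo(slo: AvailabilitySlo, outage_minutes: float, data_loss_minutes: float) -> bool:
    """True when both the recovery time and the data loss stay within the objectives."""
    return outage_minutes <= slo.rto_minutes and data_loss_minutes <= slo.rpo_minutes


# Illustrative values only; real targets come from the cost of downtime.
node_failure = AvailabilitySlo("unplanned, localized", rto_minutes=5, rpo_minutes=0)
site_failure = AvailabilitySlo("unplanned, site-wide", rto_minutes=60, rpo_minutes=5)

print(meets_slo(node_failure, outage_minutes=3, data_loss_minutes=0))
print(meets_slo(site_failure, outage_minutes=90, data_loss_minutes=2))
```

Unlike "24x7", each record names a failure condition and a measurable target, so it can be tested against before production.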
Manage to the SLOs
Definitions of problem severity levels
Documented targets for both incident response time, and resolution time, based on severity
Classification of applications w.r.t. business criticality
Establish SLA with business
- Negotiated response and resolution times
- Definition of metrics, e.g. "Application Availability shall be measured using the following formula: Total Minutes In A Calendar Month minus Unscheduled Outage Minutes minus Scheduled Outage Minutes in such month, divided by Total Minutes In A Calendar Month"
- Negotiated SLOs
- Effectively documents expectations between IT and business
Incident log: date, time, description, duration, resolution
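The availability formula quoted above translates directly into code. The sketch below follows the wording of the definition; the function name and the sample month are assumptions made for illustration.

```python
def application_availability(total_minutes: float,
                             unscheduled_outage_minutes: float,
                             scheduled_outage_minutes: float) -> float:
    """(total - unscheduled outage - scheduled outage) / total, as a fraction of 1."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    available = total_minutes - unscheduled_outage_minutes - scheduled_outage_minutes
    return available / total_minutes


# Hypothetical 30-day month: 43,200 minutes, 30 unscheduled and 120 scheduled outage minutes.
minutes_in_month = 30 * 24 * 60
availability = application_availability(minutes_in_month, 30, 120)
print(f"{availability:.2%}")
```

For these sample inputs the result is about 99.65%. Note that under this definition scheduled outages count against availability, which is exactly the kind of detail an SLA negotiation should make explicit.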
Example Resolution Time Matrix
Severity 1 Priority 1 and 2 SRs < 1 hour
Severity 1 Priority 3 SRs < 13 Hours
Severity 2 Priority 1 SRs < 14 hours
Severity 2 SRs < 132 hrs
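A matrix like this is easy to encode as a lookup so that breaches can be flagged automatically. The following is a minimal sketch: the hour values are copied exactly as they appear on the slide, while the function names and the use of `None` for "any priority" are assumptions.

```python
from typing import Optional

# (severity, priority) -> target resolution time in hours, as given on the slide.
# A priority of None means the slide states one target for the whole severity.
RESOLUTION_TARGET_HOURS = {
    (1, 1): 1,
    (1, 2): 1,
    (1, 3): 13,
    (2, 1): 14,
    (2, None): 132,
}


def target_hours(severity: int, priority: int) -> Optional[float]:
    """Return the specific target, else the severity-wide one, else None."""
    if (severity, priority) in RESOLUTION_TARGET_HOURS:
        return RESOLUTION_TARGET_HOURS[(severity, priority)]
    return RESOLUTION_TARGET_HOURS.get((severity, None))


def breached(severity: int, priority: int, elapsed_hours: float) -> bool:
    """Targets are strict upper bounds ("< N hours"), so reaching N is a breach."""
    target = target_hours(severity, priority)
    if target is None:
        raise LookupError(f"no target defined for Sev{severity} P{priority}")
    return elapsed_hours >= target


print(breached(1, 3, elapsed_hours=12.5))
print(breached(2, 2, elapsed_hours=140))
```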
BP: TEST, TEST, TEST
Testing is a shared responsibility
Functional, destructive, and stress testing
Test environments must be representative of production
- Both in terms of configuration, and capacity
- Separate from Production
- Building a test harness to mimic production workload is a necessary, but non-trivial effort
Ideally, problems would never be encountered first in production
- If they are, the first question should be: Why didn't we catch the problem in test?
  - Exceeding some threshold
  - Unique timing or race condition
- What can we do so we catch this type of problem in the future?
- Build a test case that can be reused as part of pre-production testing.
BP: Define, document, and adhere to Change Control Processes
This amounts to self discipline
Applies to all changes at all levels of the tech stack
- Hw changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload.
If no changes are introduced, system will reach a steady state, and function forever.
A well designed system will be able to tolerate some fluctuations, and faults.
A well managed system will meet service levels
- If a problem (that was fixed) is encountered again elsewhere, it is a change management process problem, not a technology problem. I.e. rediscovery should not happen.
- Ensure fixes are applied across all nodes in a cluster, and all environments to which the fix applies.
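The last point lends itself to an automated check. As a rough sketch, assuming an inventory of applied fixes per node is already being collected somehow (the fix IDs, node names, and function name below are all hypothetical):

```python
from typing import Dict, Set


def missing_fixes(required: Set[str],
                  applied_by_node: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each node to the required fixes it lacks; fully patched nodes are omitted."""
    gaps = {}
    for node, applied in applied_by_node.items():
        lacking = required - applied
        if lacking:
            gaps[node] = lacking
    return gaps


# Hypothetical inventory: the fix was verified in test but only partly rolled to production.
required_fixes = {"fix-A", "fix-B"}
inventory = {
    "test-node1": {"fix-A", "fix-B"},
    "prod-node1": {"fix-A", "fix-B"},
    "prod-node2": {"fix-A"},
}
print(missing_fixes(required_fixes, inventory))
```

Running a comparison like this across every node and environment is one simple guard against the "implemented in test, never rolled into production" failure described in the case study.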
BP: Monitor your system
Define key metrics and monitor them actively
- Establish a (performance) baseline
Learn how to use Oracle-provided tools
- RDA (+ RACDDT)
- AWR/ADDM
- Active Session History
- OSWatcher
Coordinate monitoring and collection of OS level stats as well as db-level stats
- Problems observed at one layer are often just symptoms of problems that exist at a different layer
- Don't jump to conclusions
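Correlating layers comes down to lining up timestamps. The sketch below shows the idea only: it assumes OS-level samples have already been parsed into records, and the record layout, names, and five-minute window are illustrative assumptions rather than the output format of any Oracle tool.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class OsSample(NamedTuple):
    """One OS-level measurement (hypothetical record layout)."""
    taken_at: datetime
    cpu_pct: float


def samples_near(samples: List[OsSample],
                 event_time: datetime,
                 window: timedelta = timedelta(minutes=5)) -> List[OsSample]:
    """Return the OS samples taken within `window` before or after a database-level event."""
    return [s for s in samples if abs(s.taken_at - event_time) <= window]


# Made-up data: was the host saturated around the time a node was evicted?
os_samples = [
    OsSample(datetime(2019, 8, 11, 10, 0), 35.0),
    OsSample(datetime(2019, 8, 11, 10, 28), 99.0),
    OsSample(datetime(2019, 8, 11, 10, 31), 98.5),
]
eviction_time = datetime(2019, 8, 11, 10, 30)
for sample in samples_near(os_samples, eviction_time):
    print(sample.taken_at.time(), sample.cpu_pct)
```

None of this is possible unless the OS statistics were captured in the first place, which is why the case study lists their absence as a blocker for root cause analysis.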
Summary
Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations
- Address these, and you dramatically increase your chances of a successful RAC deployment, and will save yourself a lot of future pain
Additional areas of challenge
- Configuration Management: initial install and config, standardized gold image deployment
- Incident Management: diagnosing cluster-related problems