8/7/2019 HACMP-and-XDIntro
1/37
IBM System p5 and eServer p5
2006 IBM Corporation
Introduction to
High Availability Cluster Multi-Processing (HACMP)
and
HACMP Extended Distance (HACMP-XD)
Shawn Bodily
ATS HACMP Specialist
IBM
Although hardware is now very reliable, hardware failures account for a small minority of system outages
  Several studies place the proportion between 20% and 45%
  Human error, software error and planned maintenance cause the majority of service outages
Downtime and poor performance are expensive both financially and in terms of customer perceptions
  Overall downtime costs average 3.6% of annual revenue (Infonetics)
  Many studies estimate the average cost of downtime at over $5,000/hour
  Popular Web sites estimate the cost of downtime at millions of dollars
    A 22-hour crash in June 2003 cost eBay an estimated $5M
  Losses go beyond immediate sales revenue
    To clients, availability equates to reliability and trustworthiness
    Internal application failures prevent employees from working
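As a rough worked example of the figures quoted above, the short Python sketch below derives an average hourly cost from the eBay outage and applies the 3.6% average to a company's revenue. The $100M revenue figure is purely hypothetical and only illustrates the arithmetic.

```python
# Figures quoted on the slide.
EBAY_OUTAGE_HOURS = 22
EBAY_OUTAGE_COST_USD = 5_000_000
DOWNTIME_SHARE_OF_REVENUE = 0.036  # 3.6% of annual revenue (Infonetics)


def cost_per_hour(total_cost_usd: float, hours: float) -> float:
    """Average cost per hour over the length of an outage."""
    return total_cost_usd / hours


def annual_downtime_cost(annual_revenue_usd: float) -> float:
    """Yearly downtime cost if the quoted 3.6% average applies."""
    return annual_revenue_usd * DOWNTIME_SHARE_OF_REVENUE


ebay_hourly = cost_per_hour(EBAY_OUTAGE_COST_USD, EBAY_OUTAGE_HOURS)
print(f"eBay outage: about ${ebay_hourly:,.0f} per hour")

# Hypothetical company with $100M annual revenue (illustrative only).
example_cost = annual_downtime_cost(100_000_000)
print(f"3.6% of $100M: about ${example_cost:,.0f} per year")
```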
HACMP - Proven Technology for Business
Mature product now in its 17th major release
Averaging 40,000 licenses sold world-wide annually
Built on a decade of IBM cluster leadership
HACMP allows you to create highly available environments with minimal hardware.
HACMP is scalable up to 32 nodes, allowing your cluster to adapt to the growing demands of your business.
The optional XD feature allows your clusters to span unlimited geographic distances.
HACMP Is NOT the right solution if:
Your environment is not secure
Network security is not in place
Change management procedures are not respected
You do not have trained administrators
Environment is prone to user fiddle faddle
Application requires manual intervention
HACMP will never be an out-of-the-box solution to availability. A certain degree of skill will always be required.
Reducing both Planned and Unplanned downtime
Unplanned Outage
  System Failure
    Hardware
    Operating System Crash
    Power Loss
    User Error
  Component Failure
    NIC
    SCSI/SAN Adapter
    Network Hub/Switch
    SAN Switch
    Disk Failure (both O/S and application data)
Planned Outage
  Maintenance
    System Hardware Change/Upgrade
    OS & Application Upgrades & Fixes
  Testing
    Applied Fixes
    Failure scenarios for HA & DR
HACMP protects against service outages by detecting problems and quickly failing over to backup hardware
Two nodes (A and B)
Two networks
  Private (internal) network
  Public (shared) network
Shared disk
  All data in shared storage available to both nodes
Critical applications
  Database server
  Web server
    Dependent on DB
[Diagram: pSeries nodes A and B connected by a private network, a shared disk, and the company shared network; the Web server and Database run on the nodes]
Example Failure #1: Node failure
Node A fails completely
Node B detects the loss of Node A
Node B starts up its own instance of the Database.
Database is temporarily taken over by Node B until Node A is brought back online
[Diagram: pSeries nodes A and B with shared disk, private network, and company shared network; Web server and Database]
Example Failure #2: Loss of network connection
Node A loses a NIC
Because of NIC redundancy, the service IP swaps locally
Operations continue normally while the problem is resolved
If total public network connectivity were lost, a fallover could occur
[Diagram: pSeries nodes A and B with shared disk, private network, and company shared network; Web server and Database]
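The two failure examples can be summarized as a small decision function. The Python sketch below is only an illustration of the behavior described on these slides, not HACMP's actual event logic; the names `Action` and `decide` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action needed"
    SWAP_IP_LOCALLY = "move the service IP to another NIC on the same node"
    FALLOVER = "move the resources to the backup node"


def decide(node_up: bool, service_nic_up: bool, spare_public_nics: int) -> Action:
    """Pick a reaction to the state of one node (simplified)."""
    if not node_up:
        # Example #1: the node is gone, so the other node takes over.
        return Action.FALLOVER
    if service_nic_up:
        return Action.NONE
    if spare_public_nics > 0:
        # Example #2: NIC redundancy lets the service IP swap locally.
        return Action.SWAP_IP_LOCALLY
    # All public connectivity lost; the slide says a fallover could occur.
    return Action.FALLOVER


print(decide(node_up=True, service_nic_up=False, spare_public_nics=1).value)
print(decide(node_up=False, service_nic_up=False, spare_public_nics=0).value)
```

Treating total loss of public connectivity as an unconditional fallover is a simplification; the slide only says that a fallover could occur.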
Failover possibilities
One to one
One to any
Any to one
Any to any
Custom Resource Groups
Startup Preferences
  Online On Home Node Only (cascading) - (OHNO)
  Online On First Available Node (rotating or cascading w/inactive takeover) - (OFAN)
  Online On All Available Nodes (concurrent) - (OAAN)
  Startup Distribution
Fallover Preferences
  Fallover To Next Priority Node In The List - (FOHP)
  Fallover Using Dynamic Node Priority - (FDNP)
  Bring Offline (On Error Node Only) - (BOEN)
Fallback Preferences
  Fallback To Higher Priority Node - (FBHP)
  Never Fallback - (NFB)
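To make the three policy families concrete, here is a minimal Python model of a resource group. It is a sketch under stated assumptions, not HACMP code: the class and function names are hypothetical, FOHP is interpreted as "the highest-priority node that is still available", and dynamic node priority (FDNP) is left unimplemented because the slide does not describe how it ranks nodes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class Startup(Enum):
    OHNO = "Online On Home Node Only"
    OFAN = "Online On First Available Node"
    OAAN = "Online On All Available Nodes"


class Fallover(Enum):
    FOHP = "Fallover To Next Priority Node In The List"
    FDNP = "Fallover Using Dynamic Node Priority"
    BOEN = "Bring Offline (On Error Node Only)"


class Fallback(Enum):
    FBHP = "Fallback To Higher Priority Node"
    NFB = "Never Fallback"


@dataclass
class ResourceGroup:
    name: str
    nodes: List[str]  # priority order, home node first
    startup: Startup
    fallover: Fallover
    fallback: Fallback


def fallover_target(rg: ResourceGroup, failed: str, available: Set[str]) -> Optional[str]:
    """Node that should take over the group, or None to leave it offline."""
    if rg.fallover is Fallover.BOEN:
        return None
    if rg.fallover is Fallover.FDNP:
        raise NotImplementedError("dynamic node priority is not modeled here")
    for node in rg.nodes:  # FOHP: walk the priority list
        if node != failed and node in available:
            return node
    return None


def should_fall_back(rg: ResourceGroup, current: str, rejoining: str) -> bool:
    """True if the group should move to a node that has just rejoined."""
    if rg.fallback is Fallback.NFB:
        return False
    return rg.nodes.index(rejoining) < rg.nodes.index(current)


db_rg = ResourceGroup("db_rg", ["A", "B", "C"],
                      Startup.OHNO, Fallover.FOHP, Fallback.FBHP)
print(fallover_target(db_rg, failed="A", available={"B", "C"}))
print(should_fall_back(db_rg, current="B", rejoining="A"))
```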
Common Resources to make highly available
Service IP Address(es)
  The IP addresses that users/client apps will use for production
  This can be one or multiple addresses
  Not limited to the number of interfaces when utilizing aliasing
Application (Server)
  Application(s) to be controlled/protected by HACMP
  In many cases this can be a user-provided start/stop script
  May take advantage of pre-packaged application Smart Assists.
Shared Storage
  Volume Groups
  Logical Volumes
  JFS
  NFS
Additional Granular Options
Resource Group Dependencies
  Parent/Child Relationships
    Great for Multi-Tier environments
  Location Dependencies
    Online on Same Node
      All resource groups must be online on the same node
    Online on Different Nodes
      All resource groups must be online on different nodes
    Online on Same Site
      All resource groups must be online on the same site
  Define Resource Group Priorities (Different Node Dep.)
    Low
    Intermediate
    High
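One way to picture these dependencies is as an ordering problem. The Python sketch below assumes that a parent group must be online before its child starts (the usual reading of a parent/child relationship in a multi-tier stack) and checks an "Online on Different Nodes" rule against a proposed placement. Group and node names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# child resource group -> parents it depends on (hypothetical three-tier stack)
PARENTS: Dict[str, Set[str]] = {
    "app_rg": {"db_rg"},
    "web_rg": {"app_rg"},
}


def start_order(parents: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every parent precedes its children."""
    return list(TopologicalSorter(parents).static_order())


def violates_different_nodes(placement: Dict[str, str], groups: List[str]) -> bool:
    """True if any two of the listed groups are placed on the same node."""
    nodes = [placement[g] for g in groups]
    return len(set(nodes)) != len(nodes)


order = start_order(PARENTS)
print("start order:", order)
print("stop order: ", list(reversed(order)))

placement = {"db_rg": "A", "app_rg": "A", "web_rg": "B"}
print(violates_different_nodes(placement, ["db_rg", "app_rg"]))
```

Stopping in the reverse of the start order is a common convention and is shown here as an assumption, not as documented HACMP behavior.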
Application Monitoring
HACMP can monitor applications in one of two ways:
  Process Monitor: determines the death of a process
  Custom Monitor: monitors health of the application using a monitor method you provide
Decisions upon failure
  Restart: Can establish a number of local restarts. After a specified restart count, if the app continues to fail you can escalate to a fallover.
  Notify: Send email notification
  Fallover: Move application and associated resource group to next candidate node.
Suspend/Resume Application Monitoring at any time.
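The restart-then-escalate behavior can be sketched in a few lines of Python. This is an illustrative model only: the class `AppMonitor`, its fields, and the action strings are hypothetical and do not correspond to HACMP configuration names.

```python
from dataclasses import dataclass


@dataclass
class AppMonitor:
    restart_limit: int                # local restarts allowed before escalating
    failure_action: str = "fallover"  # or "notify"
    restarts_done: int = 0
    suspended: bool = False

    def on_check(self, healthy: bool) -> str:
        """Return the action to take after one health check."""
        if self.suspended or healthy:
            return "none"
        if self.restarts_done < self.restart_limit:
            self.restarts_done += 1
            return "restart"
        # Restart count exhausted: escalate as configured.
        return self.failure_action


monitor = AppMonitor(restart_limit=2)
actions = [monitor.on_check(healthy=False) for _ in range(3)]
print(actions)

monitor.suspended = True  # monitoring can be suspended at any time
print(monitor.on_check(healthy=False))
```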
DLPAR/CUoD configuration
HACMP on the primary machine detects the failure
Running in a partition on another server, HACMP grows the backup partition, activates the required inactive processors and restarts the application
[Diagram: a Production Database Server and a DLPAR/CUoD Server (running Web Server and Order Entry applications on active processors, with inactive processors in reserve); both run HACMP and attach to shared disk]
Recent HACMP releases greatly improve ease of use
Enhancements include:
  Configuration wizard for typical two-node cluster
  Automatic detection and configuration of IP networks
  Online Planning Worksheet guides you through configuration
  Simplified Web-based interface for management and monitoring
[Screenshot: Online Planning Worksheets for Resource Groups]
With HACMP V5.x, you can configure a cluster in just five questions
1. What is the address of the backup node?
2. What is the name of the application?
3. What script should HACMP use to start it?
4. What script should HACMP use to stop it?
5. What is the service IP label that clients will use to access the application?
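The five answers amount to a small configuration record. The Python sketch below shows one way to capture and sanity-check them before running the assistant; the class name, field names, host name, and script paths are all hypothetical and are not HACMP identifiers.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class TwoNodeAnswers:
    backup_node_address: str  # question 1
    application_name: str     # question 2
    start_script: str         # question 3
    stop_script: str          # question 4
    service_ip_label: str     # question 5

    def missing(self) -> List[str]:
        """Names of any answers that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


answers = TwoNodeAnswers(
    backup_node_address="nodeb.example.com",
    application_name="orders_db",
    start_script="/usr/local/ha/start_orders_db.sh",
    stop_script="/usr/local/ha/stop_orders_db.sh",
    service_ip_label="orders_svc",
)
print(answers.missing())
```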
WebSMIT Overview Demo
HACMP Cluster Test Tool
The Cluster Test Tool reduces implementation costs by simplifying validation of cluster functionality.
It reduces support costs by automating testing of an HACMP cluster to ensure correct behavior in the event of a real cluster failure.
The Cluster Test Tool executes a test plan, which consists of a series of individual tests.
Tests are carried out in sequence and the results are analyzed by the test tool.
Administrators may define a custom test plan or use the automated test procedure.
Test results and other important data are collected in the test tool's log file.
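The idea of a test plan executed in sequence with results written to a log can be illustrated with a generic Python runner. This is not the Cluster Test Tool or its plan format; `run_test_plan` and the stand-in tests are hypothetical.

```python
import io
from typing import Callable, List, TextIO, Tuple

PlanStep = Tuple[str, Callable[[], bool]]  # (test name, test function)


def run_test_plan(plan: List[PlanStep], log: TextIO) -> List[Tuple[str, bool]]:
    """Run each test in order, record PASS/FAIL/ERROR in the log, return results."""
    results = []
    for name, test in plan:
        try:
            passed = bool(test())
        except Exception as exc:  # a crashing test counts as a failure
            log.write(f"{name}: ERROR {exc}\n")
            passed = False
        else:
            log.write(f"{name}: {'PASS' if passed else 'FAIL'}\n")
        results.append((name, passed))
    return results


# Stand-in tests; a real plan would exercise the cluster itself.
plan: List[PlanStep] = [
    ("node_down_graceful", lambda: True),
    ("network_down_local", lambda: False),
]
log_file = io.StringIO()
results = run_test_plan(plan, log_file)
print(results)
print(log_file.getvalue(), end="")
```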
New features make HACMP V5.X easier to use and more flexible
Automatic detection and correction of common cluster configuration problems
Enhanced support for complex multi-tier applications, relationships and dependencies
Clusters can be configured with simple ASCII files
Parallel resource processing recovers applications faster
Simpler, more flexible configuration and management
New Smart-Assists simplify HACMP implementation in DB2, Oracle and WebSphere environments
  Inexpensive option includes all three Smart-Assists
HACMP with Oracle 10g fallover Demo
(1) p52A
(1) p505
(1) HMC
HACMP 5.4
AIX 5.3 TL5
Oracle 10g
DS4300
LPARMon (http://www.alphaworks.ibm.com/tech/lparmon)
Swingbench (http://www.dominicgiles.com/swingbench.html)
Web-based System Manager
The cluster shown was actually created using the two-node configuration assistant within HACMP.
HACMP Extended Distance (HACMP-XD)
HA/DR is a balance of recovery time requirements and cost
Do you really need HA or DR?
What is the target recovery time?
  Minutes? Hours? Days?
Costs associated with implementing and maintaining an HA or DR solution
  Redundant hardware
  Inter-site networking
  Operations staff
Tiers of Disaster Recovery: Level Setting HACMP/XD
Tiers based on SHARE definitions
  Tier 7 - Highly automated, business wide, integrated solution (Example: GDPS/PPRC/VTS P2P, AIX HACMP/XD, OS/400 HABP ...)
  Tier 6 - Storage mirroring (example: XRC, PPRC, VTS Peer to Peer)
  Tier 5 - Software two site, two phase commit (transaction integrity)
  Tier 4 - Batch/Online database shadowing & journaling, Point in Time disk copy (FlashCopy), TSM-DRM
  Tier 3 - Electronic Vaulting, TSM**, Tape
  Tier 2 - PTAM*, Hot Site, TSM**
  Tier 1 - PTAM*
HACMP/XD fits in here [callout on the chart; HACMP/XD is named among the Tier 7 examples]
[Chart: Value plotted against Recovery Time (15 Min., 1-4 Hr., 4-8 Hr., 8-12 Hr., 12-16 Hr., 24 Hr., Days)]
  Application bands: low tolerance to outage; somewhat tolerant to outage; very tolerant to outage
  Data recreation labels: zero or near zero; minutes to hours; up to 24 hours; 24-48 hours
Best D/R practice is to blend tiers of solutions in order to maximize application coverage at lowest possible cost. One size, one technology, or one methodology doesn't fit all applications.
* PTAM = Pickup Truck Access Method with Tape
** TSM = Tivoli Storage Manager
*** GDPS = Geographically Dispersed Parallel Sysplex
HACMP Extended Distance (XD) is an optional component for cross-site geographic disaster recovery
Backup systems may be physically separate from primary operations for protection in the event of power failure, flood, earthquake etc.
The XD option provides a basket of disaster recovery capabilities and integration points
XD provides multiple options:
  IP-based data mirroring (GLVM, HAGEO)
  Support for hardware-based data mirroring (Metro-Mirror/PPRC)
HACMP XD: Extended Distance for Disaster Recovery
Data replication between sites ensures a copy of the data is available after a site-wide disaster
Choice of technology depends on distance and performance requirements
Campus-wide: use LVM Split Site Mirroring
[Diagram: two sites connected by a SAN and a LAN/MAN]
HACMP XD: Extended Distance for Disaster Recovery
Metro-wide: use SVC or ESS/PPRC Mirroring
[Diagram: Servers A, B, C and D across a Production Site and a Recovery Site, linked by routers; a primary ESS/DS is mirrored to a secondary ESS/DS with PPRC/Metro Mirror or eRCMF; alternatively, SVC mirroring between an SVC at each site]
HACMP XD: Extended Distance for Disaster Recovery
Unlimited distance: use GLVM Mirroring
A subset of disks is defined as Remote Physical Volumes (RPVs)
The RPV driver replicates data over the WAN
Both sites always have a complete copy of all mirrors
[Diagram: an LVM mirrored volume group with Mirror 1 and Mirror 2, each holding copy 1 and copy 2]
New HACMP Geographic Logical Volume Manager is a reliable, easy-to-use data mirroring and failover capability
GLVM provides unlimited-distance IP-based data mirroring
Fully integrated with AIX 5L logical volume management
Easier to use than existing HAGEO solution
  No need to define and manage separate state maps
  Long-term replacement for HAGEO
Automatically reverses direction of data replication on failover
Supports all IBM TotalStorage products certified with base HACMP
HACMP XD: HACMP automates the solution
HACMP integrates support for all the replication options
  Manages data replication direction, switching and resync after recovery
  Recovers locally or moves entire application to backup site
Common infrastructure supports all solutions
  Choose the one that meets your performance and distance requirements
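The pairing of distance class and replication technology given on the preceding XD slides can be written down as a simple lookup. The Python sketch below restates only what those slides say; the names `Reach` and `options_for` are hypothetical, and no distance thresholds are implied because the slides give none.

```python
from enum import Enum
from typing import Dict, List


class Reach(Enum):
    CAMPUS = "campus-wide"
    METRO = "metro-wide"
    UNLIMITED = "unlimited distance"


# Pairings as stated on the XD slides.
REPLICATION_OPTIONS: Dict[Reach, List[str]] = {
    Reach.CAMPUS: ["LVM split-site mirroring"],
    Reach.METRO: ["SVC mirroring", "ESS/PPRC (Metro Mirror)"],
    Reach.UNLIMITED: ["GLVM mirroring"],
}


def options_for(reach: Reach) -> List[str]:
    """Replication technologies the slides suggest for a given reach."""
    return list(REPLICATION_OPTIONS[reach])


for reach in Reach:
    print(f"{reach.value}: {', '.join(options_for(reach))}")
```

Performance requirements also influence the choice, as the slides note, but they are not modeled here.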
Thank You
Questions?
Backup Slides on Networking
Typical Local HACMP Clustering Configuration
A single network view on a common subnet. Multiple networks can be used.
[Diagram: two nodes, each with interfaces en0 and en1, connected to two switches on subnet 10.70.10.x]
HACMP Clustering Across Sites
Different subnets, routers connected to allow cross-subnet communications
[Diagram: nodes with interfaces en0 and en1 connect to a pair of switches on subnet 10.70.10.x; a second pair of switches serves subnet 10.50.10.x; two routers join the subnets]
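The difference between the two networking layouts comes down to whether the node addresses share a subnet. The Python sketch below checks this with the standard `ipaddress` module. The /24 prefix length and the host addresses ending in .11 and .12 are assumptions made for illustration; the slides show only 10.70.10.x and 10.50.10.x.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, prefix_len: int = 24) -> bool:
    """True if both addresses fall in the same network of the given prefix length."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefix_len}").network
    return net_a == net_b


# Local cluster: both nodes on 10.70.10.x (host parts are hypothetical).
local = same_subnet("10.70.10.11", "10.70.10.12")
print("local cluster shares a subnet:", local)

# Cross-site cluster: 10.70.10.x at one site, 10.50.10.x at the other,
# so traffic between the sites has to pass through the routers.
cross_site = same_subnet("10.70.10.11", "10.50.10.11")
print("cross-site nodes share a subnet:", cross_site)
```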