© 2014 IBM Corporation
IBM Systems and Technology Group
GDPS/Active-Active Overview (1.4)
David Petersen
Trademarks
The following are trademarks of the International Business Machines Corporation in the United States and/or other countries.
IBM*, IBM (logo)*, ibm.com*, AIX*, DB2*, DS6000, DS8000, Dynamic Infrastructure*, ESCON*, FlashCopy*, GDPS*, HyperSwap, Parallel Sysplex*, POWER5, Redbooks*, Sysplex Timer*, System p*, System z*, Tivoli*, z/OS*, z/VM*
* Registered trademarks of IBM Corporation
The following are trademarks or registered trademarks of other companies.
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.
Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
InfiniBand is a trademark and service mark of the InfiniBand Trade Association.
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office.
IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency, which is now part of the Office of Government Commerce.
* All other products may be trademarks or registered trademarks of their respective companies.
Notes: Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.
IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.
All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics will vary depending on individual customer configurations and conditions.
This publication was produced in the United States. IBM may not offer the products, services or features discussed in this document in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the product or services available in your area.
All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
Prices subject to change without notice. Contact your IBM representative or Business Partner for the most current pricing in your geography.
Agenda
Level set
Requirements
Concepts
Configurations
Sample Scenarios
Use Cases
Summary
Suite of GDPS service products to meet various business requirements for availability and disaster recovery
Continuous Availability of Data within a Data Center (GDPS/PPRC HM)
– Single Data Center
– Applications remain active
– Continuous access to data in the event of a storage outage
– RPO=0 [RTO secs] for disk only
Continuous Availability with DR within Metropolitan Region (GDPS/PPRC)
– Two Data Centers
– Systems remain active
– Multi-site workloads can withstand site and/or storage failures
– RPO=0, RTO mins (<20km) / RTO<1h (>20km)
Disaster Recovery Extended Distance (GDPS/GM & GDPS/XRC)
– Two Data Centers
– Rapid Systems D/R w/ “seconds” of data loss
– Disaster Recovery for out of region interruptions
– RPO secs, RTO<1h
CA Regionally and Disaster Recovery Extended Distance (GDPS/MGM & GDPS/MzGM)
– Three Data Centers
– High availability for site disasters
– Disaster recovery for regional disasters
– RPO=0, RTO mins/<1h & RPO secs, RTO<1h
RPO – recovery point objective; RTO – recovery time objective
1. Identify clearing and settlement activities in support of critical financial markets
2. Determine appropriate recovery and resumption objectives for clearing and settlement activities in support of critical markets
– core clearing and settlement organizations should develop the capacity to recover and resume clearing and settlement activities within the business day on which the disruption occurs with the overall goal of achieving recovery and resumption within two hours after an event
3. Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives
– Back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location
– The effectiveness of back-up arrangements in recovering from a wide-scale disruption should be confirmed through testing
4. Routinely use or test recovery and resumption arrangements
– One of the lessons learned from September 11 is that testing of business recovery arrangements should be expanded
Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System [Docket No. R-1128] (April 7, 2003)
How Much Interruption can your Business Tolerate?
Standby
Active/Active
Ensuring Business Continuity:
Disaster Recovery
– Restore business after an unplanned outage
High Availability
– Meet Service Availability objectives, e.g., 99.9% availability or 8.8 hours of down-time a year
Continuous Availability
– No downtime (planned or not)
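The availability figure quoted above converts to yearly downtime with simple arithmetic. A minimal sketch, assuming a 365-day year; the function name is illustrative and not part of any GDPS tooling:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# 99.9% availability allows roughly 8.76 hours per year,
# which the slide rounds to 8.8 hours.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% availability -> {annual_downtime_hours(target):.2f} hours/year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the step from High Availability to Continuous Availability is so demanding.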
What is the cost of 1 hour of downtime during core business hours?
Enterprises that operate across time zones no longer have any ‘off-hours’ window; Continuous Availability is required
Disruptions affect more than the bottom line …
… with enormous impact on the business
– Downtime costs can equal up to 16 percent of revenue [1]
– 4 hours of downtime is severely damaging for 32 percent of organizations [2]
– Data is growing at explosive rates – growing from 161EB in 2006 to 988EB in 2010 [3]
– Some industries impose fines for downtime and inability to meet regulatory compliance
– Downtime ranges from 300–1,200 hours per year, depending on industry [1]
August 18, 2013: Google total eclipse sees 40 percent drop in Internet traffic
August 22, 2013: Nasdaq: ‘Connectivity issue’ led to three-hour shutdown
July 20, 2013: DMV Computers Fail Statewide, Police Can’t Access Database
April 16, 2013: American Airlines Grounds Flights Nationwide
[1] Infonetics Research, The Costs of Enterprise Downtime: North American Vertical Markets 2005, Rob Dearborn and others, January 2005
[2] Continuity Central, “Business Continuity Unwrapped,” 2006, http://www.continuitycentral.com/feature0358.htm
[3] The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010, IDC white paper #206171, March 2007
Evolving customer requirements
Shift focus from failover model to near-continuous availability model (RTO near zero)
Access data from any site (unlimited distance between sites)
Multi-sysplex, multi-platform solution
– “Recover my business rather than my platform technology”
Ensure successful recovery via automated processes (similar to GDPS technology today)
– Can be handled by less-skilled operators
Provide workload distribution between sites (route around failed sites, dynamically select sites based on ability of site to handle additional workload)
Provide application level granularity
– Some workloads may require immediate access from every site, other workloads may only need to update other sites every 24 hours (less critical data)
– Current solutions employ an all-or-nothing approach (complete disk mirroring, requiring extra network capacity)
From High Availability to Continuous Availability
GDPS/Active-Active is for mission-critical workloads that have stringent recovery objectives that cannot be achieved using existing GDPS solutions
– RTO approaching zero, measured in seconds for unplanned outages
– RPO approaching zero, measured in seconds for unplanned outages
– Non-disruptive site switch of workloads for planned outages
– At any distance
Active-Active is NOT intended to substitute for a local availability solution such as Parallel Sysplex
                 GDPS/PPRC        GDPS/XRC or GDPS/GM    GDPS/Active-Active
Model            Failover         Failover               Near CA
Recovery time    ≈ 2 min          < 1 hour               < 1 minute
Distance         < 20 km          Unlimited              Unlimited
There are multiple GDPS service products under the GDPS solution umbrella to meet various customer requirements for Availability and Disaster Recovery
Continuous Availability of Data within a Data Center (GDPS/PPRC HM)
– Single Data Center
– Applications remain active
– Continuous access to data in the event of a storage outage
– RPO=0 [RTO secs] for disk only
Continuous Availability with DR within Metropolitan Region (GDPS/PPRC)
– Two Data Centers
– Systems remain active
– Multi-site workloads can withstand site and/or storage failures
– RPO=0, RTO mins (<20km) / RTO<1h (>20km)
Disaster Recovery Extended Distance (GDPS/GM & GDPS/XRC)
– Two Data Centers
– Rapid Systems D/R w/ “seconds” of data loss
– Disaster Recovery for out of region interruptions
– RPO secs, RTO<1h
CA Regionally and Disaster Recovery Extended Distance (GDPS/MGM & GDPS/MzGM)
– Three Data Centers
– High availability for site disasters
– Disaster recovery for regional disasters
– RPO=0, RTO mins/<1h & RPO secs, RTO<1h
CA, DR, & Cross-site Workload Balancing Extended Distance (GDPS/Active-Active)
– Two or more Active Data Centers
– Automatic workload switch in seconds; seconds of data loss
– RPO secs, RTO secs
RPO – recovery point objective; RTO – recovery time objective
Active/Active Sites concept
Two or more sites, separated by unlimited distances, running the same applications and having the same data to provide:
– Cross-site Workload Balancing
– Continuous Availability
– Disaster Recovery
Data at geographically dispersed sites kept in sync via replication
Monitoring spans the sites and now becomes an essential element of the solution for site health checks, performance tuning, etc.
Workloads are managed by a client and routed to one of many replicas, depending upon workload weight and latency constraints; extends workload balancing to sysplexes across multiple sites
[Diagram labels: Connections, Workload Distributor, Replication]
Active/Active Sites Configurations
Configurations
1. Active/Standby – GA date 30th June 2011
2. Active/Query – GA date 31st October 2013
3. …
A configuration is specified on a workload basis
A workload is the aggregation of these components
– Software: user-written applications (e.g., COBOL programs) and the middleware runtime environment (e.g., CICS regions, InfoSphere Replication Server instances and DB2 subsystems)
– Data: related set of objects that must preserve transactional consistency and optionally referential integrity constraints (such as DB2 tables, IMS databases and VSAM files)
– Network connectivity: one or more TCP/IP addresses & ports (e.g., 10.10.10.1:80)
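The three components of a workload can be pictured as a small data structure. The sketch below is only an illustration of the definition above: the class and field names are hypothetical and do not correspond to any GDPS/Active-Active policy format.

```python
from dataclasses import dataclass, field

# Hypothetical names for illustration; not a GDPS definition format.
VALID_CONFIGURATIONS = ("active/standby", "active/query")


@dataclass(frozen=True)
class Endpoint:
    address: str
    port: int


@dataclass
class Workload:
    name: str
    configuration: str                                        # specified per workload
    software: list[str] = field(default_factory=list)         # applications and middleware
    data: list[str] = field(default_factory=list)             # objects kept transactionally consistent
    endpoints: list[Endpoint] = field(default_factory=list)   # TCP/IP addresses and ports

    def validate(self) -> None:
        if self.configuration not in VALID_CONFIGURATIONS:
            raise ValueError(f"unknown configuration: {self.configuration}")
        if not self.endpoints:
            raise ValueError("a workload needs at least one TCP/IP address and port")


wkld = Workload(
    name="wkld1",
    configuration="active/standby",
    software=["COBOL programs", "CICS regions", "DB2 subsystems"],
    data=["DB2 tables"],
    endpoints=[Endpoint("10.10.10.1", 80)],
)
wkld.validate()
print(wkld.name, wkld.configuration, wkld.endpoints[0])
```

Keeping the configuration on the workload, rather than on the site, mirrors the point that one site can be active for some workloads and standby for others.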
Active/Standby configuration
This is a fundamental paradigm shift from a failover model to a continuous availability model
[Diagram labels: Connections; Workload Distributor (static routing, automatic failover); Applications A and B active in one site and standby in the other (site1, site2); Replication of IMS, DB2 and VSAM data between the sites; queued]
Active/Query configuration
Read-only or query connections to be routed to both sites, while update connections are routed only to the active site
Appl A (gold) is in active/standby configuration
• performing updates in active site [site2]
Appl B (grey) is in active/query configuration
• using same data as Appl A but read only
• active to both site1 & site2, but favor site1
• Appl B query routing according to Appl A latency policy
[A] latency=2; as latency is less than “max latency”, follow policy to skew queries to site1
[A] latency>5; as “max latency” policy has been exceeded, route all queries to site2
[A] latency<3; as latency is less than “reset latency” policy, route more queries to site1
[Diagram labels: Connections; Workload Distributor; site1 and site2; Replication of IMS, DB2 and VSAM data]
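The latency policy for query routing behaves like a switch with two thresholds. The sketch below uses the values shown on the slide (a "max latency" of 5 and a "reset latency" of 3, units as on the slide). The class is hypothetical, and the behaviour between the two thresholds (keep the previous decision) is an assumption, not something the slide states.

```python
from dataclasses import dataclass


@dataclass
class QueryRouter:
    """Routes Appl B queries based on Appl A replication latency (illustrative only)."""
    max_latency: float = 5.0     # "max latency" policy value from the slide
    reset_latency: float = 3.0   # "reset latency" policy value from the slide
    exceeded: bool = False       # remembers whether max latency was exceeded

    def route(self, latency: float) -> str:
        if latency > self.max_latency:
            self.exceeded = True      # replicated data too stale: use the active site only
        elif latency < self.reset_latency:
            self.exceeded = False     # replication has caught up: favor site1 again
        # Assumption: between the two thresholds the previous decision is kept.
        return "all queries to site2" if self.exceeded else "skew queries to site1"


router = QueryRouter()
for latency in (2, 6, 4, 2.5):
    print(latency, "->", router.route(latency))
```

Using two thresholds rather than one avoids flapping between sites when latency hovers around a single limit.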
What is a GDPS/Active-Active environment?
Two Production Sysplex environments (also referred to as sites) in different locations
– One active, one standby – for each defined update workload, and potential query workload active in both sites
– Software-based replication between the two sysplexes/sites
• DB2, IMS and VSAM data is supported
Two Controller Systems
– Primary/Backup
– Typically one in each of the production locations, but there is no requirement that they are co-located in this way
Workload balancing/routing switches
– Must be Server/Application State Protocol (SASP) compliant
• RFC 4678 describes SASP
– What switches/routers are SASP-compliant? … the following are those we know about
• Cisco Catalyst 6500 Series Switch Content Switching Module
• F5 Big IP Switch
• Citrix NetScaler Appliance
• Radware Alteon Application Switch (bought Nortel appliance line)
Sample scenario – both sites active for individual workloads
AA Controllers: [AAC1] Primary, [AAC2] Backup
Site 1: Sysplex-A1 [AAPLEX1]
– A1 Prod-sys [A1P1]: wkld1 active, wkld2 standby, wkld3 active
– A1 Prod-sys [A1P2]: wkld1 active, wkld3 active
Site 2: Sysplex-A2 [AAPLEX2]
– A2 Prod-sys [A2P1]: wkld1 standby, wkld2 active, wkld3 standby
– A2 Prod-sys [A2P2]: wkld1 standby, wkld3 standby
S/W Replication between the sites: DB2, VSAM, IMS
Network: SASP-compliant routers; LB 1° Tier (F5); LB 2° Tier (Sysplex Distributor)
Routing for WKLD 1 & 3 (active in Site 1); routing for WKLD 2 (active in Site 2)
What S/W makes up a GDPS/Active-Active environment?
Integration of a number of software products
GDPS/Active-Active
IBM Tivoli NetView Monitoring for GDPS, which pre-reqs:
– IBM Tivoli NetView for z/OS
• IBM Tivoli NetView for z/OS Enterprise Management Agent (NetView agent) – separately orderable
System Automation for z/OS
IBM Multi-site Workload Lifeline for z/OS
Middleware – DB2, IMS, CICS…
Replication Software
– IBM InfoSphere Data Replication for DB2 for z/OS (IIDR for DB2)
– IBM InfoSphere Data Replication for IMS for z/OS (IIDR for IMS)
– IBM InfoSphere Data Replication for VSAM for z/OS (IIDR for VSAM)
Optionally the Tivoli OMEGAMON XE monitoring products
– Individually or part of a suite
Automation – deeper insight
All components of a Workload should be defined in SA* as
– One or more Application Groups (APG)
– Individual Applications (APL)
The Workload itself is defined as an Application Group
SA z/OS keeps track of the individual members of the Workload's APG and reports a “compound” status to the A/A Controller
* Note that although SA is required on all systems, you can be using an alternative automation product to manage your workloads.
[Diagram: MYWORKLOAD/APG (DDS=OnDemand | Asis) with HasParent (HP) relationships among CICS/APG (CICSTOR, CICSAOR), CAPTURE, APPLY, DB2 and TCP/IP]
Legend
DDS: Default Desired Status
HP: HasParent
OnDemand: Resource is UNAVAILABLE at IPL time
Asis: Resource is kept in the state it is at IPL time
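One way to picture a "compound" status is as a roll-up of the member statuses of the Workload's Application Group. The sketch below is purely illustrative: the status names and the worst-status-wins rule are assumptions made for this example, not the actual SA z/OS status model.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical member statuses, ordered from best to worst."""
    SATISFACTORY = 0
    DEGRADED = 1
    PROBLEM = 2


def compound_status(members: dict[str, Status]) -> Status:
    """Assumed roll-up rule: the group is only as healthy as its worst member."""
    return max(members.values(), default=Status.SATISFACTORY)


# Member names taken from the MYWORKLOAD/APG diagram.
my_workload = {
    "CICSTOR": Status.SATISFACTORY,
    "CICSAOR": Status.SATISFACTORY,
    "CAPTURE": Status.DEGRADED,
    "APPLY": Status.SATISFACTORY,
    "DB2": Status.SATISFACTORY,
    "TCP/IP": Status.SATISFACTORY,
}
print(compound_status(my_workload).name)
```

A single roll-up value lets the controller reason about the Workload as one unit without tracking every member itself.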
Automation – sharing components between workloads
Certain components of a Workload, for instance DB2, could also be viewed as “infrastructure”
Relationship(s) from the Workload ensure that the supporting “infrastructure” resources are available when needed
Infrastructure is typically started at IPL time
[Diagram: MYWORKLOAD_2/APG containing CICS/APG (CICSTOR, CICSAOR), CAPTURE and APPLY, with HasParent (HP) relationships to the Infrastructure resources DB2, TCP/IP, JES, MQ and LifelineAgt]
Automation – sharing components between workloads
Shared members
Other components of a Workload, for instance capture and apply engines, can also be shared
However, GDPS requires that they are members of the Workload
Rationale
The A/A Controller needs to know the capture and apply engines that belong to a Workload in order to
– Quiesce work properly including replication
– Send commands to them
[Diagram: MYWKL_31/APG (with CICS31/APG) and MYWKL_32/APG (with CICS32/APG) sharing CAPTURE and APPLY, with HasParent (HP) relationships to the Infrastructure resources DB2, TCP/IP, JES, MQ and LifelineAgt]
Software replication – Deeper Insight
[Diagram: SOURCE1, SOURCE2 and SOURCE3 write to the Log; Capture reads the Log and sends to Apply, which updates TARGET1, TARGET2 and TARGET3; callouts 1 to 5 mark the steps below]
1. Transaction committed
2. Capture reads the DB updates from the log
3. Capture sends the updates to Apply
4. Apply receives the updates from Capture
5. Apply applies the DB updates to the target databases
Replication latency (E2E) spans Capture latency, Network latency and Apply latency
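The latency components can be derived from timestamps taken at the numbered steps. The sketch below assumes that capture latency runs from commit (step 1) to send (step 3), network latency from send to receive (step 4), and apply latency from receive to apply (step 5); the slide does not state these boundaries explicitly, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReplicationTimestamps:
    """Seconds since an arbitrary origin, one per numbered step."""
    committed: float   # 1. transaction committed
    captured: float    # 2. Capture reads the updates from the log
    sent: float        # 3. Capture sends the updates to Apply
    received: float    # 4. Apply receives the updates
    applied: float     # 5. Apply applies the updates to the target


def latencies(t: ReplicationTimestamps) -> dict[str, float]:
    # Assumed boundaries: capture = 1 to 3, network = 3 to 4, apply = 4 to 5.
    return {
        "capture": t.sent - t.committed,
        "network": t.received - t.sent,
        "apply": t.applied - t.received,
        "end_to_end": t.applied - t.committed,
    }


sample = ReplicationTimestamps(committed=0.0, captured=0.25, sent=0.5, received=0.75, applied=1.25)
result = latencies(sample)
print(result)
```

With these boundaries the three components always add up to the end-to-end replication latency, which is the figure that bounds the achievable RPO.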
Connectivity – deeper insight
1. Advisor to Agent
2. Advisor to LBs
3. Advisor to Advisor
4. Advisor to SEs
5. Advisor NMI
[Diagram: site-1 and site-2; Primary Controller and Secondary Controller, each with Lifeline Advisor and NetView; 1st Tier LB; 2nd Tier LBs; SEs; SYSPLEX 1 and SYSPLEX 2 forming the Application / Database Tier, with systems SYS-A, SYS-B, SYS-C and SYS-D each running TCP/IP, a Lifeline Agent and server applications; callouts 1 to 5 mark the connections listed above]
GDPS/A-A configuration
Primary Controller [AAC1]: NetView Master, LLAdvisor Primary, TEMA; GDPS Web Interface and TEP Interface
Backup Controller [AAC2]: NetView Backup, LLAdvisor Secondary, TEMA
Site 1: A1 Production 1 [A1P1] and A1 Production 2 [A1P2]
– Each system: LLAgent, MQ / TCPIP, DB2 Rep, IMS Rep, CICS/DB2 Appl, IMS Appl
– Workload 1 Active, Workload 3 Active
Site 2: A2 Production 1 [A2P1] and A2 Production 2 [A2P2]
– Each system: LLAgent, MQ / TCPIP, DB2 Rep, IMS Rep, CICS/DB2 Appl, IMS Appl
– Workload 1 Standby, Workload 3 Standby
S/W Replication between the sites: DB2, IMS
Network: LB 1° Tier (F5); LB 2° Tier (Sysplex Distributor)
VSAM replication resources not shown for clarity's sake
Planned workload switch
[Diagram: the switch is initiated from the GDPS Web Interface (“Initiate [click here]”; the TEP Interface is also shown); AAC1 is the primary controller and AAC2 the backup; WKLD1 runs as CICS/DB2 applications with DB2 replication on A1P1 and A1P2 in AAPLEX1 (Site1) and on A2P1 and A2P2 in AAPLEX2 (Site2), and its active and standby roles are exchanged between the sites; SWITCH ROUTING is performed in the Network layer (LB 1° Tier F5, LB 2° Tier Sysplex Distributor); DB2, VSAM and IMS data is replicated between the sites]
Note: multiple workloads and needed infrastructure resources are not shown for clarity's sake
Unplanned site failure
Failure Detection Interval = 60 sec
SITE_FAILURE = Automatic
Automatic switch: STOP ROUTING, START ROUTING
[Diagram: AAC1 is the primary controller and AAC2 the backup, with the TEP Interface and GDPS Web Interface; WKLD1 runs as CICS/DB2 applications with DB2 replication on A1P1 and A1P2 in AAPLEX1 (Site1) and on A2P1 and A2P2 in AAPLEX2 (Site2); after the failure the standby copy of the workload becomes active; replication updates are queued; Network layer with LB 1° Tier (F5) and LB 2° Tier (Sysplex Distributor); DB2, VSAM and IMS data in each site]
Note: multiple workloads and needed infrastructure resources are not shown for clarity's sake
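The two policy values on this slide suggest a simple decision rule. The sketch below assumes that the Failure Detection Interval is the time a site may go without a healthy signal before it is treated as failed, and that a non-automatic setting would hand the decision to an operator. Both points are assumptions, and the names are hypothetical rather than GDPS parameters or APIs.

```python
from dataclasses import dataclass


@dataclass
class SiteFailurePolicy:
    failure_detection_interval: float = 60.0  # seconds, value from the slide
    site_failure: str = "Automatic"           # value from the slide


def decide(policy: SiteFailurePolicy, seconds_since_last_healthy: float) -> str:
    """Return the action for the active site's workloads (illustrative only)."""
    if seconds_since_last_healthy < policy.failure_detection_interval:
        return "keep routing"
    if policy.site_failure == "Automatic":
        # Corresponds to STOP ROUTING on the failed site, START ROUTING on the other.
        return "switch routing to standby site"
    # Assumption: any other setting leaves the decision to an operator.
    return "prompt operator"


policy = SiteFailurePolicy()
for elapsed in (10, 59, 60, 120):
    print(elapsed, "->", decide(policy, elapsed))
```

A detection interval trades recovery time against the risk of switching on a transient problem, so the 60 seconds adds directly to the RTO of an unplanned switch.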
Go Home scenario
After an unplanned workload/site outage
Note: there is the potential for transactions to have been stranded in the failed site: they had completed execution and committed data to the database at the time of the failure, but this data had not been replicated to the standby site. Assume the data is still available on the disk subsystems.
– Start the site or workload that had failed
– Restart replication from the site brought back online to the currently active site - this delivers any stranded changes resulting from the unplanned outage (*)
– Re-synchronize the recovering site with data from the currently active site, by starting replication in the other direction
– Re-direct the workload, once the recovered site is operational and can process workloads
After a planned workload/site outage
Note: as the process to perform a planned site switch ensures that there are no stranded updates in the active site at the start of the switch, there is no need to start replication in the opposite direction in order to deliver stranded updates.
– Start the site or workload that had been stopped
– Re-synchronize the restarted site or workload with data from the currently active site, by starting replication from the active to the now standby site
– Re-direct the workload, once the restarted site is operational, data replication has caught up, and the site can process workloads
(*) Attempts to apply the stranded changes to the data in the active site may result in an exception or conflict, as the before image of the stranded update will no longer match the updated value in the active site. For IMS replication, the adaptive apply process will discard the update and issue messages to indicate that there has been a conflict and an update has been discarded. For DB2 replication, the update may not be applied, depending on conflict handling policy settings, and additionally an exception record will be inserted into a table.
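The conflict described in the footnote comes from comparing the before image carried by a stranded update with the value currently held at the active site. A minimal sketch of that check, using a plain dictionary as the target; the function, the policy names "ignore" and "force", and the exception record layout are hypothetical and not the IIDR implementation:

```python
def apply_update(target, key, before, after, policy="ignore", exceptions=None):
    """Apply one replicated update; return 'applied', 'forced' or 'discarded'."""
    current = target.get(key)
    if current == before:
        # Before image matches: no conflicting change was made at the target.
        target[key] = after
        return "applied"
    # Conflict: the target was updated after the stranded change was committed.
    if exceptions is not None:
        exceptions.append({"key": key, "expected": before, "found": current, "update": after})
    if policy == "force":
        target[key] = after
        return "forced"
    return "discarded"


active_site = {"acct-1": 150}   # already updated at the active site
exceptions = []
outcome = apply_update(active_site, "acct-1", before=100, after=120, exceptions=exceptions)
print(outcome, active_site, exceptions)
```

Recording the exception rather than silently dropping the update is what makes manual reconciliation possible afterwards.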
Disk Replication and Software Replication with GDPS
Two switch decisions for Sysplex A problems …
– Workload Switch – switch to SW copy (B); once problem is fixed, simply restart SW replication
– Site Switch – switch to SW copy (B) and restart DR Sysplex A from the disk copy
SW replication for DB2/IMS/VSAM, managed by GDPS/A-A – RTO a few seconds
HW replication for all data in region (disk replication managed by GDPS/MGM) – RTO < 1 hour
[Diagram: Active Sysplex A and Standby Sysplex B, each behind a Workload Distributor receiving Connections; DB2, IMS and VSAM data is software-replicated between A and B; DB2, IMS, VSAM, system volumes, batch and other data of Sysplex A is disk-replicated to DR Sysplex A]
Disk Replication Integration
Provide DR for the whole production sysplex (A/A workloads & non-A/A workloads)
Restore A/A Sites capability for A/A Sites workloads after a planned or unplanned region switch
Restart batch workloads after the prime site is restarted and re-synced
The disk replication integration is optional
SW replication for IMS/DB2 and/or VSAM – RTO a few seconds
HW replication for all data in region – RTO < 1 hour
GDPS Disk Replication Integration
High Availability in Region & DR Protection in other Region
[Diagram: Sysplex A in Region A and Sysplex B in Region B, each behind a Workload Distributor receiving Connections, with SW replication for DB2, IMS and/or VSAM between them; WKLD-1 and WKLD-2 are active in one sysplex and standby in the other, and WKLD-3 is the reverse; each sysplex (System, Batch, other) uses MM and GM links for HW replication of all data in the region, producing the copies Sysplex A’ and Sysplex B’]
Unplanned Region Switch – Restart A/A & non-A/A workloads
1. Switch A/A workloads from Region A to Region B
2. Recover Sysplex A secondary/tertiary disk
3. Restart Sysplex A’ in Region B
Potential manual tasks … (not automated by GDPS)
4. Start software replication from B to A’ using adaptive (force) apply
5. Start software replication from A’ to B with default (ignore) apply
6. Manually reconcile exceptions from force (step 4)
[Diagram: Region A with Sysplex A (System, Batch, other) and Sysplex B’; Region B with Sysplex B (System, Batch, other) and Sysplex A’; Workload Distributors receiving Connections; WKLD-1, WKLD-2 and WKLD-3 all active after the switch; MM link; HW replication (suspended); SW replication; callouts 1 to 5 mark the steps above]
Deployment of GDPS/Active-Active
Option 1 – create new sysplex environments for active/active workloads
– Simplifies operations as scope of Active/Active environment is confined to just this or these specific workloads and the Active/Active managed data
Option 2 – Active/Active workload and traditional workload co-exist within the same sysplex
– Will still need a new active sysplex for the second site
– Increased complexity to manage recovery of Active/Active workload to one place, and remaining systems to a different environment, from within the same sysplex
– Existing GDPS/PPRC customers will have to implement GDPS co-operation support between GDPS/PPRC and GDPS/Active-Active
No single right answer – will depend on client environment and requirements/objectives
GDPS/A-A 1.4 New function summary
Active/Query configuration
– Fulfills SoD made when the Active/Standby configuration was announced
VSAM Replication support
– Adds to IMS and DB2 as the data types supported
– Requires either CICS TS V5 for CICS/VSAM applications or CICS VR V5 for logging of non-CICS workloads
Support for IIDR for DB2 (Qrep) Multiple Consistency Groups
– Enables support for massive replication scalability
Workload switch automation
– Avoids manual checking for replication updates having drained as part of the switch process
GDPS/PPRC Co-operation support
– Enables GDPS/PPRC and GDPS/A-A to coexist without issues over who manages the systems
Disk replication integration
– Provides tight integration with GDPS/MGM for GDPS/A-A to be able to manage disaster recovery for the entire sysplex
GDPS Active/Active Sites Customer Use Case – reducing planned outage downtime by 90%
Customer Background
Large Chinese financial institution
Several critical workloads
– Self-services (ATMs)
– Internet banking
– Internet banking (query-only)
Workloads access data from DB2 tables through CICS
Planned outages
– Minor application upgrades (as needed)
• Often included DB2 table schema changes
– Quarterly application version upgrades
• Other planned maintenance activities such as software infrastructure
Customer Goal - Seeking a better recovery time for planned outages
Critical workloads were down for three to four hours
– Scheduled for 3rd shift local time on weekends to limit impact to banking customers
• Still affected customers accessing accounts from other world-wide locations
– Site taken down for application upgrades, possible database schema changes, scheduled maintenance
• All business stopped
• Required manual coordination across geographic locations to block and resume routing of connections into data center
• Reload of DB2 data prolonged the outage period
Goal was to reduce planned outage time for these workloads down to minutes
Customer Solution – Leveraging continuous availability solution to provide better recovery time for planned outages
Solution provides
– A transactionally consistent copy of DB2 on a remote site
• IBM InfoSphere Data Replication for DB2 for z/OS (IIDR) - provides a high-performance replication solution for DB2
– A method to easily switch selected workloads to a remote site without any application changes
• IBM Multi-site Workload Lifeline (Lifeline) - facilitates planned outages by rerouting workloads from one site to another without disruption to users
– A centralized point of control to manage the graceful switch
• GDPS Active/Active Sites - coordinates interactions between IIDR and Lifeline to enable a non-disruptive switch of workloads without loss of data
Reduced impact to their banking customers!
– Total outage time for update workloads was reduced from 3-4 hours down to about 2 minutes
– Total outage time for the query workload was reduced from 3-4 hours down to under 2 minutes
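The outage figures above can be checked against the 90% reduction named in the use case title. A small worked calculation, taking "about 2 minutes" as exactly 2 minutes (an assumption); the function name is illustrative:

```python
def reduction_percent(before_minutes: float, after_minutes: float) -> float:
    """Percentage by which the outage duration was reduced."""
    return 100 * (before_minutes - after_minutes) / before_minutes


# Outages of 3 to 4 hours reduced to about 2 minutes.
for hours in (3, 4):
    pct = reduction_percent(hours * 60, 2)
    print(f"{hours} h -> 2 min: {pct:.1f}% reduction")
```

Under that assumption the reduction is well above the 90% stated in the title for both ends of the 3 to 4 hour range.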
Summary
Manages availability at a workload level
Provides a central point of monitoring & control
Manages replication between sites
Provides the ability to perform a controlled workload site switch
Provides near-continuous data and systems availability and helps simplify disaster recovery with an automated, customized solution
Reduces recovery time and recovery point objectives – measured in seconds
Facilitates regulatory compliance management with a more effective business continuity plan
Simplifies system resource management
GDPS/Active-Active is the next generation of GDPS
Thank You
Tak (Danish)
Danke (German)
Dank u (Dutch)
Obrigado (Brazilian Portuguese)
ขอบคุณ (Thai)
Grazie (Italian)
go raibh maith agat (Gaelic)
Trugarez (Breton)
Merci (French)
Gracias (Spanish)
Спасибо (Russian)
நன்றி (Tamil)
धन्यवाद (Hindi)
شكرًا (Arabic)
감사합니다 (Korean)
תודה רבה (Hebrew)
Tack så mycket (Swedish)
Dankon (Esperanto)
ありがとうございます (Japanese)
谢谢 (Chinese)
děkuji (Czech)
Teşekkür ederim (Turkish)
Additional Charts
Pre-requisite software matrix
Pre-requisite software [version/release level] | GDPS Controller | A-A Systems | non A-A Systems
Operating Systems
z/OS 1.13 or higher | YES | YES | YES
Application Middleware
DB2 for z/OS V9 or higher | NO | YES, workload dependent | as required
IMS V11 | NO | YES, workload dependent | as required
WebSphere MQ V7.0.1 | NO | MQ is only required for DB2 data replication | as required
CICS Transaction Server for z/OS V5.1 | NO | YES 1) | as required
CICS VSAM Recovery for z/OS V5.1 | NO | YES 1) | as required
Replication
InfoSphere Data Replication for DB2 for z/OS 10.2 and SPE | NO | YES, workload dependent | as required 2)
InfoSphere Data Replication for IMS for z/OS V11.1 | NO | YES, workload dependent | as required 2)
InfoSphere Data Replication for VSAM for z/OS V11.1 | NO | YES, workload dependent | as required 2)
1) CICS TS and CICS VR are required when using VSAM replication for A-A workloads
2) Non-Active/Active systems & their workloads can, if required, use Replication Server instances, but not the same instances as the A-A workloads
Pre-requisite software matrix (cont)
Pre-requisite software [version/release level] | GDPS Controller | A-A Systems | non A-A Systems
Management and Monitoring
GDPS/A-A V1.4 | YES | YES 3) | YES 3)
IBM Tivoli NetView Monitoring for GDPS v6.2 4) | YES | YES | YES 3)
IBM Tivoli Management Services for z/OS V6.3 Fixpack 1 or later | YES 5) | YES 6) | YES 6)
IBM Tivoli Monitoring V6.3 Fix Pack 1 or later | NO | NO | NO
Tivoli System Automation for z/OS V3.4 + SPE APARs | YES | YES | YES
IBM Multi-site Workload Lifeline for z/OS V2.0 | YES | YES | NO
3) GDPS/A-A requires the installation of the GDPS satellite code in production systems where A-A workloads run
4) IBM Tivoli NetView Monitoring for GDPS v6.2 requires IBM Tivoli NetView for z/OS V6.2. The v6.2.1 releases of NetView Monitoring for GDPS and NetView for z/OS have just become generally available
5) IBM Tivoli NetView Management Services for z/OS is required for the NetView for z/OS Enterprise Management Agent to monitor the A-A solution.
6) IBM Tivoli NetView Management Services for z/OS is optionally required to run where the NetView for z/OS Enterprise Management Agent runs to monitor NetView itself or where OMEGAMON XE products are deployed.
Optional Monitoring Products
Additional products such as Tivoli OMEGAMON XE on z/OS, Tivoli OMEGAMON XE for DB2, and Tivoli OMEGAMON XE for IMS may optionally be deployed to provide specific monitoring of products that are part of the Active/Active sites solution
Note: Details of cross product dependencies are listed in the PSP information for GDPS/A-A, which can be found by selecting the Upgrade:GDPS and Subset:AAV1R4 at the following URL: http://www14.software.ibm.com/webapp/set2/psearch/search?domain=psp&new=y
Pre-requisite products
IBM Multi-site Workload Lifeline v2.0
– Advisor – runs on the Controllers & provides information to the external load balancers on where to send connections and information to GDPS on the health of the environment
• There is one primary and one secondary advisor
– Agent – runs on all production images with active/active workloads defined and provides information to the Lifeline Advisor on the health of that system
IBM Tivoli NetView Monitoring for GDPS v6.2 or higher
– Runs on all systems and provides automation and monitoring functions. This new product pre-reqs IBM Tivoli NetView for z/OS at the same version/release. The NetView Enterprise Master runs on the Primary Controller
IBM Tivoli Monitoring v6.3 FP1
– Can run on zLinux, or distributed servers – provides monitoring infrastructure and portal plus alerting/situation management via Tivoli Enterprise Portal, Tivoli Enterprise Portal Server and Tivoli Enterprise Monitoring Server
– If running NetView Monitoring for GDPS v6.2.1 and NetView for z/OS v6.2.1, ITM v6.3 FP3 is required.
Pre-requisite products…
IBM InfoSphere Data Replication for DB2 for z/OS v10.2
– Runs on production images where required to capture (active) and apply (standby) data updates for DB2 data. Relies on MQ as the data transport mechanism (QREP)
IBM InfoSphere Data Replication for IMS for z/OS v11.1
– Runs on production images where required to capture (active) and apply (standby) data updates for IMS data. Relies on TCP/IP as the data transport mechanism
IBM InfoSphere Data Replication for VSAM for z/OS v11.1
– Runs on production images where required to capture (active) and apply (standby) data updates for VSAM data. Relies on TCP/IP as the data transport mechanism. Requires CICS TS or CICS VR
System Automation for z/OS v3.4 or higher
– Runs on all images. Provides a number of critical functions:
• BCPii for GDPS
• Remote communications capability to enable GDPS to manage sysplexes from outside the sysplex
• System Automation infrastructure for workload and server management
Optionally the OMEGAMON XE products can provide additional insight to underlying components for Active/Active Sites, such as z/OS, DB2, IMS, the network, and storage
– There are 2 “suite” offerings that include the OMEGAMON XE products (OMEGAMON Performance Management Suite and Service Management Suite for z/OS).
Terminology
Active/Active Sites
– This is the overall concept of the shift from a failover model to a continuous availability model.
– Often used to describe the overall solution, rather than any specific product within the solution.
GDPS/Active-Active
– The name of the GDPS product which provides, along with the other products that make up the solution, the capabilities mentioned in this presentation such as workload, replication and routing management and so on.
Two Types of Active/Active Workloads
Update Workloads
Currently only run in what is defined as an active/standby configuration
– perform updates to the data associated with the workload
– have a relationship with the data replication component
– not all transactions within an update workload will necessarily be update transactions
Query Workloads
Run in what is defined as an active/query configuration
– must not perform any updates to the data associated with the workload
– allows the query workload to run, or could be said to be active, in both sites at the same time
– a query workload must be associated with an update workload
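The rules for the two workload types can be written down as a small configuration check. This is a minimal sketch only: the `Workload` class, its field names, and the site labels are hypothetical and do not correspond to any GDPS interface or policy format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    # All field names here are illustrative assumptions, not GDPS terms.
    name: str
    kind: str                       # "update" or "query"
    active_sites: frozenset         # sites where the workload is currently active
    performs_updates: bool
    associated_update: Optional[str] = None  # for query workloads only


def validate(workloads):
    """Return a list of rule violations (empty when every rule holds)."""
    by_name = {w.name: w for w in workloads}
    errors = []
    for w in workloads:
        if w.kind == "update":
            # active/standby: active in exactly one site at a time
            if len(w.active_sites) != 1:
                errors.append(f"{w.name}: update workload must be active in exactly one site")
        elif w.kind == "query":
            # active/query: may be active in both sites, but must never update data
            if w.performs_updates:
                errors.append(f"{w.name}: query workload must not perform updates")
            target = by_name.get(w.associated_update)
            if target is None or target.kind != "update":
                errors.append(f"{w.name}: query workload must be associated with an update workload")
        else:
            errors.append(f"{w.name}: unknown workload kind {w.kind!r}")
    return errors
```

A query workload that is active in both sites and tied to an update workload passes the check; one that performs updates or has no associated update workload is reported.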
Multiple Consistency Groups (MCGs) – for DB2 for ultra large scale replication needs
A Consistency Group (CG) corresponds to a set of DB2 tables for which the replication apply process maintains transactional consistency - by applying data-dependent transactions serially, and other transactions in parallel
Multiple Consistency Groups (MCGs) are primarily used to provide scalability
– if and when one CG (Single Consistency Group) cannot keep up with all transactions for one workload
– query workloads can tolerate data replicated with eventual consistency
Q Replication (V10.2.1) can coordinate the Apply programs across CGs to guarantee that a time-consistent point across all CGs can be established at the standby site, following a disaster or outage, before switching workloads to this standby site
GDPS operations on a workload control and coordinate replication for all CGs that belong to this workload
– For example, 'STOP REPLICATION' for a workload, stops replication in a coordinated manner for all CGs (all queues and Capture/Apply programs)
– GDPS supports up to 20 consistency groups for each workload
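The idea of a time-consistent point across consistency groups can be illustrated with a short sketch. It assumes each CG reports the latest source commit time up to which it has fully applied changes; the common point is then the earliest of those times. The function name and the input format are hypothetical, and this shows the concept only, not how Q Replication implements the coordination.

```python
MAX_CGS_PER_WORKLOAD = 20  # limit per workload stated on this slide


def common_consistent_point(applied_through):
    """Return the latest time up to which every CG has applied all changes.

    applied_through maps a CG name to the source commit time (any comparable
    value, e.g. a number or datetime) through which that CG is fully applied.
    """
    if not applied_through:
        raise ValueError("a workload needs at least one consistency group")
    if len(applied_through) > MAX_CGS_PER_WORKLOAD:
        raise ValueError("more than 20 consistency groups for one workload")
    # No CG may be ahead of the common point, so take the minimum.
    return min(applied_through.values())
```

With three CGs applied through times 105, 100 and 103, the common point is 100: data at the standby site is consistent across all three CGs only up to that time.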
Multiple consistency groups (MCG) – deeper insight
[Figure: replication between Site-1 (DB2 sharing group) and Site-2 (DB2 sharing group) across the network. Diagram labels: Workload 1 (CG1); Workload 2 (CG2, CG3); Capture1, Capture2; Apply1, Apply2; source tables SOURCE1 to SOURCE6; target tables TARGET1 to TARGET6; send queue, receive queue; MQ Manager 1 to MQ Manager 4; "Multiple channels for throughput rate"; "Single CG meets the requirements of majority of workloads"; "Multiple consistency groups".]
Note: Eventual Consistency is suitable for a large number of READONLY applications
Conceptual view
[Figure: two sites. One runs the Active Update Workload & Active Query Workload, the other runs the Standby Update Workload & Active Query Workload. Diagram labels: Connections; Workload Distribution (any load balancer or workload distributor that supports the Server Application State Protocol (SASP)); Workload Routing to active sysplex; Control information passed between systems and workload distributor; Controllers (Workload Lifeline, Tivoli NetView, System Automation, …); S/W Data Replication.]
High level architecture
[Figure: layered software stack. Diagram labels: GDPS/Active-Active; Lifeline; DB2 Replication; IMS Replication; VSAM Replication; SA zOS; NetView; TCP/IP; MQ; Workload; Monitoring; z/OS; System z Hardware.]
Sample scenario – all workloads active in one site
[Figure: Site 1 and Site 2 connected by the network, with S/W Replication (DB2, VSAM, IMS) between them.]
Controllers: AA Controller [AAC1] (Primary); AA Controller [AAC2] (Backup)
Sysplex-A1 [AAPLEX1]
– A1 Prod-sys [A1P1]: wkld-1 active, wkld-2 active, wkld-3 active
– A1 Prod-sys [A1P2]: wkld-1 active, wkld-3 active
Sysplex-A2 [AAPLEX2]
– A2 Prod-sys [A2P1]: wkld-1 standby, wkld-2 standby, wkld-3 standby
– A2 Prod-sys [A2P2]: wkld-1 standby, wkld-3 standby
Network: SASP-compliant Routers; LB 1° Tier (F5); LB 2° Tier (Sysplex Distrib); Routing for WKLD 1, 2 & 3
What is an Active/Active Workload?
A workload is the aggregation of these components
– Software: user-written applications (e.g. COBOL programs) and the middleware run-time environment (e.g. CICS regions, InfoSphere Replication Server instances and DB2 subsystems)
– Data: related set of objects that must preserve transactional consistency and optionally referential integrity constraints (e.g. DB2 Tables, IMS Databases, VSAM Files)
– Network connectivity: one or more TCP/IP addresses and ports (e.g. 10.10.10.1:80)
Data – deeper insight
In DB2 Replication, the mapping between a table at the source and a table at the target is called a subscription
– Example shows 2 subscriptions for tables T1 and T2
A subscription belongs to a QMap which defines the sendq that is used to send data for that subscription
– Example shows that both subscriptions are using the same QMap (SQ1)
In IMS Replication, a subscription is a combination of a source server and a target server
– The subscription is the object that is started/stopped by GDPS/A-A.
– This corresponds to the QMap in Q Replication
Each IMS Replication subscription contains a list of replication mappings
– There is one replication mapping for each IMS database being replicated
– This corresponds to a subscription in Q Replication
[Figure: SOURCE [SUBS] with tables T1 and T2, SENDQ [SQ1], RECVQ [SQ1], and TARGET [SUBS] with tables T1 and T2.]
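The correspondence between the two replication models can be sketched as data structures. This is an illustrative model only: the class names, the server names and the IMS database names are hypothetical, and only the relationships (a QMap groups table subscriptions, an IMS subscription groups replication mappings) come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSubscription:
    # Q Replication: one source table mapped to one target table
    source_table: str
    target_table: str


@dataclass
class QMap:
    # Q Replication: names the send queue shared by its subscriptions
    name: str
    subscriptions: list


@dataclass
class ImsSubscription:
    # IMS Replication: a source server paired with a target server
    source_server: str
    target_server: str
    mappings: list  # one entry per replicated IMS database


def replicated_objects(container):
    """List what a QMap or an IMS subscription replicates."""
    if isinstance(container, QMap):
        return [s.source_table for s in container.subscriptions]
    return list(container.mappings)


# The slide's example: subscriptions for T1 and T2 share the QMap SQ1.
sq1 = QMap("SQ1", [TableSubscription("T1", "T1"), TableSubscription("T2", "T2")])
# Hypothetical IMS counterpart with two replicated databases.
ims = ImsSubscription("SRC1", "TGT1", ["DBA", "DBB"])
```

In this model the QMap and the IMS subscription play the same role as the container, while a Q Replication subscription and an IMS replication mapping are the per-object entries inside it.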
Architectural building blocks
[Figure: four system types connected by the WAN, SASP-compliant routers and the SE/HMC LAN.]
Active Production: Lifeline Agent; z/OS; Workload; IMS/DB2/VSAM; Replication Capture; TCPIP; MQ; SA; NetView; Other Automation Product
Standby Production: Lifeline Agent; z/OS; Workload; IMS/DB2/VSAM; Replication Apply; TCPIP; MQ; SA; NetView; Other Automation Product
Primary Controller and Backup Controller (each): Lifeline Advisor; NetView; SA & BCPii; GDPS/A-A; z/OS; Tivoli Monitoring
Connectivity: WAN & SASP-compliant Routers used for workload distribution; SE/HMC LAN
GDPS/Active-Active (the product)
Automation code is an extension of many of the techniques tried and tested in other GDPS products and in many client environments for management of their mainframe CA & DR requirements
Control code only runs on Controller systems
Workload management - start/stop components of a workload in a given Sysplex
Software Replication management - start/stop replication for a given workload between sites
Disk Replication management – ability to manipulate GDPS/MGM from GDPS/A-A
Routing management - start/stop routing of connections to a site
System and Server management - STOP (graceful shutdown) of a system, LOAD, RESET, ACTIVATE, DEACTIVATE the LPAR for a system, and capacity on demand actions such as CBU/OOCoD
Monitoring the environment and alerting for unexpected situations
Planned/Unplanned situation management and control - planned or unplanned site or workload switches; automatic actions such as automatic workload switch (policy dependent)
Powerful scripting capability for complex/compound scenario automation
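How routing and replication management might combine in a scripted planned workload switch can be sketched as follows. Everything here is an assumption for illustration: the `RecordingController` class, its method names, and the ordering of steps are hypothetical and are not GDPS script statements. The controller only records the requested actions so that the sequence can be inspected.

```python
class RecordingController:
    """Hypothetical stand-in that logs actions instead of driving real systems."""

    def __init__(self):
        self.log = []

    def stop_routing(self, workload, site):
        self.log.append(("stop_routing", workload, site))

    def start_routing(self, workload, site):
        self.log.append(("start_routing", workload, site))

    def stop_replication(self, workload, source, target):
        self.log.append(("stop_replication", workload, source, target))

    def start_replication(self, workload, source, target):
        self.log.append(("start_replication", workload, source, target))


def planned_switch(ctl, workload, from_site, to_site):
    """Assumed ordering for moving an update workload between sites."""
    # 1. Stop sending new connections to the currently active site.
    ctl.stop_routing(workload, from_site)
    # 2. Stop replication in the old direction (assumes in-flight updates
    #    have already been applied at the target).
    ctl.stop_replication(workload, from_site, to_site)
    # 3. Start replication in the reverse direction.
    ctl.start_replication(workload, to_site, from_site)
    # 4. Route new connections to the new active site.
    ctl.start_routing(workload, to_site)
```

Calling `planned_switch(ctl, "wkld-1", "site1", "site2")` on a fresh `RecordingController` leaves four entries in `ctl.log`, ending with routing started to `site2`.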