© 2014 IBM Corporation IBM Systems and Technology Group David Petersen [email protected]...

© 2014 IBM Corporation

IBM Systems and Technology Group

GDPS/Active-Active Overview (1.4)

David Petersen [email protected]



Trademarks

The following are trademarks of the International Business Machines Corporation in the United States and/or other countries.

IBM*, IBM (logo)*, ibm.com*, AIX*, DB2*, DS6000, DS8000, Dynamic Infrastructure*, ESCON*, FlashCopy*, GDPS*, HyperSwap, IBM logo*, Parallel Sysplex*, POWER5, Redbooks*, Sysplex Timer*, System p*, System z*, Tivoli*, z/OS*, z/VM*

* Registered trademarks of IBM Corporation

The following are trademarks or registered trademarks of other companies.

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

InfiniBand is a trademark and service mark of the InfiniBand Trade Association.

Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office.

IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency, which is now part of the Office of Government Commerce.

* All other products may be trademarks or registered trademarks of their respective companies.

Notes:

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics will vary depending on individual customer configurations and conditions.

This publication was produced in the United States. IBM may not offer the products, services or features discussed in this document in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the product or services available in your area.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

Prices subject to change without notice. Contact your IBM representative or Business Partner for the most current pricing in your geography.



Agenda

Level set

Requirements

Concepts

Configurations

Sample Scenarios

Use Cases

Summary




Suite of GDPS service products to meet various business requirements for availability and disaster recovery

Continuous Availability of Data within a Data Center
– Single Data Center
– Applications remain active
– Continuous access to data in the event of a storage outage
– GDPS/PPRC HM
– RPO=0 [RTO secs] for disk only

Continuous Availability with DR within Metropolitan Region
– Two Data Centers
– Systems remain active
– Multi-site workloads can withstand site and/or storage failures
– GDPS/PPRC
– RPO=0, RTO mins (<20km) / RTO<1h (>20km)

Disaster Recovery Extended Distance
– Two Data Centers
– Rapid Systems D/R w/ “seconds” of data loss
– Disaster Recovery for out of region interruptions
– GDPS/GM & GDPS/XRC
– RPO secs, RTO<1h

CA Regionally and Disaster Recovery Extended Distance
– Three Data Centers
– High availability for site disasters
– Disaster recovery for regional disasters
– GDPS/MGM & GDPS/MzGM
– RPO=0, RTO mins/<1h & RPO secs, RTO<1h

RPO – recovery point objective
RTO – recovery time objective




1. Identify clearing and settlement activities in support of critical financial markets

2. Determine appropriate recovery and resumption objectives for clearing and settlement activities in support of critical markets

– core clearing and settlement organizations should develop the capacity to recover and resume clearing and settlement activities within the business day on which the disruption occurs with the overall goal of achieving recovery and resumption within two hours after an event

3. Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives

– Back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location

– The effectiveness of back-up arrangements in recovering from a wide-scale disruption should be confirmed through testing

4. Routinely use or test recovery and resumption arrangements

– One of the lessons learned from September 11 is that testing of business recovery arrangements should be expanded

Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System [Docket No. R-1128] (April 7, 2003)




How Much Interruption can your Business Tolerate?

Standby

Active/Active

Ensuring Business Continuity:

Disaster Recovery

– Restore business after an unplanned outage

High Availability

– Meet Service Availability objectives, e.g., 99.9% availability or 8.8 hours of down-time a year

Continuous Availability

– No downtime (planned or not)
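The availability figure above can be checked with a small calculation. The sketch below is only an illustration: the function name is made up for this example, and it assumes a non-leap year of 8,760 hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Print the downtime budget for a few common availability targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(target)
        print(f"{target}% availability -> {hours:.2f} hours of downtime a year")
```

For 99.9% this gives 8.76 hours, which the slide rounds to 8.8 hours.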

What is the cost of 1 hour of downtime during core business hours?

Enterprises that operate across time zones no longer have any ‘off-hours’ window; Continuous Availability is required



Disruptions affect more than the bottom line …

… with enormous impact on the business

Downtime costs can equal up to 16 percent of revenue (1)

4 hours of downtime is severely damaging for 32 percent of organizations (2)

Data is growing at explosive rates – growing from 161EB in 2007 to 988EB in 2010 (3)

Some industries impose fines for downtime and for failure to meet regulatory compliance

Downtime ranges from 300–1,200 hours per year, depending on industry (1)

(1) Infonetics Research, The Costs of Enterprise Downtime: North American Vertical Markets 2005, Rob Dearborn and others, January 2005

(2) Continuity Central, “Business Continuity Unwrapped,” 2006, http://www.continuitycentral.com/feature0358.htm

(3) The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010, IDC white paper #206171, March 2007

August 18, 2013
Google total eclipse sees 40 percent drop in Internet traffic

August 22, 2013
Nasdaq: ‘Connectivity issue’ led to three-hour shutdown

July 20, 2013
DMV Computers Fail Statewide, Police Can’t Access Database

April 16, 2013
American Airlines Grounds Flights Nationwide



Evolving customer requirements

Shift focus from failover model to near-continuous availability model (RTO near zero)

Access data from any site (unlimited distance between sites)

Multi-sysplex, multi-platform solution

– “Recover my business rather than my platform technology”

Ensure successful recovery via automated processes (similar to GDPS technology today)

– Can be handled by less-skilled operators

Provide workload distribution between sites (route around failed sites, dynamically select sites based on ability of site to handle additional workload)

Provide application level granularity

– Some workloads may require immediate access from every site, other workloads may only need to update other sites every 24 hours (less critical data)

– Current solutions employ an all-or-nothing approach (complete disk mirroring, requiring extra network capacity)



From High Availability to Continuous Availability

GDPS/Active-Active is for mission critical workloads that have stringent recovery objectives that cannot be achieved using existing GDPS solutions

– RTO approaching zero, measured in seconds for unplanned outages
– RPO approaching zero, measured in seconds for unplanned outages
– Non-disruptive site switch of workloads for planned outages
– At any distance

Active-Active is NOT intended to substitute for a local availability solution such as Parallel Sysplex

                 GDPS/PPRC    GDPS/XRC or GDPS/GM    GDPS/Active-Active
Model            Failover     Failover               Near CA
Recovery time    ≈ 2 min      < 1 hour               < 1 minute
Distance         < 20 km      Unlimited              Unlimited



There are multiple GDPS service products under the GDPS solution umbrella to meet various customer requirements for Availability and Disaster Recovery

Continuous Availability of Data within a Data Center
– Single Data Center
– Applications remain active
– Continuous access to data in the event of a storage outage
– GDPS/PPRC HM
– RPO=0 [RTO secs] for disk only

Continuous Availability with DR within Metropolitan Region
– Two Data Centers
– Systems remain active
– Multi-site workloads can withstand site and/or storage failures
– GDPS/PPRC
– RPO=0, RTO mins (<20km) / RTO<1h (>20km)

Disaster Recovery Extended Distance
– Two Data Centers
– Rapid Systems D/R w/ “seconds” of data loss
– Disaster Recovery for out of region interruptions
– GDPS/GM & GDPS/XRC
– RPO secs, RTO<1h

CA Regionally and Disaster Recovery Extended Distance
– Three Data Centers
– High availability for site disasters
– Disaster recovery for regional disasters
– GDPS/MGM & GDPS/MzGM
– RPO=0, RTO mins/<1h & RPO secs, RTO<1h

CA, DR, & Cross-site Workload Balancing Extended Distance
– Two or more Active Data Centers
– Automatic workload switch in seconds; seconds of data loss
– GDPS/Active-Active
– RPO secs, RTO secs

RPO – recovery point objective
RTO – recovery time objective



Active/Active Sites concept

Two or more sites, separated by unlimited distances, running the same applications and having the same data to provide:

– Cross-site Workload Balancing
– Continuous Availability
– Disaster Recovery

Data at geographically dispersed sites kept in sync via replication

Monitoring spans the sites and now becomes an essential element of the solution for site health checks, performance tuning, etc.

Workloads are managed by a client and routed to one of many replicas, depending upon workload weight and latency constraints; extends workload balancing to SYSPLEXs across multiple sites

[Figure: Connections arrive at a Workload Distributor, which routes them to the sites; Replication runs between the sites]



Active/Active Sites Configurations

Configurations
1. Active/Standby – GA date 30th June 2011
2. Active/Query – GA date 31st October 2013
3. …

A configuration is specified on a workload basis

A workload is the aggregation of these components
– Software: user written applications (e.g. COBOL programs) and the middleware run time environment (e.g. CICS regions, InfoSphere Replication Server instances and DB2 subsystems)
– Data: related set of objects that must preserve transactional consistency and optionally referential integrity constraints (such as DB2 Tables, IMS Databases and VSAM Files)
– Network connectivity: one or more TCP/IP addresses & ports (e.g. 10.10.10.1:80)
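The three components of a workload can be pictured as a simple record. The sketch below is purely illustrative and is not how GDPS/Active-Active stores its definitions: the `Workload` class, its field names and the sample names (`WKLD1`, `CICSAOR`, `DB2A`, `ACCOUNTS`) are assumptions made for this example; only the configuration names and the address 10.10.10.1:80 come from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Configuration names taken from the slide; everything else is hypothetical.
VALID_CONFIGURATIONS = ("active/standby", "active/query")


@dataclass
class Workload:
    """Illustrative model of a workload: software, data and network connectivity."""
    name: str
    configuration: str
    software: List[str] = field(default_factory=list)   # applications and middleware
    data: List[str] = field(default_factory=list)       # DB2 tables, IMS databases, VSAM files
    endpoints: List[Tuple[str, int]] = field(default_factory=list)  # (IP address, port)

    def validate(self) -> None:
        """Raise ValueError if the definition is incomplete or malformed."""
        if self.configuration not in VALID_CONFIGURATIONS:
            raise ValueError(f"unknown configuration: {self.configuration}")
        if not self.endpoints:
            raise ValueError("a workload needs at least one TCP/IP address and port")
        for address, port in self.endpoints:
            if not 0 < port < 65536:
                raise ValueError(f"invalid port for {address}: {port}")


if __name__ == "__main__":
    wkld = Workload(
        name="WKLD1",
        configuration="active/standby",
        software=["CICSAOR", "DB2A"],
        data=["ACCOUNTS"],
        endpoints=[("10.10.10.1", 80)],
    )
    wkld.validate()
    print(f"{wkld.name} is defined as {wkld.configuration}")
```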



Active/Standby configuration

This is a fundamental paradigm shift from a failover model to a continuous availability model

Static Routing
Automatic Failover

[Figure: Connections arrive at a Workload Distributor and are routed to the site where Applications A and B are active; the other site runs Applications A and B as standby. IMS, DB2 and VSAM data is replicated between site1 and site2, and updates are shown as queued. After failover, Applications A and B are active in the former standby site.]



Active/Query configuration

Read-only or query connections are routed to both sites, while update connections are routed only to the active site

Appl A (gold) is in active/standby configuration
• performing updates in active site [site2]

Appl B (grey) is in active/query configuration
• using same data as Appl A but read only
• active to both site1 & site2, but favor site1
• Appl B query routing according to Appl A latency policy

[Figure: Connections arrive at a Workload Distributor; IMS, DB2 and VSAM data is replicated between site1 and site2; Appl A connections go to one site, Appl B connections are spread over both]

[A] latency=2; as latency is less than “max latency”, follow policy to skew queries to site1

[A] latency>5; as “max latency” policy has been exceeded, route all queries to site2

[A] latency<3; as latency is less than “reset latency” policy, route more queries to site1
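The three latency examples on this slide describe a decision with hysteresis: once "max latency" is exceeded, queries stay on the active site until latency drops below "reset latency". The sketch below illustrates that logic only. The `QueryRouter` class and its return strings are invented for this example, and the thresholds 5 and 3 are inferred from the slide's "latency>5" and "latency<3" cases rather than taken from any product default.

```python
class QueryRouter:
    """Hysteresis-based routing decision for a query workload (illustrative)."""

    def __init__(self, max_latency: float = 5.0, reset_latency: float = 3.0) -> None:
        if reset_latency > max_latency:
            raise ValueError("reset_latency must not exceed max_latency")
        self.max_latency = max_latency
        self.reset_latency = reset_latency
        self.exceeded = False  # True once max latency has been exceeded

    def decide(self, latency: float) -> str:
        """Return the routing decision for the current replication latency."""
        if latency > self.max_latency:
            # Data in the favored site is too stale: send all queries to the active site.
            self.exceeded = True
        elif latency < self.reset_latency:
            # Replication has caught up again: resume skewing queries to the favored site.
            self.exceeded = False
        # Between the two thresholds the previous decision is kept.
        return "all-to-site2" if self.exceeded else "skew-to-site1"


if __name__ == "__main__":
    router = QueryRouter()
    for observed in (2, 6, 4, 2):
        print(f"latency={observed}: {router.decide(observed)}")
```

Using two thresholds instead of one avoids flapping between the sites when latency hovers around a single limit.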



What is a GDPS/Active-Active environment?

Two Production Sysplex environments (also referred to as sites) in different locations
– One active, one standby – for each defined update workload, and potential query workload active in both sites
– Software-based replication between the two sysplexes/sites
  • DB2, IMS and VSAM data is supported

Two Controller Systems
– Primary/Backup
– Typically one in each of the production locations, but there is no requirement that they are co-located in this way

Workload balancing/routing switches
– Must be Server/Application State Protocol (SASP) compliant
  • RFC4678 describes SASP
– What switches/routers are SASP-compliant? … the following are those we know about
  • Cisco Catalyst 6500 Series Switch Content Switching Module
  • F5 Big IP Switch
  • Citrix NetScaler Appliance
  • Radware Alteon Application Switch (bought Nortel appliance line)



Sample scenario – both sites active for individual workloads

AA Controllers
– AA Controller [AAC1]: Primary
– AA Controller: Backup

Site 1: Sysplex-A1 [AAPLEX1]
– A1 Prod-sys [A1P1]: wkld1 active, wkld2 standby, wkld3 active
– A1 Prod-sys [A1P2]: wkld1 active, wkld3 active

Site 2: Sysplex-A2 [AAPLEX2]
– A2 Prod-sys [A2P1]: wkld1 standby, wkld2 active, wkld3 standby
– A2 Prod-sys [A2P2]: wkld1 standby, wkld3 standby

S/W Replication of DB2, VSAM and IMS data between the sites

Network
– SASP-compliant Routers
– LB 1° Tier: F5
– LB 2° Tier: Sysplex Distrib
– Routing for WKLD 1 & 3
– Routing for WKLD 2



What S/W makes up a GDPS/Active-Active environment?

Integration of a number of software products

GDPS/Active-Active

IBM Tivoli NetView Monitoring for GDPS, which pre-reqs:
– IBM Tivoli NetView for z/OS
  • IBM Tivoli NetView for z/OS Enterprise Management Agent (NetView agent) – separate orderable

System Automation for z/OS

IBM Multi-site Workload Lifeline for z/OS

Middleware – DB2, IMS, CICS…

Replication Software
– IBM InfoSphere Data Replication for DB2 for z/OS (IIDR for DB2)
– IBM InfoSphere Data Replication for IMS for z/OS (IIDR for IMS)
– IBM InfoSphere Data Replication for VSAM for z/OS (IIDR for VSAM)

Optionally the Tivoli OMEGAMON XE monitoring products
– Individually or part of a suite



Automation – deeper insight

All components of a Workload should be defined in SA* as
– One or more Application Groups (APG)
– Individual Applications (APL)

The Workload itself is defined as an Application Group

SA z/OS keeps track of the individual members of the Workload's APG and reports a “compound” status to the A/A Controller

* Note that although SA is required on all systems, you can be using an alternative automation product to manage your workloads.

Legend
– DDS: Default Desired Status
– HP: HasParent
– OnDemand: Resource is UNAVAILABLE at IPL time
– Asis: Resource is kept in the state it is at IPL time

[Figure: MYWORKLOAD/APG (DDS=OnDemand | Asis) with CICS/APG (CICSTOR, CICSAOR), CAPTURE, APPLY, DB2 and TCP/IP, connected by HasParent (HP) relationships]
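The idea of reporting one "compound" status for a workload's Application Group can be illustrated by aggregating member states. The sketch below is an assumption-laden illustration: the function, the member states `UP`/`DOWN` and the result values `AVAILABLE`/`DEGRADED`/`UNAVAILABLE` are invented for this example and are not the status values that SA z/OS actually reports.

```python
from typing import Dict

# Hypothetical member states; real automation products use richer status models.
MEMBER_STATES = ("UP", "DOWN")


def compound_status(member_status: Dict[str, str]) -> str:
    """Collapse the states of a group's members into a single status."""
    if not member_status:
        raise ValueError("an application group needs at least one member")
    states = set(member_status.values())
    unknown = states.difference(MEMBER_STATES)
    if unknown:
        raise ValueError(f"unknown member state(s): {sorted(unknown)}")
    if states == {"UP"}:
        return "AVAILABLE"      # every member is running
    if states == {"DOWN"}:
        return "UNAVAILABLE"    # no member is running
    return "DEGRADED"           # some members are running, some are not


if __name__ == "__main__":
    # Member names follow the labels in the figure.
    workload = {"CICSTOR": "UP", "CICSAOR": "UP", "CAPTURE": "DOWN", "APPLY": "UP"}
    print(compound_status(workload))
```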



Automation – sharing components between workloads

Certain components of a Workload, for instance DB2, could also be viewed as “infrastructure”

Relationship(s) from the Workload ensure that the supporting “infrastructure” resources are available when needed

Infrastructure is typically started at IPL time

[Figure: MYWORKLOAD_2/APG with CICS/APG (CICSTOR, CICSAOR), CAPTURE and APPLY, connected by HasParent (HP) relationships to an Infrastructure group of DB2, TCP/IP, JES, MQ and LifelineAgt]



Automation – sharing components between workloads

Shared members

Other components of a Workload, for instance capture and apply engines, can also be shared

However, GDPS requires that they are members of the Workload

Rationale

The A/A Controller needs to know the capture and apply engines that belong to a Workload in order to

– Quiesce work properly including replication

– Send commands to them

[Figure: MYWKL_31/APG and MYWKL_32/APG, with CICS31/APG and CICS32/APG, sharing CAPTURE and APPLY and connected by HasParent (HP) relationships to an Infrastructure group of DB2, TCP/IP, JES, MQ and LifelineAgt]



Software replication – Deeper Insight

[Figure: updates to SOURCE1, SOURCE2 and SOURCE3 are written to the Log, read by Capture, sent to Apply and applied to TARGET1, TARGET2 and TARGET3; the numbers 1 to 5 mark the steps below]

1. Transaction committed

2. Capture reads the DB updates from the log

3. Capture sends the updates to Apply

4. Apply receives the updates from Capture

5. Apply applies the DB updates to the target databases

Replication latency (E2E) = Capture latency + Network latency + Apply latency
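The end-to-end replication latency shown on this slide is made up of the capture, network and apply latencies. A minimal sketch of that sum follows; the class and field names are made up for illustration, and the sample values are not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationLatency:
    """Latency components of software replication, in seconds (illustrative)."""
    capture_seconds: float   # commit on the source until Capture has sent the update
    network_seconds: float   # transfer from Capture to Apply
    apply_seconds: float     # receipt by Apply until the target database is updated

    @property
    def end_to_end_seconds(self) -> float:
        """Replication latency (E2E) as the sum of the three components."""
        return self.capture_seconds + self.network_seconds + self.apply_seconds


if __name__ == "__main__":
    sample = ReplicationLatency(capture_seconds=0.5, network_seconds=1.0, apply_seconds=0.5)
    print(f"E2E replication latency: {sample.end_to_end_seconds} seconds")
```

How the five numbered steps map onto the three components is an assumption in the field comments; the slide only names the components.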



Connectivity – deeper insight

1. Advisor to Agent

2. Advisor to LBs

3. Advisor to Advisor

4. Advisor to SEs

5. Advisor NMI

[Figure: site-1 (SYSPLEX 1) and site-2 (SYSPLEX 2), each with an Application / Database Tier. Systems SYS-A to SYS-D run Server Applications, with TCP/IP and a Lifeline Agent. The Primary Controller and the Secondary Controller each run a Lifeline Advisor and NetView. A 1st Tier LB, 2nd Tier LBs and SEs are also shown. Numbered arrows 1 to 5 mark the connections listed above.]



GDPS/A-A configuration

Controllers
– Primary Controller [AAC1]: Netview Master, LLAdvisor Primary, TEMA
– Backup Controller [AAC2]: Netview Backup, LLAdvisor Secondary, TEMA
– GDPS Web Interface, TEP Interface

Site 1
– A1 Production 1 [A1P1]: LLAgent, MQ / TCPIP, Workload 1 Active, Workload 3 Active, DB2 Rep, IMS Rep, CICS/DB2 Appl, IMS Appl
– A1 Production 2 [A1P2]: LLAgent, MQ / TCPIP, Workload 1 Active, Workload 3 Active, DB2 Rep, IMS Rep, CICS/DB2 Appl, IMS Appl

Site 2
– A2 Production 1 [A2P1]: LLAgent, MQ / TCPIP, Workload 1 Standby, Workload 3 Standby, DB2 Rep, IMS Rep, CICS/DB2 Appl, IMS Appl
– A2 Production 2 [A2P2]: LLAgent, MQ / TCPIP, Workload 1 Standby, Workload 3 Standby, DB2 Rep, IMS Rep, CICS/DB2 Appl, IMS Appl

S/W Replication of DB2 and IMS data between the sites

Network
– LB 1° Tier: F5
– LB 2° Tier: Sysplex Distrib

VSAM replication resources not shown for the sake of clarity



Planned workload switch

Initiate [click here] from the GDPS Web Interface

SWITCH ROUTING

Note: multiple workloads and needed infrastructure resources are not shown for the sake of clarity

[Figure: AAC1 (primary) and AAC2 (backup) controllers with the GDPS Web Interface and TEP Interface. AAPLEX1 (A1P1, A1P2) in Site1 and AAPLEX2 (A2P1, A2P2) in Site2 run WKLD1, WKLD2 and WKLD3 with CICS/DB2 Appl and DB2 Rep over DB2, VSAM and IMS data. The Network reaches the sysplexes through LB 1° Tier F5 and Sysplex Distrib. WKLD1 is shown switching between active and standby across the two sysplexes.]



Unplanned site failure

Note: multiple workloads and needed infrastructure resources are not shown for the sake of clarity

[Figure: AAC1 (primary) and AAC2 (backup) controllers with the TEP Interface and GDPS Web Interface. AAPLEX1 (A1P1, A1P2) in Site1 and AAPLEX2 (A2P1, A2P2) in Site2 run WKLD1, WKLD2 and WKLD3 with CICS/DB2 Appl and DB2 Rep over DB2, VSAM and IMS data. The Network reaches the sysplexes through LB 1° Tier F5 and Sysplex Distrib. Replication updates are shown as queued, and WKLD1 changes from standby to active in the surviving site.]

Failure Detection Interval = 60 sec

SITE_FAILURE = Automatic

Automatic switch: STOP ROUTING, START ROUTING
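The "Failure Detection Interval = 60 sec" and "SITE_FAILURE = Automatic" settings on the unplanned site failure slide suggest a decision along the following lines. This is a rough sketch under stated assumptions, not GDPS logic: the function, its parameters and the site names are hypothetical, the interval is treated as "seconds without contact from the site", and any policy value other than `Automatic` is assumed to mean "ask the operator".

```python
from typing import List


def site_failure_actions(
    seconds_without_contact: float,
    detection_interval: float = 60.0,      # "Failure Detection Interval = 60 sec"
    site_failure_policy: str = "Automatic",  # "SITE_FAILURE = Automatic"
    failed_site: str = "site1",            # hypothetical site names
    standby_site: str = "site2",
) -> List[str]:
    """Return the actions to take for a site that has stopped responding."""
    if seconds_without_contact < detection_interval:
        # Too early to declare a failure; keep monitoring.
        return []
    if site_failure_policy == "Automatic":
        # Mirrors the STOP ROUTING / START ROUTING labels on the slide.
        return [f"STOP ROUTING to {failed_site}", f"START ROUTING to {standby_site}"]
    # Assumption: any other policy value leaves the decision to the operator.
    return ["PROMPT OPERATOR"]


if __name__ == "__main__":
    for elapsed in (30, 61):
        print(elapsed, site_failure_actions(elapsed))
```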



Go Home scenario

After an unplanned workload/site outage

Note: there is the potential for transactions to have been stranded in the failed site: they had completed execution and committed data to the database at the time of the failure, but this data had not been replicated to the standby site.

Assume the data is still available on the disk subsystems

– Start the site or workload that had failed

– Restart replication from the site brought back online to the currently active site; this delivers any stranded changes resulting from the unplanned outage (*)

– Re-synchronize the recovering site with data from the currently active site, by starting replication in the other direction

– Re-direct the workload, once the recovered site is operational and can process workloads

After a planned workload/site outage

Note: as the process to perform a planned site switch ensures that there are no stranded updates in the active site at the start of the switch, there is no need to start replication in the opposite direction in order to deliver stranded updates.

– Start the site or workload that had been stopped

– Re-synchronize the restarted site or workload with data from the currently active site, by starting replication from the active to the now standby site

– Re-direct the workload, once the restarted site is operational, data replication has caught up, and the site can process workloads

(*) Attempts to apply the stranded changes to the data in the active site may result in an exception or conflict, as the before image of the update that is stranded will no longer match the updated value in the active site. For IMS replication, the adaptive apply process will discard the update and issue messages to indicate that there has been a conflict and an update has been discarded. For DB2 replication, the update may not be applied, depending on conflict handling policy settings, and additionally an exception record will be inserted into a table.
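The conflict described in the (*) note comes down to a before-image check: a stranded update is applied only if the value it expected to replace is still the current value. The sketch below illustrates that check. It is not the IIDR apply logic: the `Change` type, the function and the `force` flag are hypothetical, with `force` loosely modelled on the "force" and "ignore" apply labels used elsewhere in this deck.

```python
from typing import Dict, List, NamedTuple


class Change(NamedTuple):
    """A replicated update: the key, the value before and the value after."""
    key: str
    before: str
    after: str


def apply_stranded(target: Dict[str, str], changes: List[Change],
                   force: bool = False) -> List[Change]:
    """Apply stranded changes to the target and return those that conflicted."""
    exceptions: List[Change] = []
    for change in changes:
        if target.get(change.key) == change.before:
            # Before image matches: the update can be applied safely.
            target[change.key] = change.after
        else:
            # Before image no longer matches: record the conflict for later review.
            exceptions.append(change)
            if force:
                # Assumed "force" behaviour: overwrite the target anyway.
                target[change.key] = change.after
    return exceptions


if __name__ == "__main__":
    active_site = {"acct1": "100", "acct2": "50"}
    stranded = [Change("acct1", "100", "90"), Change("acct2", "40", "30")]
    conflicts = apply_stranded(active_site, stranded)
    print("data after apply:", active_site)
    print("conflicts to reconcile:", conflicts)
```

Returning the conflicting changes instead of silently dropping them mirrors the exception record and messages mentioned in the note, so that they can be reconciled manually.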



Disk Replication and Software Replication with GDPS

Two switch decisions for Sysplex A problems …

– Workload Switch – switch to SW copy (B); once problem is fixed, simply restart SW replication

– Site Switch – switch to SW copy (B) and restart DR Sysplex A from the disk copy

SW replication for DB2/IMS/VSAM – RTO a few seconds – SW replication managed by GDPS/A-A

HW replication for all data in region – RTO < 1 hour – disk replication managed by GDPS/MGM

[Figure: Connections reach Active Sysplex A and Standby Sysplex B through Workload Distributors; DB2, IMS and VSAM data is replicated by software between A and B, while System Volumes, Batch and Other data is replicated by disk to DR Sysplex A]



Disk Replication Integration

Provide DR for whole production sysplex (A/A workloads & non-A/A workloads)

Restore A/A Sites capability for A/A Sites workloads after a planned or unplanned region switch

Restart batch workloads after the prime site is restarted and re-synced

The disk replication integration is optional

SW replication for IMS/DB2 and/or VSAM – RTO a few seconds

HW replication for all data in region – RTO < 1 hour



GDPS Disk Replication Integration

High Availability in Region & DR Protection in other Region

SW replication for DB2, IMS and/or VSAM

HW replication for all data in region

[Figure: Sysplex A in Region A and Sysplex B in Region B, each behind a Workload Distributor and each holding System, Batch and other data. One sysplex runs WKLD-1 and WKLD-2 active with WKLD-3 standby; the other runs WKLD-1 and WKLD-2 standby with WKLD-3 active. SW replication links the two sysplexes. Disk replication (MM and GM) maintains the copies Sysplex A’ and Sysplex B’.]



Unplanned Region Switch – Restart A/A & non-A/A workloads

1. Switch A/A workloads from Region A to Region B

2. Recover Sysplex A secondary/tertiary disk

3. Restart Sysplex A’ in Region B

Potential manual tasks … (not automated by GDPS)

4. Start software replication from B to A’ using adaptive (force) apply

5. Start software replication from A’ to B with default (ignore) apply

6. Manually reconcile exceptions from force (step 4)

[Figure: Region A with Sysplex A (System, Batch, other) and Region B with Sysplex B (System, Batch, other), Sysplex A’ and Sysplex B’, each region behind a Workload Distributor. WKLD-1, WKLD-2 and WKLD-3 are all active after the switch. HW replication is shown as suspended; MM and SW replication links are shown. The numbers 1 to 5 mark the steps above.]




Deployment of GDPS/Active-Active

Option 1 – create new sysplex environments for active/active workloads

– Simplifies operations as scope of Active/Active environment is confined to just this or these specific workloads and the Active/Active managed data

Option 2 – Active/Active workload and traditional workload co-exist within the same sysplex

– Still will need new active sysplex for the second site

– Increased complexity to manage recovery of Active/Active workload to one place, and remaining systems to a different environment, from within the same sysplex

– Existing GDPS/PPRC customer will have to implement GDPS co-operation support between GDPS/PPRC and GDPS/Active-Active

No single right answer – will depend on client environment and requirements/objectives



GDPS/A-A 1.4 New function summary

Active/Query configuration
– Fulfills SoD made when the Active/Standby configuration was announced

VSAM Replication support
– Adds to IMS and DB2 as the data types supported
– Requires either CICS TS V5 for CICS/VSAM applications or CICS VR V5 for logging of non-CICS workloads

Support for IIDR for DB2 (Qrep) Multiple Consistency Groups
– Enables support for massive replication scalability

Workload switch automation
– Avoids manual checking for replication updates having drained as part of the switch process

GDPS/PPRC Co-operation support
– Enables GDPS/PPRC and GDPS/A-A to coexist without issues over who manages the systems

Disk replication integration
– Provides tight integration with GDPS/MGM for GDPS/A-A to be able to manage disaster recovery for the entire sysplex


GDPS Active/Active Sites Customer Use Case – reducing planned outage downtime by 90%



Customer Background

Large Chinese financial institution

Several critical workloads
– Self-services (ATMs)
– Internet banking
– Internet banking (query-only)

Workloads access data from DB2 tables through CICS

Planned outages
– Minor application upgrades (as needed)
  • Often included DB2 table schema changes
– Quarterly application version upgrades
  • Other planned maintenance activities such as software infrastructure



Customer Goal – Seeking a better recovery time for planned outages

Critical workloads were down for three to four hours
– Scheduled for 3rd shifts local time on weekends to limit impact to banking customers
  • Still affected customers accessing accounts from other world-wide locations
– Site taken down for application upgrades, possible database schema changes, scheduled maintenance
  • All business stopped
  • Required manual coordination across geographic locations to block and resume routing of connections into data center
  • Reload of DB2 data elongated outage period

Goal was to reduce planned outage time for these workloads down to minutes



Customer Solution – Leveraging continuous availability solution to provide better recovery time for planned outages

Solution provides
– A transactionally consistent copy of DB2 on a remote site
  • IBM InfoSphere Data Replication for DB2 for z/OS (IIDR) – provides a high-performance replication solution for DB2
– A method to easily switch selected workloads to a remote site without any application changes
  • IBM Multi-site Workload Lifeline (Lifeline) – facilitates planned outages by rerouting workloads from one site to another without disruption to users
– A centralized point of control to manage the graceful switch
  • GDPS Active/Active Sites – coordinates interactions between IIDR and Lifeline to enable a non-disruptive switch of workloads without loss of data

Reduced impact to their banking customers!
– Total outage time for update workloads was reduced from 3-4 hours down to about 2 minutes
– Total outage time for the query workload was reduced from 3-4 hours down to under 2 minutes




Summary

Manages availability at a workload level

Provides a central point of monitoring & control

Manages replication between sites

Provides the ability to perform a controlled workload site switch

Provides near-continuous data and systems availability and helps simplify disaster recovery with an automated, customized solution

Reduces recovery time and recovery point objectives – measured in seconds

Facilitates regulatory compliance management with a more effective business continuity plan

Simplifies system resource management

GDPS/Active-Active is the next generation of GDPS



Thank You

Tak (Danish)
Danke (German)
Dank u (Dutch)
Obrigado (Brazilian Portuguese)
ขอบคุณ (Thai)
Grazie (Italian)
go raibh maith agat (Gaelic)
Trugarez (Breton)
Merci (French)
Gracias (Spanish)
Спасибо (Russian)
நன்றி (Tamil)
धन्यवाद (Hindi)
شكرًا (Arabic)
감사합니다 (Korean)
תודה רבה (Hebrew)
Tack så mycket (Swedish)
Dankon (Esperanto)
ありがとうございます (Japanese)
谢谢 (Chinese)
děkuji (Czech)
Teşekkür ederim (Turkish)


Additional Charts




Pre-requisite software matrix

Pre-requisite software [version/release level] | GDPS Controller | A-A Systems | non A-A Systems

Operating Systems
z/OS 1.13 or higher | YES | YES | YES

Application Middleware
DB2 for z/OS V9 or higher | NO | YES, wkld dependent | as required
IMS V11 | NO | YES, wkld dependent | as required
WebSphere MQ V7.0.1 | NO | MQ is only req'd for DB2 data replication | as required
CICS Transaction Server for z/OS V5.1 | NO | YES 1) | as required
CICS VSAM Recovery for z/OS V5.1 | NO | YES 1) | as required

1) CICS TS and CICS VR are required when using VSAM replication for A-A workloads

Replication
InfoSphere Data Replication for DB2 for z/OS 10.2 and SPE | NO | YES, wkld dependent | as required 2)
InfoSphere Data Replication for IMS for z/OS V11.1 | NO | YES, wkld dependent | as required 2)
InfoSphere Data Replication for VSAM for z/OS V11.1 | NO | YES, wkld dependent | as required 2)

2) Non-Active/Active systems & their workloads can, if required, use Replication Server instances, but not the same instances as the A-A workloads



Pre-requisite software matrix (cont)

Pre-requisite software [version/release level] | GDPS Controller | A-A Systems | non A-A Systems

Management and Monitoring
GDPS/A-A V1.4 | YES | YES 3) | YES 3)
IBM Tivoli NetView Monitoring for GDPS v6.2 4) | YES | YES | YES 3)
IBM Tivoli Management Services for z/OS V6.3 Fixpack 1 or later | YES 5) | YES 6) | YES 6)
IBM Tivoli Monitoring V6.3 Fix Pack 1 or later | NO | NO | NO
Tivoli System Automation for z/OS V3.4 + SPE APARs | YES | YES | YES
IBM Multi-site Workload Lifeline for z/OS V2.0 | YES | YES | NO

3) GDPS/A-A requires the installation of the GDPS satellite code in production systems where A-A workloads run

4) IBM Tivoli NetView Monitoring for GDPS v6.2 requires IBM Tivoli NetView for z/OS V6.2. NetView Monitoring for GDPS and NetView for z/OS just GA'ed v6.2.1 releases

5) IBM Tivoli NetView Management Services for z/OS is required for the NetView for z/OS Enterprise Management Agent to monitor the A-A solution.

6) IBM Tivoli NetView Management Services for z/OS is optionally required to run where the NetView for z/OS Enterprise Management Agent runs to monitor NetView itself or where OMEGAMON XE products are deployed.

Optional Monitoring Products

Additional products such as Tivoli OMEGAMON XE on z/OS, Tivoli OMEGAMON XE for DB2, and Tivoli OMEGAMON XE for IMS may optionally be deployed to provide specific monitoring of products that are part of the Active/Active sites solution

Note: Details of cross product dependencies are listed in the PSP information for GDPS/A-A, which can be found by selecting the Upgrade:GDPS and Subset:AAV1R4 at the following URL: http://www14.software.ibm.com/webapp/set2/psearch/search?domain=psp&new=y



42

Pre-requisite products

IBM Multi-site Workload Lifeline v2.0
– Advisor – runs on the Controllers & provides information to the external load balancers on where to send connections and information to GDPS on the health of the environment
  • There is one primary and one secondary advisor
– Agent – runs on all production images with active/active workloads defined and provides information to the Lifeline Advisor on the health of that system

IBM Tivoli NetView Monitoring for GDPS v6.2 or higher
– Runs on all systems and provides automation and monitoring functions. This new product pre-reqs IBM Tivoli NetView for z/OS at the same version/release. The NetView Enterprise Master runs on the Primary Controller

IBM Tivoli Monitoring v6.3 FP1
– Can run on zLinux, or distributed servers – provides monitoring infrastructure and portal plus alerting/situation management via Tivoli Enterprise Portal, Tivoli Enterprise Portal Server and Tivoli Enterprise Monitoring Server
– If running NetView Monitoring for GDPS v6.2.1 and NetView for z/OS v6.2.1, ITM v6.3 FP3 is required.
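The division of labour between the Lifeline Agents and the Advisor can be pictured with a toy model. This is a sketch under stated assumptions, not Lifeline's interface: `HealthReport`, `Advisor`, `receive` and `healthy_systems` are hypothetical names, and health is reduced to a single boolean per system and workload.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HealthReport:
    """What a (hypothetical) Agent sends to the Advisor for one workload."""
    system: str    # production image, e.g. "A1P1"
    site: str      # e.g. "site1"
    workload: str  # e.g. "wkld-1"
    healthy: bool


@dataclass
class Advisor:
    """Toy stand-in for the Advisor: keeps the latest report per system/workload."""
    reports: dict = field(default_factory=dict)

    def receive(self, report: HealthReport) -> None:
        # A newer report for the same system and workload replaces the old one.
        self.reports[(report.system, report.workload)] = report

    def healthy_systems(self, workload: str, site: str) -> list:
        """Systems at a site that could be recommended for new connections."""
        return sorted(
            r.system
            for r in self.reports.values()
            if r.workload == workload and r.site == site and r.healthy
        )
```

In this picture, the list returned by `healthy_systems` is the kind of information an Advisor would pass on to the external load balancers and to GDPS.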



43

Pre-requisite products…

IBM InfoSphere Data Replication for DB2 for z/OS v10.2
– Runs on production images where required to capture (active) and apply (standby) data updates for DB2 data. Relies on MQ as the data transport mechanism (QREP)

IBM InfoSphere Data Replication for IMS for z/OS v11.1
– Runs on production images where required to capture (active) and apply (standby) data updates for IMS data. Relies on TCP/IP as the data transport mechanism

IBM InfoSphere Data Replication for VSAM for z/OS v11.1
– Runs on production images where required to capture (active) and apply (standby) data updates for VSAM data. Relies on TCP/IP as the data transport mechanism. Requires CICS TS or CICS VR

System Automation for z/OS v3.4 or higher
– Runs on all images. Provides a number of critical functions:
  • BCPii for GDPS
  • Remote communications capability to enable GDPS to manage sysplexes from outside the sysplex
  • System Automation infrastructure for workload and server management

Optionally the OMEGAMON XE products can provide additional insight to underlying components for Active/Active Sites, such as z/OS, DB2, IMS, the network, and storage
– There are 2 “suite” offerings that include the OMEGAMON XE products (OMEGAMON Performance Management Suite and Service Management Suite for z/OS).



44

Terminology

Active/Active Sites

– This is the overall concept of the shift from a failover model to a continuous availability model.

– Often used to describe the overall solution, rather than any specific product within the solution.

GDPS/Active-Active

– The name of the GDPS product which provides, along with the other products that make up the solution, the capabilities mentioned in this presentation such as workload, replication and routing management and so on.



45

Two Types of Active/Active Workloads

Update Workloads

Currently run only in what is defined as an active/standby configuration

– perform updates to the data associated with the workload

– have a relationship with the data replication component

– not all transactions within this workload will necessarily be update transactions

Query Workloads

Run in what is defined as an active/query configuration

– must not perform any updates to the data associated with the workload

– allows the query workload to run, or could be said to be active, in both sites at the same time

– a query workload must be associated with an update workload
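The rules for the two workload types can be expressed as a small check. This is an illustrative sketch only: `Workload`, `validate` and `sites_running` are hypothetical names and do not correspond to GDPS policy definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workload:
    name: str
    kind: str                              # "update" or "query"
    update_workload: Optional[str] = None  # association needed by query workloads


def validate(workloads: list) -> list:
    """Return rule violations; an empty list means the definitions are consistent."""
    updates = {w.name for w in workloads if w.kind == "update"}
    errors = []
    for w in workloads:
        # Rule from the slide: a query workload must be associated with an update workload.
        if w.kind == "query" and w.update_workload not in updates:
            errors.append(f"{w.name}: query workload needs an associated update workload")
    return errors


def sites_running(workload: Workload, active_site: str, standby_site: str) -> list:
    """Update workloads run active/standby; query workloads can run in both sites."""
    if workload.kind == "update":
        return [active_site]
    return [active_site, standby_site]
```

A query workload defined without an association, or pointing at a name that is not an update workload, would be reported by `validate`.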



46

Multiple Consistency Groups (MCGs) for DB2 – for ultra large scale replication needs

A Consistency Group (CG) corresponds to a set of DB2 tables for which the replication apply process maintains transactional consistency - by applying data-dependent transactions serially, and other transactions in parallel

Multiple Consistency Groups (MCGs) are primarily used to provide scalability

– if and when one CG (Single Consistency Group) cannot keep up with all transactions for one workload

– query workloads can tolerate data replicated with eventual consistency

Q Replication (V10.2.1) can coordinate the Apply programs across CGs to guarantee that a time-consistent point across all CGs can be established at the standby site, following a disaster or outage, before switching workloads to this standby site

GDPS operations on a workload control and coordinate replication for all CGs that belong to this workload

– For example, 'STOP REPLICATION' for a workload, stops replication in a coordinated manner for all CGs (all queues and Capture/Apply programs)

– GDPS supports up to 20 consistency groups for each workload
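One simple way to picture a time-consistent point across several CGs is the latest time up to which every CG has applied its transactions. The sketch below illustrates that idea only. It does not describe how Q Replication actually coordinates the Apply programs, and `ConsistencyGroup`, `applied_through`, `consistent_point` and `stop_replication` are hypothetical names. The limit of 20 CGs per workload is taken from the slide.

```python
from dataclasses import dataclass

MAX_CGS_PER_WORKLOAD = 20  # limit stated on the slide


@dataclass
class ConsistencyGroup:
    name: str
    applied_through: int     # assumption: latest source commit time fully applied
    replicating: bool = True


def consistent_point(groups: list) -> int:
    """Latest time up to which every CG of the workload has applied its changes."""
    return min(g.applied_through for g in groups)


def stop_replication(groups: list) -> int:
    """Stop all CGs of one workload together and report the common point."""
    if len(groups) > MAX_CGS_PER_WORKLOAD:
        raise ValueError("too many consistency groups for one workload")
    point = consistent_point(groups)
    for g in groups:
        g.replicating = False
    return point
```

With two groups that have applied through times 105 and 98, the common point in this model is 98: the slower group determines how far the standby site is consistent as a whole.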



47

Multiple consistency groups (MCG) – deeper insight

[Diagram: replication from Site-1 (DB2 sharing group) across the network to Site-2 (DB2 sharing group). Labels: Workload 1, CG1; Workload 2, CG2, CG3 (multiple consistency groups); Capture1, Capture2; Apply1, Apply2; SOURCE1 to SOURCE6; TARGET1 to TARGET6; send queue, receive queue; MQ Manager 1 to MQ Manager 4; multiple channels for throughput rate.]

Single CG meets the requirements of the majority of workloads

Note: Eventual Consistency is suitable for a large number of READONLY applications



48

Conceptual view

[Diagram: two sites, one running the Active Update Workload & Active Query Workload and the other running the Standby Update Workload & Active Query Workload, linked by S/W Data Replication. Labels: Workload Distribution; Workload Routing to active sysplex; Controllers; Control information passed between systems and workload distributor; Workload Lifeline, Tivoli NetView, System Automation, …]

Any load balancer or workload distributor that supports the Server Application State Protocol (SASP)



49

High level architecture

[Diagram: component stack. Labels: GDPS/Active-Active; SA zOS; NetView; Lifeline; Workload; Monitoring; DB2 Replication; IMS Replication; VSAM Replication; TCP/IP; MQ; z/OS; System z Hardware.]



50


Sample scenario – all workloads active in one site

[Diagram: Site 1 and Site 2 with Primary and Backup Controllers, connected through the network by SASP-compliant routers (LB 1° Tier F5, Sysplex Distrib) with routing for WKLD 1, 2 & 3, and S/W Replication for DB2, VSAM and IMS between the sites.]

Sysplex-A1 [AAPLEX1]
– A1 Prod-sys [A1P1]: wkld-1 active, wkld-2 active, wkld-3 active
– A1 Prod-sys [A1P2]: wkld-1 active, wkld-3 active

Sysplex-A2 [AAPLEX2]
– A2 Prod-sys [A2P1]: wkld-1 standby, wkld-2 standby, wkld-3 standby
– A2 Prod-sys [A2P2]: wkld-1 standby, wkld-3 standby



51

What is an Active/Active Workload?

A workload is the aggregation of these components

– Software: user written applications (e.g. COBOL programs) and the middleware run time environment (e.g. CICS regions, InfoSphere Replication Server instances and DB2 subsystems)

– Data: related set of objects that must preserve transactional consistency and optionally referential integrity constraints (e.g. DB2 Tables, IMS Databases, VSAM Files)

– Network connectivity: one or more TCP/IP addresses and ports (e.g. 10.10.10.1:80)
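The three components of a workload map naturally onto a small record type. This is a minimal sketch for illustration: `Endpoint`, `WorkloadDefinition` and `is_complete` are hypothetical names, not GDPS definitions, and the example values reuse those given on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str
    port: int

    @classmethod
    def parse(cls, text: str) -> "Endpoint":
        """Parse 'address:port', e.g. '10.10.10.1:80'."""
        address, port = text.rsplit(":", 1)
        return cls(address, int(port))


@dataclass(frozen=True)
class WorkloadDefinition:
    name: str
    software: tuple   # e.g. CICS regions, DB2 subsystems
    data: tuple       # e.g. DB2 tables, IMS databases, VSAM files
    endpoints: tuple  # one or more Endpoint values

    def is_complete(self) -> bool:
        """A workload needs all three components: software, data, connectivity."""
        return bool(self.software and self.data and self.endpoints)
```

A definition with software and data but no endpoint would not count as a complete workload in this model, since there would be nothing for the workload distributor to route to.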



52

Data – deeper insight

In DB2 Replication, the mapping between a table at the source and a table at the target is called a subscription

– Example shows 2 subscriptions for tables T1 and T2

A subscription belongs to a QMap which defines the sendq that is used to send data for that subscription

– Example shows that both subscriptions are using the same QMap (SQ1)

In IMS Replication, a subscription is a combination of a source server and a target server

– The subscription is the object that is started/stopped by GDPS/A-A.

– This corresponds to the QMap in Q Replication

Each IMS Replication subscription contains a list of replication mappings

– There is one replication mapping for each IMS database being replicated

– This corresponds to a subscription in Q Replication

[Diagram: SOURCE [SUBS] tables T1 and T2 sent through SENDQ [SQ1] and received through RECVQ [SQ1] into TARGET [SUBS] tables T1 and T2.]
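The correspondence between the DB2 and IMS replication objects can be summarised in a small data model. This is a sketch for illustration only: `QMap`, `ImsSubscription` and their fields are hypothetical names, not the products' configuration objects, and the server and database names in the usage note are made up.

```python
from dataclasses import dataclass, field


@dataclass
class QMap:
    """DB2 (Q Replication): a QMap names the send queue shared by its subscriptions."""
    name: str
    send_queue: str
    subscriptions: dict = field(default_factory=dict)  # source table -> target table

    def subscribe(self, source_table: str, target_table: str) -> None:
        # One subscription per source/target table pair.
        self.subscriptions[source_table] = target_table


@dataclass
class ImsSubscription:
    """IMS Replication: a subscription pairs a source server with a target server."""
    source_server: str
    target_server: str
    mappings: list = field(default_factory=list)  # one entry per IMS database
    started: bool = False

    def start(self) -> None:
        self.started = True

    def stop(self) -> None:
        self.started = False
```

In this model the slide's example is a `QMap("SQ1", "SQ1")` holding two subscriptions, T1 and T2. An `ImsSubscription` plays the role of the QMap (the unit that is started and stopped), while each entry in `mappings` plays the role of a Q Replication subscription.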



53

Architectural building blocks

[Diagram: production systems and Controllers connected by WAN & SASP-compliant routers used for workload distribution, with an SE/HMC LAN.]

Active Production: Lifeline Agent, Workload, IMS/DB2/VSAM Replication Capture, TCPIP, MQ, SA, NetView, Other Automation Product, z/OS

Standby Production: Lifeline Agent, Workload, IMS/DB2/VSAM Replication Apply, TCPIP, MQ, SA, NetView, Other Automation Product, z/OS

Primary Controller: Lifeline Advisor, NetView, SA & BCPii, GDPS/A-A, Tivoli Monitoring, z/OS

Backup Controller: Lifeline Advisor, NetView, SA & BCPii, GDPS/A-A, Tivoli Monitoring, z/OS



54

GDPS/Active-Active (the product)

Automation code is an extension of many of the techniques tried and tested in other GDPS products and in many client environments for management of their mainframe CA & DR requirements

Control code only runs on Controller systems

Workload management - start/stop components of a workload in a given Sysplex

Software Replication management - start/stop replication for a given workload between sites

Disk Replication management - ability to manipulate GDPS/MGM from GDPS/A-A

Routing management - start/stop routing of connections to a site

System and Server management - STOP (graceful shutdown) of a system, LOAD, RESET, ACTIVATE, DEACTIVATE the LPAR for a system, and capacity on demand actions such as CBU/OOCoD

Monitoring the environment and alerting for unexpected situations

Planned/Unplanned situation management and control - planned or unplanned site or workload switches; automatic actions such as automatic workload switch (policy dependent)

Powerful scripting capability for complex/compound scenario automation

