Page 1: Disaster Recovery 2.0

© 2009 VMware Inc. All rights reserved

Disaster Recovery 2.0 A paradigm shift in DR Architecture.

Iwan ‘e1’ Rahabok

Staff SE, Strategic Accounts

+65 9119-9226 | [email protected] | virtual-red-dot.blogspot.com | sg.linkedin.com/in/e1ang

VCAP-DCD, VCP5

Page 2: Disaster Recovery 2.0

Business Requirements

It is similar to Insurance.
• It's no longer acceptable to run a business without DR protection.

The question is now about…
• How do we cut DR cost & complexity? People cost, technology cost, etc.

Protect the Business in the event of a Disaster.

Page 3: Disaster Recovery 2.0

Disaster did strike in Singapore

29 June 2004: Electricity Supply Interruption
• More than 300,000 homes were left in the dark.
• About 30% of Singapore was affected. If both your Prod and DR datacenters were in this 30%…
• Caused by the disruption of natural gas supply from West Natuna, Indonesia. A valve at the gas receiving station operated by ConocoPhillips tripped. The disrupted gas supply caused 5 units of the combined-cycle gas turbines (CCGT) at Tuas Power Station, Power Seraya Power Station and SembCorp Cogen to trip.
• Some of the CCGTs could not switch to diesel successfully. Investigation into the incident is in progress.

Other Similar Incidents
• The first disruption in natural gas supply occurred on 5 Aug 2002, due to the tripping of a valve in the gas receiving station, which led to a power blackout.

Page 4: Disaster Recovery 2.0

Disaster Recovery (DR) >< Disaster Avoidance (DA)

DA requires that the Disaster be avoidable.
• DA implies that there is time to respond to an impending Disaster. The time window must be large enough to evacuate all necessary systems.

Once avoided, for all practical purposes, there is no more disaster.
• There is no recovery required.
• There is no panic & chaos.

DA is about Preventing (no downtime). DR is about Recovering (already down).
• 2 opposite contexts.

It is insufficient to have DA only.

DA does not protect the business when Disaster strikes.

Get DR in place first, then DA.

Page 5: Disaster Recovery 2.0

DR Context: It's a Disaster, so…

It might strike when we're not ready.
• E.g. the IT team is at an offsite meeting, and the next flight is 8 hours away.
• Key technical personnel are not around (e.g. sick or on holiday).

We can't assume Production is up.
• There might be nothing for us to evacuate or migrate to the DR site.
• Even if the servers are up, we might not be able to access them (e.g. the network is down).

Even if it's up, we can't assume we have time to gracefully shut down or migrate.
• Shutting down multi-tier apps is complex and takes time when you have 100s…

We can't assume certain systems will not be affected.
• The DR Exercise should involve the entire datacenter.

Assume the worst, and start from that point.

Page 6: Disaster Recovery 2.0

Singapore MAS Guidelines

MAS is very clear that DR means a Disaster has happened, as there is an outage.

Clause 8.3.3 states the Total Site should be tested. So if you are not doing an entire-DC test, you are not in compliance.

Page 7: Disaster Recovery 2.0

DR: Assumptions

A company-wide DR Solution shall assume:
• Production is down or not accessible. The entire datacenter, not just some systems.
• Key personnel are not available: Storage admin, Network admin, AD admin, VMware admin, DBA, security, Windows admin, RHEL admin, etc. Intelligence should be built into the system to eliminate reliance on human experts.
• Manual Run Books are not 100% up to date. Manual documents (Word, Excel, etc.) covering every step to recover an entire datacenter are prone to human error. They contain thousands of steps, written by multiple authors. Automation & virtualisation reduce this risk.
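The runbook point above can be illustrated in code. A minimal sketch, with hypothetical step names and no real recovery logic: when the plan is data with explicit dependencies, the execution order is computed rather than remembered by a human expert.

```python
# Sketch only: a recovery "runbook" kept as code instead of a Word document.
# Step names and dependencies are made up for illustration.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    action: object                      # callable doing the real work
    depends_on: list = field(default_factory=list)

def run_plan(steps):
    """Execute steps in dependency order, failing fast on an unsatisfiable plan."""
    done, log, pending = set(), [], list(steps)
    while pending:
        progressed = False
        for step in list(pending):
            if all(d in done for d in step.depends_on):
                step.action()           # in reality: mount LUN, boot VM, etc.
                done.add(step.name)
                log.append(step.name)
                pending.remove(step)
                progressed = True
        if not progressed:
            raise RuntimeError("circular or unsatisfied dependency in plan")
    return log

plan = [
    Step("start-db", lambda: None, depends_on=["mount-storage"]),
    Step("mount-storage", lambda: None),
    Step("start-app", lambda: None, depends_on=["start-db"]),
]
print(run_plan(plan))  # storage first, then DB, then app
```

The point is not the toy scheduler itself, but that a plan in this form can be validated and replayed identically on every Dry Run.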

Page 8: Disaster Recovery 2.0

DR Principles

To Business Users, the actual DR experience must be identical to the Dry Run they experienced.
• In a panic or chaotic situation, users should deal with something they are trained in.
• This means the Dry Run has to simulate Production (without shutting down Production).

Dry Runs must be done regularly.
• This ensures: new employees are covered; existing employees do not forget; the procedures are not outdated (hence incorrect or damaging).
• Annual is too long a gap, especially if many users or departments are involved.

The DR System must be a replica of the Production System.
• Testing with a system that is not identical to production renders the Dry Run invalid.
• Manually maintaining 2 copies of 100s of servers, network, storage and security settings is a classic example of an invalid Dry Run, as the DR System is not the Production system.
• System >< Datacenter. Normally, the DR DC is smaller. System here means the collection of servers, storage, network and security that makes up "an application from a business point of view".

Page 9: Disaster Recovery 2.0

Datacenter wide DR Solution: Technical Requirements

Fully Automated
• Eliminate reliance on many key personnel.
• Eliminate outdated (hence misleading) manual runbooks.

Enable frequent Dry Runs, with 0 impact on Production.
• Production must not be shut down, as this impacts the business. Once you shut down production, it is no longer a Dry Run. An Actual Run is great, but it is not practical, as the Business will not allow the entire datacenter to go down regularly just for IT to test infrastructure.
• No clashing with Production hostnames and IP addresses.
• If Production is not impacted, then users can take their time to test DR. No need to finish within a certain time window anymore.

Scalable to the entire datacenter
• 1000s of servers.
• Cover all aspects of infrastructure, not just server + storage. Network, Security and Backup have to be included so the entire datacenter can be failed over automatically.
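One way to picture the "no clashing with Production" requirement is a pre-flight check over a Dry Run plan. A hypothetical sketch, with made-up address ranges and step names:

```python
# Illustration only: scan a Dry Run plan and flag any step that would
# touch a Production network. Ranges and steps are invented for the sketch.
import ipaddress

PROD_NETS = [ipaddress.ip_network("10.10.10.0/24")]

def violations(plan):
    """Return the (step, ip) pairs that would touch a Production network."""
    bad = []
    for step, ip in plan:
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in PROD_NETS):
            bad.append((step, ip))
    return bad

dry_run_plan = [
    ("boot CRM-Web copy", "192.168.10.10"),  # isolated bubble network: fine
    ("update DNS",        "10.10.10.53"),    # a Production address: flagged
]
print(violations(dry_run_plan))
```

A check like this is cheap to run before every Dry Run, which is exactly what makes frequent testing safe.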

Page 10: Disaster Recovery 2.0

DR 1.0 architecture (current thinking)

Typical DR 1.0 solution (at the infrastructure layer) has the following properties:

Server
• The data drive (LUN) is replicated.
• The OS/App drive is not. So there are 2 copies: Production and DR. They have different host names and IP addresses. They can't be the same, as identical hostname/IP would result in a conflict, because the network spans both datacenters.
• This means the DR system is actually different from Production, even in an actual DR. Production never fails over to DR. Only the data gets mounted.
• Technically, this is not a "production recovery" solution, but a "Site 2 mounting Site 1 data" solution. IT has been telling the Business that IT is recovering Production, while what IT actually does is run a different system; the only thing used from Production is the data.

Storage
• Not integrated with the server. Practically 2 different solutions, manually run by 2 different teams, with a lot of manual coordination and unhappiness.

Network
• Not aware of DR Test and Dry Run. It's 1 network for all purposes.
• Firewall rules are manually maintained on both sides.

Page 11: Disaster Recovery 2.0

DR 1.0 architecture: Limitations

Technically, it is not even a DR solution.
• We do not recover the Production System. We merely mount production Data on a different System. The only way for the System to be recovered is to do a SAN boot at the DR Site.
• Can't prove to audit that DR = Production.
• Registry changes, config changes, etc. are hard to track at the OS and Application level.

Manual mapping of the data drive to the associated server at the DR site.

Not a scalable solution, as manual updates don't scale well to 1000s of servers.

Heavy on scripting, which is not tested regularly.

DR Testing relies heavily on IT expertise.

Page 12: Disaster Recovery 2.0

DR Requirements: Summary

R01. DR copy = Production copy. Dry Run = Actual Run.
• This is to avoid an invalid Dry Run where the System Under Test itself is not the same. No changes are allowed (e.g. to IP address and host name), as changes mean Dry Run >< real DR.

R02. Identical User Experience.
• From the business users' point of view, the entire Dry Run exercise must match the real/actual DR experience.

R03. No impact on Production during Dry Run.
• The DR test should not require Production to be shut down, as it then becomes a real failover. A real failover can't be done frequently, as it impacts the business. The Business will resist testing, making the DR Solution risky due to rare testing.

R04. Frequent Dry Runs.
• This is only possible if Production is not affected.

R05. No reliance on human experts.
• A datacenter-wide DR needs many experts from many disciplines, making it an expensive effort. The actual procedure should be simple, and it should not have to recover from an error state.

R06. Scalable to the entire datacenter.
• The DR solution should scale to 1000s of servers while maintaining RTO and simplicity.

Page 13: Disaster Recovery 2.0

R01: DR Copy = Production Copy

Solution: replicate System + Data, not just the data drive (LUN).
• OS, Apps, settings, etc.

Implication of the solution:
• If the Production network is not stretched, the server will be unreachable. Changing the IP will break the Application.
• If the Production network is stretched, the IP Address and Hostname will conflict with Production. Changing the Hostname will definitely break the Application. A stretched L2 network is not a full solution. Entire-LAN isolation is the solution.

Solution: the entire Dry Run network must be isolated (a bubble network).
• No conflict with Production, even though it's actually identical. It's a shadow of the Production LAN.
• All network services (AD, DNS, DHCP, Proxy) must exist in the Shadow Prod LAN.

Implication of the solution:
• For VMs, this is easily done via vSphere and SRM.
• Physical Servers need to be connected to the Dry Run LAN. A permanent connection simplifies this and eliminates the risk of accidental updates to production.
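The bubble-network idea can be sketched in a few lines. This is an illustration only, not vSphere/SRM behaviour: two isolated networks can each hold the same IP, and a conflict exists only within a single network.

```python
# Toy model of network isolation. "Network" here is just a namespace of
# attached hosts; names and addresses come from the slides' CRM example.
class Network:
    def __init__(self, name):
        self.name = name
        self.hosts = {}                    # ip -> hostname

    def attach(self, hostname, ip):
        if ip in self.hosts:
            raise ValueError(f"IP conflict on {self.name}: {ip}")
        self.hosts[ip] = hostname

prod   = Network("Production LAN")
shadow = Network("Shadow Production LAN")  # the isolated bubble

prod.attach("CRM-Web-Server", "10.10.10.10")
shadow.attach("CRM-Web-Server", "10.10.10.10")  # same IP, no conflict: isolated

try:
    prod.attach("CRM-Web-Server-clone", "10.10.10.10")  # same LAN: conflict
except ValueError as e:
    print(e)
```

This is why the Dry Run copy can keep Production hostnames and IPs untouched, satisfying R01.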

Page 14: Disaster Recovery 2.0

R02: Identical User Experience

[Diagram: Production desktop pools at desktop.ABCCorp.com; on-demand DR Test desktop pools at Desktop-DRTest.ABCCorp.com.]

VDI is a natural companion to DR, as it makes the "front-end" experience seamless.
• Users use a Virtual Desktop as their day-to-day desktop.
• VDI enables us to DR the desktop too.

During a Dry Run
• Users connect to desktop.vmware.com for production and desktop-DR.vmware.com for the Dry Run. Having 2 desktops means the environment is completely isolated.

During an actual Disaster
• Desktop-DR.vmware.com is renamed to desktop.vmware.com, as the original desktop.vmware.com is down (affected by the same Disaster). Users connect to desktop.vmware.com, just like they do day to day, hence creating an identical experience.
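The rename step above amounts to a DNS repoint. A toy sketch with illustrative zone data; the real change would be made on the DNS service or Global LB, not in Python:

```python
# Illustrative zone data: the pool addresses are invented for the sketch.
dns = {
    "desktop.vmware.com":    "10.10.10.50",    # production desktop pool
    "desktop-DR.vmware.com": "192.168.10.50",  # DR desktop pool (Dry Run name)
}

def declare_disaster(zone):
    """Repoint the day-to-day name at the DR pool; users keep the same URL."""
    zone = dict(zone)                      # work on a copy
    zone["desktop.vmware.com"] = zone["desktop-DR.vmware.com"]
    return zone

after = declare_disaster(dns)
print(after["desktop.vmware.com"])         # now resolves to the DR pool
```

Because the name users type never changes, the actual DR experience stays identical to the Dry Run, which is Requirement R02.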

Page 15: Disaster Recovery 2.0

R03: No impact on Production during Dry Run

To achieve the above, the DR Solution:
• Cannot require Production to be shut down or stopped. It must be Business as Usual.
• Must be an independent, full copy altogether, with no reliance on Production components: Network, security, AD, DNS, Load Balancer, etc.

Page 16: Disaster Recovery 2.0

R04: Frequent Dry Run

To achieve the above, the DR Solution cannot:
• Be laborious or prone to human error. A fully automated solution addresses this.
• Touch the production system or network. So it has to be an isolated environment. A Shadow Production LAN solves this.

VMware SRM provides the automation component for VMs.

You should have full confidence that the Actual Failover will work. This can only be achieved if you can do frequent dry runs.

Page 17: Disaster Recovery 2.0

Solution: Dealing with Physical Servers

[Diagram: the CRM system at the Prod Site in Singapore (CRM-Web-Server.vmware.com 10.10.10.10, CRM-App-Server.vmware.com 10.10.10.20, CRM-DB-Server.vmware.com 10.10.10.30) is replicated unchanged into the Shadow Production LAN at the DR Site in Singapore, keeping the same hostnames and IPs. Only the test copy, CRM-DB-Server-Test.vmware.com, gets a new address (20.20.20.30).]

Page 18: Disaster Recovery 2.0

Physical Servers: Dual boot option

The Physical Server must be dual-boot (OS):
• Normal Operation: Test/Dev environment (default boot)
• Dry Run or DR: Shadow Production network

[Diagram: the Shadow Production LAN (10.10.10.x) and the LAN on Datacenter 2 (20.20.20.x), bridged by a Jump Box VM. Without a Jump Box, we cannot access the Shadow Production LAN during a Dry Run. The Jump Box runs on an ESXi host connected to both LANs.]

Page 19: Disaster Recovery 2.0

Physical Servers: Dual partition option

[Diagram: 1 physical box with two partitions: a Test/Dev Partition on the LAN of Datacenter 2 (20.20.20.x) and a DR Partition on the Shadow Production LAN (10.10.10.x). As on the previous slide, a Jump Box VM on an ESXi host connected to both LANs is needed to access the Shadow Production LAN during a Dry Run.]

Page 20: Disaster Recovery 2.0

Production Networks

Typical Physical Network: it’s 1 network

[Diagram: Singapore (Prod Site), Singapore (DR Site) and Country X (any site), each with AD/DNS, non-AD DNS, Production VMs and Production PMs, all reachable from the Users Site.]

Users (from any country) can access any server (physical or virtual) in any country, as there is basically only 1 "network". Routing connects the various LANs.

In 1 "network", we can't have 2 machines with the same host name or the same IP.

Each LAN has its own network address. Hence changing the IP address is required when moving from the Prod Site to the DR Site.

ABC Corp operates in many countries in Asia, with Singapore being the HQ. A system may consist of multiple servers from more than 1 country. DNS service for Windows is provided by MS AD. DNS service for non-Windows machines is provided by non-AD DNS.
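The re-addressing burden can be shown with a small sketch. The prefixes come from the slides; the keep-the-host-octet convention is an assumption for illustration, not a stated rule:

```python
# Sketch of DR 1.0 re-addressing: a host moved from the Prod LAN to the
# DR LAN must take an address inside the DR LAN's prefix.
import ipaddress

prod_lan = ipaddress.ip_network("10.10.10.0/24")
dr_lan   = ipaddress.ip_network("20.20.20.0/24")

def readdress(ip, src, dst):
    """Keep the host portion, swap the network prefix (assumed convention)."""
    ip = ipaddress.ip_address(ip)
    if ip not in src:
        raise ValueError(f"{ip} is not in {src}")
    host_bits = int(ip) - int(src.network_address)
    return ipaddress.ip_address(int(dst.network_address) + host_bits)

print(readdress("10.10.10.30", prod_lan, dr_lan))  # 20.20.20.30
```

This mapping matches the CRM-DB-Server-Test example (10.10.10.30 becoming 20.20.20.30), and every such mapping is one more thing to keep in sync manually, which is exactly what DR 2.0's shadow LAN avoids.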

Page 21: Disaster Recovery 2.0

Site 2 needs to have 2 distinct Network

[Diagram: at Site 2, the DR Server sits on the Shadow Production LAN (10.10.10.x) while the Test/Dev Server sits on the LAN of Datacenter 2 (20.20.20.x). A Jump Box VM, on an ESXi host connected to both LANs, is the only way into the Shadow Production LAN during a Dry Run.]

Page 22: Disaster Recovery 2.0

Mode: Normal Operation or During Dry Run

[Diagram: in Normal Operation or during a Dry Run, the Users Site and the Desktop LAN (30.30.30.x) reach the Production LAN (10.10.10.x) at Site 1 and the Non-Prod LAN (20.20.20.x) at Site 2. The Shadow Production LAN (10.10.10.x) at Site 2 is cut off from the rest, reachable only via the Jump Box.]

Page 23: Disaster Recovery 2.0

Mode: Partial DR

[Diagram: Partial DR mode. The Users Site and the Desktop LAN (30.30.30.x) are connected to the Production LAN (10.10.10.x) at Site 1 and the Non-Prod LAN (20.20.20.x) at Site 2.]

Page 24: Disaster Recovery 2.0

Summary: DR 2.0 and 1.0

R01. DR 1.0: does not meet (it uses 2 copies, which are manually synced). DR 2.0: meets.

R02. DR 1.0: does not meet (the DR system >< the Production system). DR 2.0: meets.

R03. DR 1.0: does not meet (the Dry Run is done on another system, not the production copy). DR 2.0: meets.

R04. DR 1.0: does not meet. DR 2.0: meets.

R05. DR 1.0: does not meet (resource intensive: dual boot, scripts, etc.). DR 2.0: meets.

R06. DR 1.0: does not meet. DR 2.0: meets.

DR 1.0 works for Physical Servers, but does not work well in a Virtual Environment. DR 2.0 fits VMs much better than Physical Servers; the network must have a Shadow Production LAN.

Page 25: Disaster Recovery 2.0

Pre-Failover

[Diagram: the user (10.30.30.30) queries Global DNS for www.abc.com and receives Virtual IP 1 (10.10.10.10) at the Prod Site. The HTTP GET goes to 10.10.10.10. The Prod Site load balancer maps the VIP to a server IP (10.10.10.10 => 10.20.20.31) and source-NATs the client (10.30.30.30 => 10.20.20.20) before the traffic reaches the Production VMs and PMs. Networks shown: 10.10.10.0/24 and 10.20.20.0/24 at the Prod Site; 192.168.10.0/24 and 10.20.20.0/24 at the DR Site, where VIP 2 and its SNAT stand by.]

Page 26: Disaster Recovery 2.0

Post-Failover

[Diagram: Global DNS now answers the query for www.abc.com with Virtual IP 2 (192.168.10.10) at the DR Site. The HTTP GET goes to 192.168.10.10. The DR Site load balancer maps the VIP to a server IP (192.168.10.10 => 10.20.20.31) and source-NATs the client (10.30.30.30 => 10.20.20.20) before the traffic reaches the recovered Production VMs and PMs.]

Page 27: Disaster Recovery 2.0

DR Dry Run

[Diagram: Production keeps running at the Prod Site. The user queries Global DNS for www-dr-test.abc.com and receives Virtual IP 2 (192.168.10.10). The DR Site load balancer maps the VIP to a server IP (192.168.10.10 => 10.20.20.31), source-NATs the client (10.30.30.30 => 10.20.20.20), and directs the traffic to the DR Test VMs and PMs.]

Page 28: Disaster Recovery 2.0

Making it Work

• Strict enforcement that external users use the VIP.
• Strict enforcement that peer vApp stacks use the VIP.
• The DNS failover setting at the global site load balancer has to be manual: a Network admin is needed to update www.abc.com on the global site load balancer to reflect the VIP at the secondary DC.
• Server load-balancer use is only applicable for serving specific applications. Application support by load balancers is vendor dependent, with varying depth of app support.
• Applications will need to support source NAT. Some applications have known issues when used in conjunction with NAT (e.g. FTP); however, server load balancers may be able to mitigate the issues. Vendor dependent.
• Not running a stretched VLAN means VMs with strong systemic dependencies must be placed on the same site, possibly as a vApp. Communication between VMs across sites can only be done using a VIP, where a specific function and pool of VMs must have already been configured.
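The flows on the previous slides (pre-failover, post-failover, dry run) can be modelled in a few lines. Everything here is a toy illustration using the slides' example addresses; real traffic steering happens in the DNS service and the load balancers.

```python
# Toy model of global-LB failover: DNS picks which site's VIP a client gets;
# the site load balancer then maps VIP -> real server and source-NATs the client.
GLOBAL_DNS = {
    "www.abc.com":         "10.10.10.10",    # VIP 1 at the Prod Site
    "www-dr-test.abc.com": "192.168.10.10",  # VIP 2 at the DR Site
}

SITE_LB = {
    "10.10.10.10":   {"server": "10.20.20.31", "snat": "10.20.20.20"},
    "192.168.10.10": {"server": "10.20.20.31", "snat": "10.20.20.20"},
}

def http_get(name, client_ip):
    vip = GLOBAL_DNS[name]
    lb = SITE_LB[vip]
    # The server only ever sees the SNAT address, never the real client IP.
    return {"vip": vip, "server": lb["server"],
            "client_sent": client_ip, "server_sees": lb["snat"]}

def fail_over():
    """The manual step from the slide: repoint www.abc.com at VIP 2."""
    GLOBAL_DNS["www.abc.com"] = GLOBAL_DNS["www-dr-test.abc.com"]

print(http_get("www.abc.com", "10.30.30.30"))   # served via VIP 1
fail_over()
print(http_get("www.abc.com", "10.30.30.30"))   # now served via VIP 2
```

Note how the dry-run name (www-dr-test.abc.com) reaches the DR site without touching the production record, while an actual failover is a single record change.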

Page 29: Disaster Recovery 2.0

DA


From the view of DR

Page 30: Disaster Recovery 2.0

DA & DR in virtual environment

DR and DA solutions do not fit well together in vSphere 5.
• DA requires 1 vCenter. DA needs long-distance migration, which doesn't work across 2 vCenters.
• DR requires 2 vCenters. vCenter prevents the same VM from appearing twice in the same vCenter, and we can't assume the vCenter at the main site is recoverable.

There is confusion around DR + DA.
• You cannot have DA + DR on the same "system". You need 3 instances: 1 primary, 1 secondary for DR purposes, and 1 secondary for DA purposes.
• The next slide explains the limitations of some DA solutions for the DR use case. This is not to criticise the DA solutions, as they are good solutions for the DA use case.

Page 31: Disaster Recovery 2.0

DA Solution: Stretched Cluster (+ Long Distance vMotion)

When an actual Disaster strikes…
• We can't assume Production is up. Hence vMotion is not a solution.
• HA will kick in and boot all VMs. Boot order will not be honoured.

Challenge of the above solution: how do we Test?
• The DR Solution must be tested regularly, as per Requirement R04.
• The test must be identical from the user's point of view, as per Requirement R02.
• So the test would have to be: cut replication, mount the LUNs, add the VMs into vCenter, then boot the VMs. But we cannot mount the LUNs on the same vCenter, as they have the same signature! Even if we could, we would need to know the exact placement of each VM (which is complex). And even then, we cannot boot 2 copies of a VM on the same vCenter! This means the Production VMs must be down, which fails Requirement R03.

Conclusion: a Stretched Cluster does not even qualify as a DR Solution, as it can't be tested and it's 100% manual.

Page 32: Disaster Recovery 2.0

DA Solution: 2 Clusters in 1 VC (+ Long Distance vMotion)

This is a variant of the Stretched Cluster.
• It fixes the risk & complexity of the Stretched Cluster, and there is no performance impact from uncontrolled long-distance vMotion.

When an actual Disaster strikes…
• We can't assume Production is up. Hence vMotion is not a solution.
• HA will not even kick in, as it's a separate cluster. In fact, the VMs will be in an error state, appearing italicised in vCenter.

Challenge of the above solution: how do we Test?
• All the issues facing the Stretched Cluster apply.

Conclusion: 2-Cluster is inferior to the Stretched Cluster from a DR point of view.

Page 33: Disaster Recovery 2.0

Stretched Datacenter: View from the Network

[Speaker note left in the deck: add design info on the complexity of stretching the network (assume no virtualisation, all physical servers).]

A lot of VMware folks don't appreciate the complexity & implications (design, operational, performance, upgrade, troubleshooting) when a network is stretched across 2 physical datacenters (say they are 40 km apart).

Page 34: Disaster Recovery 2.0

Active/Active or Active/Passive


Which one makes sense?

Page 35: Disaster Recovery 2.0

Background

Active/Active Datacenter has many levels of definition:

• Both DCs are actively running workloads, so neither is idle. This means Site 2 can be running non-Production workloads, like Test/Dev and DR.

• Both DCs are actively running Production workloads. Building on the previous level, this means Site 2 must run Production workloads.

• Both DCs are actively running Production workloads, with application-level failover. Building on the previous level, the same App runs on both sides, but the instance on Site 2 is not serving users. It's waiting for an application-level failover. This is typically done via a geo-cluster solution.

• Both DCs are actively running Production workloads, with Active/Active application-level operation. Both Apps are running, normally behind a global Load Balancer. There is no need to fail over, as each App is "complete": it has the full data, and it does not need to tell the other App when its data is updated. No transaction-level integrity is required. This is the ideal, but most apps cannot do this, as the data cannot be split. You can only have 1 copy of the data.

In a vSphere context, this is what Active/Active vSphere means: both vSphere sites are actively running Production VMs.

Page 36: Disaster Recovery 2.0

A closer look at Active/Active

[Diagram: in the Active/Active option, each site's vCenter runs 250 Prod VMs in Prod Clusters and 500 Test/Dev VMs in T/D Clusters, with lots of traffic between the sites (Prod to Prod, T/D to T/D). In the Active/Passive option, one site's vCenter runs 500 Prod VMs in Prod Clusters and the other site's vCenter runs 1000 Test/Dev VMs in T/D Clusters.]

Page 37: Disaster Recovery 2.0

MAS TRM Guideline

It states "near" 0, not 0.

It states "should", not "must".

It states "critical", not all systems. So A/A is only for a subset. This points to an Application-level solution, not an Infrastructure-level one. We can add this capability without changing the architecture, as shown on the next slide.

Page 38: Disaster Recovery 2.0

Adding Active/Active to a mostly Active/Passive vSphere

[Diagram: the mostly Active/Passive vSphere from before (500 Prod VMs in Prod Clusters under one vCenter, 1000 Test/Dev VMs in T/D Clusters under the other vCenter) gains a small Active/Active slice: 1 Cluster of 50 VMs on each side, fronted by Global LBs.]

Page 39: Disaster Recovery 2.0


Thank You
