Date post: 15-Dec-2015
SECONDSITE: DISASTER TOLERANCE AS A SERVICE Shriram Rajagopalan Brendan Cully Ryan O’Connor Andrew Warfield

SECONDSITE: DISASTER TOLERANCE AS A SERVICE

Shriram Rajagopalan

Brendan Cully

Ryan O’Connor

Andrew Warfield

FAILURES IN A DATACENTER

TOLERATING FAILURES IN A DATACENTER

The initial idea behind Remus was to tolerate datacenter-level failures.

REMUS

CAN A WHOLE DATACENTER FAIL?

Yes! It’s a “Disaster”!

DISASTERS

Illustrative Image courtesy of TangoPango, Flickr.

“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.”

- Om Malik, GigaOM

“Truck driver in Texas kills all the websites you really use”

…Southlake FD found that he had low blood sugar

- valleywag.com

DISASTERS..

Water-main break cripples Dallas County computers, operations

The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays - keeping some prisoners in jail longer than normal.

- Dallas Morning News, Jun 2010

DISASTERS..

MORE FODDER BACK HOME

“An explosion … near our server bank … electrical box containing 580 fiber cables.

electrical box … was covered in asbestos … mandated the wearing of hazmat suits ....

Worse yet, the dynamic rerouting —which is the hallmark of the internet … did not function.

In other words, the perfect storm. Oh well. S*it happens.”

- Dan Empfield, Slowtwitch.com, a Gossamer Threads customer.

DISASTER RECOVERY – THE OLD FASHIONED WAY

Storage replication between a primary and backup site.

Manually restore physical servers from backup images.

Data Loss and Long Outage periods.

Expensive Hardware – Storage Arrays, Replicators, etc.

STATE OF THE ART DISASTER RECOVERY

[Figure: VMware Site Recovery Manager. A Protected Site and a Recovery Site each run VirtualCenter with Site Recovery Manager; Datastore Groups are copied between the sites by Array Replication. Labels on the Protected Site: "VMs online in Protected Site", "VMs become unavailable". Labels on the Recovery Site: "VMs offline", "VMs powered on".]

Source: VMWare Site Recovery Manager – Technical Overview

PROBLEMS WITH EXISTING SOLUTIONS

Data Loss & Service Disruption (RPO ~15min, RTO ~few hours)

Complicated Recovery Planning (e.g. service A needs to be up before B, etc.)

Application Level Recovery

Bottom Line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.

DISASTER TOLERANCE AS A SERVICE?

Our Vision

OVERVIEW

A Case for Commoditizing Disaster Tolerance

SecondSite – System Design

Evaluation & Experiences

PRIMARY & BACKUP SITES

5ms RTT

FAILOVER & FAILBACK WITHOUT OUTAGE

Primary Site: Vancouver; Backup Site: Kamloops

Primary Site: Vancouver; Primary Site: Kamloops

Primary Site: Kamloops; Backup Site: Vancouver

Complete State Recovery (CPU, disk, memory, network)

No Application Level Recovery

MAIN CONTRIBUTIONS

Remus (NSDI ’08)

• Checkpoint based State Replication

• Fully Transparent HA

• Recovery Consistency

• No Application level recovery

RemusDB (VLDB ’11)

• Optimize Server Latency

• Reduce Replication Bandwidth by up to 80% using Page Delta Compression and Disk Read Tracking

SecondSite (VEE ’12)

• Failover Arbitration in Wide Area

• Stateful Network Failover over Wide Area

CONTRIBUTIONS..

FAILURE DETECTION IN REMUS

[Figure: Primary and Backup hosts on a LAN, each with NIC1 and NIC2, both attached to the External Network; Checkpoints flow from Primary to Backup.]

• A pair of independent dedicated NICs carry replication traffic.

• Backup declares Primary failure only if

  • It cannot reach Primary via NIC1 and NIC2

  • It can reach External Network via NIC1

• Failure of Replication link alone results in Backup shutdown.

• Split Brain occurs only when both NICs/links fail.
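The rules on this slide can be read as a small decision function on the Backup. The sketch below is one possible reading, not the Remus implementation: the function and enum names are hypothetical, and it assumes that NIC2 is the replication link and that a Backup which cannot reach the External Network simply does not take over.

```python
from enum import Enum


class Action(Enum):
    KEEP_REPLICATING = "keep replicating"
    FAIL_OVER = "declare Primary failed and take over"
    SHUT_DOWN_BACKUP = "shut down Backup"
    NO_TAKEOVER = "do not take over"


def backup_decision(primary_via_nic1: bool,
                    primary_via_nic2: bool,
                    external_via_nic1: bool) -> Action:
    """Decide what the Backup does, following the bullets on the slide.

    Assumptions (not stated on the slide): NIC2 is the replication link,
    and an isolated Backup stays passive.
    """
    if not primary_via_nic1 and not primary_via_nic2:
        # Primary is unreachable on both NICs. Take over only if the
        # Backup itself still reaches the External Network via NIC1;
        # otherwise the Backup is the isolated side.
        if external_via_nic1:
            return Action.FAIL_OVER
        return Action.NO_TAKEOVER
    if not primary_via_nic2:
        # Only the replication link is down: the Primary is still alive,
        # so the Backup shuts down instead of risking a takeover.
        return Action.SHUT_DOWN_BACKUP
    return Action.KEEP_REPLICATING
```

Note that the first branch is also where split brain comes from: if both links fail while the Primary is still running, the Backup sees exactly the same inputs as for a dead Primary.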

FAILURE DETECTION IN WIDE AREA DEPLOYMENTS

Cannot distinguish between link and node failure.

Higher chance of Split Brain, as the network is no longer reliable.

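[Figure: the Primary (in the Primary Datacenter) and the Backup (in the Backup Datacenter), each with NIC1 and NIC2 on its own LAN. Checkpoints travel over a Replication Channel across the WAN, and both sites reach the External Network over the Internet.]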

FAILOVER ARBITRATION

Local Quorum of Simple Reachability Detectors.

Stewards can be placed on third party clouds.

Google App Server implementation with ~100 LoC.

Provider/User could have other sophisticated implementations.

FAILOVER ARBITRATION..

[Figure: the Primary and the Backup, each running Quorum Logic and connected by the Replication Stream, poll Stewards 1 to 5 (POLL 1 to POLL 5). Crosses mark failed links.]

Apriori Steward Set Agreement

Primary: I need majority to stay alive

Backup: I need exclusive majority to failover
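The two quorum rules in the figure ("majority to stay alive", "exclusive majority to failover") can be sketched as follows. This is an illustrative model, not the SecondSite code: the function names are hypothetical, and "exclusive" is interpreted here as "stewards that answer the Backup and have not heard from the Primary". In a real deployment the Backup would learn which stewards the Primary reached from the stewards' poll replies; the sketch passes that set in directly.

```python
# Steward set agreed on a priori by both sites (five stewards, as in the figure).
STEWARDS = frozenset({1, 2, 3, 4, 5})


def majority(stewards=STEWARDS) -> int:
    """Smallest number of stewards that forms a majority."""
    return len(stewards) // 2 + 1


def primary_stays_alive(reached_by_primary, stewards=STEWARDS) -> bool:
    """The Primary keeps running only while it reaches a majority."""
    return len(set(reached_by_primary) & stewards) >= majority(stewards)


def backup_may_failover(reached_by_backup, reached_by_primary,
                        stewards=STEWARDS) -> bool:
    """The Backup fails over only with an exclusive majority.

    Assumption: a steward counts toward the Backup's quorum only if the
    Backup reaches it and the Primary does not.
    """
    exclusive = (set(reached_by_backup) - set(reached_by_primary)) & stewards
    return len(exclusive) >= majority(stewards)
```

Under this reading the two conditions cannot hold at once for the same poll results: if the Backup holds three or more stewards exclusively, the Primary reaches at most two and must stop. For example, with the Primary reaching {1, 2} and the Backup reaching {3, 4, 5}, the Primary shuts down and the Backup fails over; if the Primary also reaches steward 3, the Backup stays passive.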

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION

Remus – LAN - Gratuitous ARP from Backup Host

SecondSite – WAN/Internet – BGP Route Update from Backup Datacenter

Need support from upstream ISP(s) at both Datacenters

IP Migration achieved through BGP Multi-homing

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION..

[Figure: BGP Multi-homing. The Primary Site (Vancouver, 134.87.2.173) and the Backup Site (Kamloops, 207.23.255.237) both connect through BCNet (AS-271) to the Internet. The VMs sit in a stub AS, AS-64678 (134.87.3.0/24), which appears at both sites. Other addresses shown: 134.87.2.174 and 207.23.255.238. Replication runs between the two sites.]

Routing traffic to Primary Site

Re-routing traffic to Backup Site on Failover

AS-path settings shown in the figure:

as-path prepend 64678 64678

as-path prepend 64678 64678 64678 64678

as-path prepend 64678
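AS-path prepending steers traffic because BGP, all else being equal, prefers the announcement with the shorter AS path. The sketch below models only that one step of route selection. Which site uses which prepend setting is an assumption here (the extracted figure does not say), and `announced_path` and `preferred_site` are hypothetical helper names.

```python
STUB_AS = 64678  # stub AS holding 134.87.3.0/24, from the slide


def announced_path(prepend_count: int) -> list:
    """AS path seen upstream: the origin AS plus `prepend_count` prepended copies."""
    return [STUB_AS] * (1 + prepend_count)


def preferred_site(announcements: dict) -> str:
    """Pick the site whose announcement carries the shortest AS path.

    This ignores every other BGP tie-breaker (local preference, MED, ...).
    """
    return min(announcements, key=lambda site: len(announcements[site]))


# Assumed normal operation: Vancouver prepends twice, Kamloops four times.
NORMAL = {
    "Vancouver": announced_path(2),
    "Kamloops": announced_path(4),
}

# Assumed failover: Kamloops re-announces with a single prepend, so its path
# is shorter even if Vancouver's old announcement is still visible.
AFTER_FAILOVER = {
    "Vancouver": announced_path(2),
    "Kamloops": announced_path(1),
}
```

With these assumed settings, `preferred_site(NORMAL)` selects Vancouver and `preferred_site(AFTER_FAILOVER)` selects Kamloops, which matches the two traffic directions labelled in the figure.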

OVERVIEW

A Case for Commoditizing Disaster Tolerance

SecondSite – System Design

Evaluation & Experiences

EVALUATION

I want periodic failovers with no downtime!

Did you run regression tests?

Failover Works!!

More than one failure?

I will have to restart HA!

RESTARTING HA

Need to Resynchronize Storage.

Avoiding Service Downtime requires Online Resynchronization

Leverage DRBD: it resynchronizes only the blocks that have changed.

Integrate DRBD with Remus: add a checkpoint-based asynchronous disk replication protocol.
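Resynchronizing only the changed blocks amounts to remembering which blocks were written while the peer was away and copying just those. The sketch below illustrates that idea with an in-memory model; it is not DRBD's implementation or API, and every class and function name is hypothetical.

```python
class Disk:
    """A toy block device: a fixed-size list of block contents."""

    def __init__(self, num_blocks: int):
        self.blocks = [b""] * num_blocks


class TrackingDisk(Disk):
    """The surviving site's disk: records which blocks change during the outage."""

    def __init__(self, num_blocks: int):
        super().__init__(num_blocks)
        self.dirty = set()  # indices written since the peer was lost

    def write(self, index: int, data: bytes) -> None:
        self.blocks[index] = data
        self.dirty.add(index)  # rewriting a block keeps a single entry


def resynchronize(source: TrackingDisk, target: Disk) -> int:
    """Copy only the dirty blocks to the peer and return how many were sent."""
    copied = 0
    for index in sorted(source.dirty):
        target.blocks[index] = source.blocks[index]
        copied += 1
    source.dirty.clear()
    return copied
```

For two 1000-block disks that start out identical, writing block 7 twice and block 42 once during the outage leads `resynchronize` to send two blocks instead of the full thousand, so resynchronization cost tracks the amount of change rather than the disk size.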

REGRESSION TESTS

Synthetic Workloads to stress test the Replication Pipeline

Failovers every 90 minutes

Discovered some interesting corner cases

Page-table corruptions in memory checkpoints

Write-after-write I/O ordering in disk replication

SECONDSITE – THE COMPLETE PICTURE

• Service Downtime includes timeout for failure detection (10s)

• Failure Detection Timeout is configurable

4 VMs x 100 Clients/VM

REPLICATION BANDWIDTH CONSUMPTION

4 VMs x 100 Clients/VM

DEMO

Expect a real disaster (conference demos are not a good idea!)

APPLICATION THROUGHPUT VS. REPLICATION LATENCY

SPECWeb w/ 100 Clients

Kamloops

RESOURCE UTILIZATION VS. APPLICATION LOAD

Domain-0 CPU Utilization

Bandwidth usage on Replication Channel

Cost of HA as a function of Application Load (OLTP w/ 100 Clients)

RESYNCHRONIZATION DELAYS VS. OUTAGE PERIOD

OLTP Workload

SETUP WORKFLOW – RECOVERY SITE

The user creates a recovery plan, which is associated with one or more protection groups.

Source: VMWare Site Recovery Manager – Technical Overview

RECOVERY PLAN

VM Shutdown

High Priority VM Recovery

Prepare Storage

High Priority VM Shutdown

Normal Priority VM Recovery

Low Priority VM Recovery

Source: VMWare Site Recovery Manager – Technical Overview
