
Failover or not to failover

Date post: 28-Jan-2015
Upload: henrik-ingo
Page 1: Failover or not to failover

Failover, or not Failover, that is the question
Percona Live MySQL Conference and Expo 2013
Massimo Brignoli, SkySQL
Henrik Ingo, Nokia

Please share and reuse this presentation, licensed under the Creative Commons Attribution License

Page 2: Failover or not to failover

Agenda

● Why HA is more difficult for databases
● Steps to failover
● Monitoring
● Automating failover
● Sounds great! What could possibly go wrong?
● Amazon Dynamo
● Galera and NDB

Page 3: Failover or not to failover

Fault tolerance = redundancy

● RAID
● 2 power units per server
● Cluster of servers
● 2 kidneys per person
● Redundancy at all levels: Software, Hardware, Network, Electricity...

A chain is only as strong as its weakest link.

Page 4: Failover or not to failover

Durability

"Durability is an interesting concept. If I flush a transaction to disk, it is said to be durable. But if I then take a backup, it is even more durable."

Heikki Tuuri

Page 5: Failover or not to failover

Why High Availability is More Difficult for Databases

Redundancy of server
AND
Redundancy of data
WHILE
Performing thousands of write operations per second onto the dataset

Page 6: Failover or not to failover

What failover?

1. Primary server
2. Secondary / Standby server for redundancy
3. In case Primary fails, Secondary server must become the new Primary

Page 7: Failover or not to failover

Steps to failover (theory)

1. Notice failure
2. Move VIP
3. Continue

Page 8: Failover or not to failover

Automating failover

Generic Clustering Solutions

● Pacemaker/Corosync

● Linux Heartbeat

● Red Hat Cluster Suite

● Solaris Cluster

● Windows Server Failover Clustering

● etc...

MySQL Specific Solutions

● MMM

● PRM

● MHA

● JDBC connector

Page 9: Failover or not to failover

Steps to failover (DRBD)


1. Have DRBD
2. Notice failure
3. Shutdown MySQL on primary
4. Unmount disk on primary
5. Mount disk on secondary
6. Start MySQL on secondary
7. Wait for InnoDB recovery
8. Wait for InnoDB recovery
9. Wait for InnoDB recovery
10. Unset VIP on primary
11. Set VIP on secondary
12. Continue
13. Should you add a new secondary?

Page 10: Failover or not to failover

Steps to failover (MySQL replication)


1. Have replication
2. Notice failure
3. Make slave writable
4. Make master read-only
5. Unset VIP on master
6. Set VIP on slave
7. Continue
8. Should you add a new slave?
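
The sequence above can be sketched in Python as follows. This is only an illustration under stated assumptions: `Node` and `fail_over` are hypothetical names, the node is an in-memory stand-in rather than a real MySQL client, and the VIP is modeled as a simple flag. The `SET GLOBAL read_only` statements are the usual way to toggle writability in MySQL. The sketch also shows what happens to steps 4 and 5 when the old master does not respond.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a MySQL server (not a real client API)."""
    name: str
    read_only: bool
    has_vip: bool = False
    reachable: bool = True
    statements: list = field(default_factory=list)

    def run(self, statement: str) -> bool:
        # A real implementation would send the statement over a client connection.
        if not self.reachable:
            return False
        self.statements.append(statement)
        return True


def fail_over(master: Node, slave: Node) -> list:
    """Apply steps 3 to 6 in the listed order and report what actually happened."""
    report = []

    # Step 3: make the slave writable.
    if slave.run("SET GLOBAL read_only = 0"):
        slave.read_only = False
        report.append("slave writable")

    # Step 4: make the master read-only (impossible if it does not respond).
    if master.run("SET GLOBAL read_only = 1"):
        master.read_only = True
        report.append("master read-only")
    else:
        report.append("master unreachable: still writable")

    # Step 5: unset the VIP on the master (also needs a responsive master).
    if master.reachable:
        master.has_vip = False
        report.append("VIP removed from master")
    else:
        report.append("master unreachable: VIP may still be set")

    # Step 6: set the VIP on the slave. This always succeeds.
    slave.has_vip = True
    report.append("VIP set on slave")
    return report


if __name__ == "__main__":
    old_master = Node("db1", read_only=False, has_vip=True, reachable=False)
    new_master = Node("db2", read_only=True)
    for line in fail_over(old_master, new_master):
        print(line)
```

With an unreachable master, the sketch ends with both nodes writable and both holding the VIP flag, which is the situation a real failover script has to guard against.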

Page 11: Failover or not to failover

What if you have more than 2 servers? (MySQL replication)


● MySQL replication failover with more than 2 servers can be a hassle.
● Which slave should become the new master?
● All slaves must be pointed to the new master.
● They must figure out where to continue replication (binlog position)
● MySQL 5.6 GTID helps.
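
One common answer to "which slave should become the new master?" is to pick the slave that has executed the most of the old master's binlog. The Python sketch below assumes each slave's `SHOW SLAVE STATUS` output has already been collected into a dict; `Relay_Master_Log_File` and `Exec_Master_Log_Pos` are real fields of that output, while `binlog_key`, `pick_new_master` and the host names are hypothetical. It also assumes all slaves replicate from the same master and that binlog file names end in a numeric sequence.

```python
import re


def binlog_key(status: dict) -> tuple:
    """Sort key: (binlog file sequence number, executed position in that file)."""
    match = re.search(r"\.(\d+)$", status["Relay_Master_Log_File"])
    if match is None:
        raise ValueError("unexpected binlog file name: %r"
                         % status["Relay_Master_Log_File"])
    return (int(match.group(1)), int(status["Exec_Master_Log_Pos"]))


def pick_new_master(slaves: dict) -> str:
    """Return the name of the slave that has executed the most of the master's binlog."""
    if not slaves:
        raise ValueError("no slaves to choose from")
    return max(slaves, key=lambda name: binlog_key(slaves[name]))


if __name__ == "__main__":
    # Hypothetical status values, as they might be read from SHOW SLAVE STATUS.
    candidates = {
        "db2": {"Relay_Master_Log_File": "mysql-bin.000012", "Exec_Master_Log_Pos": 4000},
        "db3": {"Relay_Master_Log_File": "mysql-bin.000013", "Exec_Master_Log_Pos": 120},
    }
    print(pick_new_master(candidates))
```

The file sequence number is compared before the position, since a small position in a later file is still ahead of a large position in an earlier one. The remaining slaves would then still have to be repointed and caught up, which this sketch does not attempt.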

Page 12: Failover or not to failover

MHA and SkySQL...

● Combination of resource manager + scripts
● Automating failover process:
○ New Master selection
○ Slaves reconfiguration
○ VIP management
○ Missing binlogs retrieval

Page 13: Failover or not to failover

Sounds great, what could possibly go wrong?

Page 14: Failover or not to failover

Sounds great, what could possibly go wrong?


1. Have replication
○ Ok, is it working? What if it's not working?
○ Is it replicating in the right direction?
○ Does your bash script handle binlog positions correctly?
○ Asynchronous?
2. Notice failure
○ Polling interval
○ Who is polling?
○ ...and from where?
○ How is he handling failure himself?
○ False positives
○ Is failover the right response to every failure?
3. STONITH
○ Shutdown MySQL on Primary? How? It's not responding...
○ Unmount disk on Primary? How? It's not responding...
○ "You need a STONITH device"! Hehe, nice try...
4. Move VIP
○ Unset VIP on Master/Primary? How? It's not responding...
○ Set VIP on Secondary/Slave. This will work fine. Unfortunately.
5. Continue
6. Add back new/same Secondary
○ Automatically of course. Even if it just failed 15 seconds ago.
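
The "polling interval" and "false positives" questions can be made concrete with a small sketch. The Python class below is a hypothetical failure detector (the name `FailureDetector` and the default threshold of 3 are assumptions, not part of any tool named in this deck) that only reports a failure after several consecutive failed polls. It does not answer who is polling or from where.

```python
class FailureDetector:
    """Declare a node failed only after `threshold` consecutive failed health checks."""

    def __init__(self, threshold: int = 3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def observe(self, check_ok: bool) -> bool:
        """Record one poll result; return True once the node counts as failed."""
        if check_ok:
            # Any successful check resets the counter.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


if __name__ == "__main__":
    detector = FailureDetector(threshold=3)
    # Two failed polls, one success, then three failed polls in a row.
    polls = [False, False, True, False, False, False]
    print([detector.observe(ok) for ok in polls])
```

The trade-off is direct: detection time is roughly the threshold times the polling interval, so fewer false positives means a longer outage before failover starts. A node that is merely slow under load can still fail enough consecutive checks to trigger it.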


Page 15: Failover or not to failover

Case Github
https://github.com/blog/1261-github-availability-this-week

● MySQL replication, Pacemaker, Corosync, Percona Replication Manager
● PRM health check fails due to high load during schema migration.
● Failover!
● New node has cold caches, so even worse performance.
● Failover! (back)
● Disable PRM
● A slave is found outdated as replication is not happening
● Enable PRM and hope it will fix it
● Pacemaker segfaults, causing cluster partition
● PRM selects the outdated node as master, shuts down others
● All kinds of data inconsistencies
● Restart PRM on all nodes
● ...

Page 16: Failover or not to failover

Case Github

Lesson learned:

Automated failover is dangerous

Cold cache is dangerous

Page 17: Failover or not to failover

But... Not automating is also dangerous

Baron Schwartz: 75% of replication failures are human errors
http://www.percona.com/about-us/mysql-white-paper/causes-of-downtime-in-production-mysql-servers

80% of aviation accidents are caused by human errors
http://asasi.org/papers/2004/Shappell%20et%20al_HFACS_ISASI04.pdf

80% of events are caused by human errors, 70% of them due to organizational weaknesses
http://www.hss.doe.gov/sesa/corporatesafety/hpc/fundamentals.html

Page 18: Failover or not to failover

Are we solving the right problem?

Page 19: Failover or not to failover

Instead of automating the problem...

Eliminate the problem!

Page 21: Failover or not to failover

Amazon Dynamo

R + W > N

Voldemort, Cassandra, RIAK, DynamoDB, S3
http://openlife.cc/blogs/2012/september/failover-evil

Page 22: Failover or not to failover

N=3, R=W=2

R + W > N
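
Why R + W > N matters can be checked by brute force. The Python sketch below (the function name `quorums_always_overlap` is an assumption for illustration) enumerates every possible set of W written replicas and R read replicas out of N, and tests whether each read set contains at least one replica that received the write.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every set of r read replicas shares a node with every set of w written replicas."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)  # an empty intersection is falsy
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


if __name__ == "__main__":
    for n, r, w in [(3, 2, 2), (3, 1, 2), (3, 1, 3)]:
        print(n, r, w, quorums_always_overlap(n, r, w))
```

For N=3, R=W=2 every read quorum overlaps every write quorum, so a read always reaches at least one replica holding the latest acknowledged write. With R=1, W=2 a read can land on the one replica that missed the write.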

Page 23: Failover or not to failover

Eventual consistency is internal only

R + W > N

Page 24: Failover or not to failover

Failover?

Single node failure is a non-event!

Page 25: Failover or not to failover

For relational databases?

Synchronous replication is a special case of Dynamo:

W=N & R=1

Page 26: Failover or not to failover

Or is there a failover after all?

Due to W=N, writers actually notice node failures! Cluster reconfiguration needed.

(Readers are ok.)
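
The difference between the two configurations shows up as soon as one node fails. The Python sketch below uses a hypothetical helper, `availability`, that reduces quorum availability to counting live nodes; it ignores timeouts, partitions and everything else a real cluster has to handle.

```python
def availability(n: int, r: int, w: int, failed: int) -> dict:
    """Report whether read and write quorums can still be formed after `failed` nodes are lost."""
    if not 0 <= failed <= n:
        raise ValueError("failed must be between 0 and n")
    alive = n - failed
    return {"reads": alive >= r, "writes": alive >= w}


if __name__ == "__main__":
    # Dynamo style: N=3, R=W=2, one node down.
    print(availability(n=3, r=2, w=2, failed=1))
    # Synchronous replication: W=N=3, R=1, one node down.
    print(availability(n=3, r=1, w=3, failed=1))
    # The same cluster after being reconfigured to N=2.
    print(availability(n=2, r=1, w=2, failed=0))
```

With W=N a single failed node blocks writes until the cluster is reconfigured to a smaller N, while reads with R=1 continue. With R=W=2 out of N=3, neither reads nor writes are affected.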


Page 27: Failover or not to failover

Example: Galera

[Diagram labels: OK, Timeout]

Page 28: Failover or not to failover

Example: MySQL NDB Cluster

Page 29: Failover or not to failover

What have we learned?

● Failover with DRBD is painful because it is slow.
● Failover with MySQL replication is painful because it's a mess.
● Amazon Dynamo has no failover
● Galera Cluster has no failover but needs cluster reconfiguration. Same thing...
● MySQL NDB Cluster has failover but you can't see it.

