Practical Guidelines for Moab Stacks

Post on 19-Jul-2015

© 2013 ADAPTIVE COMPUTING, INC. 1

Practical Guidelines for Highly Available Moab Stacks

Daniel Hardman, Chief Solutions Architect

@dhh1128 ~ http://codecraft.co ~ http://gplus.to/danielhardman ~ http://lnkd.in/z7PTAR

dhardman@adaptivecomputing.com

April 2013

The Goal of HA

…NOT! :-)

The real goals of HA

▪  Eliminate or reduce “downtime” for running jobs

▪  Eliminate or reduce “downtime” for new submissions

▪  Make failovers visible and manageable

▪  Satisfy regulatory requirements

▪  Preserve audit trail

HA is constrained by time, money

How much are you willing to spend to tolerate:
▪  A power outage?
▪  A software crash?
▪  A hacker from unit 61398 in Shanghai?
▪  The Chelyabinsk meteor?
▪  The Chicxulub meteor that wiped out the dinosaurs?

What is “downtime”?

0 – hardware failure

+3 min – usable, but very slow

-30 min – last checkpoint

+10 min – full restore
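The timeline above supports more than one answer to "how long were we down?". The sketch below is one possible reading, not something stated on the slide: it assumes all offsets are relative to the hardware failure at time 0, and that work done since the last checkpoint is lost and must be redone. The names are illustrative only.

```python
from datetime import timedelta

# Offsets relative to the hardware failure (t = 0), copied from the timeline.
LAST_CHECKPOINT = timedelta(minutes=-30)
USABLE_BUT_SLOW = timedelta(minutes=3)
FULL_RESTORE = timedelta(minutes=10)


def downtime_measures():
    """Return several plausible readings of 'downtime' for one incident."""
    lost_work = -LAST_CHECKPOINT  # assumption: everything since the checkpoint is lost
    return {
        "until_usable": USABLE_BUT_SLOW,
        "until_full_restore": FULL_RESTORE,
        "lost_work": lost_work,
        "lost_work_plus_restore": lost_work + FULL_RESTORE,
    }


if __name__ == "__main__":
    for name, value in downtime_measures().items():
        print(f"{name}: {value}")
```

Depending on which measure an SLA uses, the same incident counts as 3, 10, 30, or 40 minutes, which is why the definition needs to be agreed on before choosing a recipe.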

4 Basic Recipes

▪  Simple built-in HA
▪  Standard pairwise HA
▪  Shared pairwise HA
▪  Advanced HA

Recipe 1: simple, built-in HA

Simple, built-in HA

▪  hot ~ warm (daemons idle on fallback svr)

▪  Moab, TORQUE
    ▪  shared file system, synced clocks, two daemons, last mod date on semaphore

▪  MAM
    ▪  DB replication, primary and fallback server
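The "last mod date on semaphore" idea can be illustrated with a minimal sketch. This is not Moab's or TORQUE's actual implementation: the function names, the semaphore path argument, and the 30-second threshold are all assumptions made for illustration. The primary periodically touches a file on the shared file system, and the fallback treats a stale modification time as a sign that the primary is gone.

```python
import os
import time

STALE_AFTER_SECONDS = 30.0  # hypothetical threshold, not a Moab default


def touch_semaphore(path):
    """Primary side: create the semaphore if needed and refresh its mtime."""
    with open(path, "a"):
        pass
    os.utime(path, None)  # set access and modification times to "now"


def primary_looks_dead(path, now=None, stale_after=STALE_AFTER_SECONDS):
    """Fallback side: True if the semaphore is missing or has gone stale."""
    if now is None:
        now = time.time()
    try:
        age = now - os.stat(path).st_mtime
    except FileNotFoundError:
        return True
    return age > stale_after
```

Note that `age` compares the fallback's local clock with a timestamp written on behalf of another machine. If the clocks drift apart, or the shared file system is slow to propagate metadata, the check can fire when the primary is healthy.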

Sample deployment (simple, built-in HA)

Pros and cons (simple, built-in HA)

▪  Pros
    ▪  Fast and easy to set up
    ▪  Minimal learning curve

▪  Cons
    ▪  Doesn’t protect the solution DB, MWS, Viewpoint
    ▪  Depends on synchronized clocks, reliable propagation of file metadata in shared fs
    ▪  Risk of false triggers
    ▪  Shared FS may be single point of failure, depending on how it’s implemented

Recipe 2: standard, pairwise HA

Standard, pairwise HA

▪  Twin headnodes (all daemons)
▪  hot ~ cold (daemons inert on fallback svr)
▪  Heartbeat, Red Hat clustering
▪  Replicated FS (DRBD)

Sample deployment (standard, pairwise HA)

Pros and cons (standard, pairwise HA)

▪  Pros
    ▪  All services fail over the same way
    ▪  Heartbeat is robust, well understood
    ▪  FS can’t be a single point of failure

▪  Cons
    ▪  Some vulnerability to “split brain” scenario
    ▪  More learning curve
    ▪  More complexity than simple, built-in HA
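The "split brain" risk listed above can be shown with a toy simulation. It is a generic illustration, not how Heartbeat or Red Hat clustering is actually configured: `naive_takeover`, `Arbiter`, and `guarded_takeover` are hypothetical names. When the link between two healthy headnodes is cut, a heartbeat-only rule lets both become active. A single arbiter that both nodes can still reach (for example, a lock on shared storage) is one common way to let only one of them win.

```python
def naive_takeover(sees_peer_heartbeat):
    """Heartbeat-only rule: go active whenever the peer seems silent."""
    return not sees_peer_heartbeat


class Arbiter:
    """Hypothetical lock holder that both nodes can still reach."""

    def __init__(self):
        self.owner = None

    def try_acquire(self, node):
        # The first node to ask becomes the owner; later callers are refused.
        if self.owner is None:
            self.owner = node
        return self.owner == node


def guarded_takeover(node, sees_peer_heartbeat, arbiter):
    """Go active only if the peer is silent AND the arbiter grants the lock."""
    return (not sees_peer_heartbeat) and arbiter.try_acquire(node)


def simulate_partition():
    """Both nodes are alive, but neither hears the other's heartbeat."""
    naive = [naive_takeover(False), naive_takeover(False)]
    arbiter = Arbiter()
    guarded = [
        guarded_takeover("node-a", False, arbiter),
        guarded_takeover("node-b", False, arbiter),
    ]
    return naive, guarded
```

In `simulate_partition()`, the naive rule activates both nodes, while the guarded rule activates only the first node to reach the arbiter.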

Recipe 3: shared, pairwise HA

Shared, pairwise HA

▪  Twin headnodes (all daemons)
▪  hot ~ warm (some daemons inert, some idle on fallback svr)
▪  Heartbeat, Red Hat clustering
▪  DB failover
▪  Shared FS (e.g., GFS2)

Sample deployment (shared, pairwise HA 1)

Sample deployment (shared, pairwise HA 2)

Pros and cons (shared, pairwise HA)

▪  Pros
    ▪  Solves “split brain” scenario
    ▪  May have slightly lower latency

▪  Cons
    ▪  Greater learning curve
    ▪  More complexity

Recipe 4: advanced HA

Advanced HA

▪  Each service (potentially) split onto dedicated box

▪  Daemons are paired and fail over with Heartbeat, Red Hat clustering

▪  DB failover

▪  Replicated or shared FS

Advanced HA

This is less of a recipe and more of a general pattern. Each unique server role has to have N-way redundancy. Configuration complexity is high; we recommend involving professional services.
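To get a rough feel for what N-way redundancy per role buys, the sketch below uses the textbook availability formula. It rests on strong assumptions that the slides do not make: replicas fail independently, failover itself is instantaneous and never fails, and the stack is up only when every role is up. The function names and the sample figures are illustrative.

```python
def combined_availability(single, n):
    """Availability of one role served by n independent replicas.

    The role is down only when all n replicas are down at once:
    1 - (1 - single) ** n.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - single) ** n


def stack_availability(role_availabilities):
    """Availability of a stack that needs every role up at the same time."""
    total = 1.0
    for availability in role_availabilities:
        total *= availability
    return total


if __name__ == "__main__":
    # Hypothetical figures: three roles, each 99% available on a single box.
    single_box = stack_availability([0.99] * 3)
    paired = stack_availability([combined_availability(0.99, 2)] * 3)
    print(f"one box per role: {single_box:.6f}")
    print(f"two boxes per role: {paired:.6f}")
```

Real deployments rarely satisfy the independence assumption (shared power, shared network, shared storage), so treat the output as an upper bound rather than a prediction.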

Pros and cons (advanced HA)

▪  Pros
    ▪  Can meet very aggressive SLAs
    ▪  Can be tailored and fine-tuned

▪  Cons
    ▪  Major implementation effort
    ▪  Requires sophisticated learning and monitoring

General Observations

▪  Important to audit
▪  Super-fast failover not a goal in our recipes
▪  Security implications
▪  Not perf enhancer
▪  Not scalability enhancer
▪  Not DR

More Info

Whitepaper now available. Email me (dhardman@adaptivecomputing.com) for a copy, or download from /documents/ha-moab-cloud-hpc.pdf. Documentation for Hopper release includes a new HA task guide for simple, built-in HA configuration.