Failover and Global Server Load Balancing for Better Network Availability
Jeremy Hitchcock, CEO
Dynamic Network Services
Overview
• Problem space: Keeping services up
• About Failover and GSLB
• Case Study: Roll your own CDN in...quick
• Case Study: Speed and Stability
• Case Study: DR You Can Sleep On
• General lessons for network availability
You are probably…
• Software service provider
• Completely online
• Uptime and revenue are directly related
• Audience is international (non-geographical)
So is everyone (a lot more of us)!
Failures Are a Way of Life
• Mean Time Between Failures (MTBF) (local)
• Fiber cuts (network/global)
• Affects the bottom line
• Gets people paged
• Brands lose value
A Better Way?
• Current tools: in-house scripts, appliances, CDN networks
• Either high opex or capex
• New options in infrastructure
• Example:
– 5-10 person [bootstrapped] companies rolling self-healing, auto-provisioning networks
Optimizing the Wrong Part
• Hardware redundancy is expensive
• Single points of failure are bad
• Infrastructure is not a core function
• Things break; automate everything
• Easier (cheaper) than you think
Realizations
• Things break; route around outages
• Infrastructure providers aplenty today
• Users are more sensitive to outages
• Internet users are around the world
– Speed of light is still c
– An RTT of 100 ms with 50 objects adds up (see the sketch below)
Traffic management is critical
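To make the RTT point concrete, a back-of-the-envelope sketch; the 100 ms RTT, 50 objects, and 6-connection figures are illustrative assumptions, not measurements:

```python
# Why RTT adds up: rough cost of fetching 50 small objects at 100 ms RTT,
# serially vs. over ~6 parallel connections (all figures are illustrative).
import math

rtt = 0.100      # seconds per round trip
objects = 50
connections = 6  # typical browser per-host connection limit

serial = objects * rtt
parallel = math.ceil(objects / connections) * rtt
print(f"serial: {serial:.1f}s, parallel: {parallel:.1f}s")  # 5.0s vs 0.9s
```

Even with parallel fetches, cutting RTT by serving users from a nearby POP is the only way to shrink every term at once.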
Different Architectures, Different Results
Old → New
• Use hardware redundancy, local → Use software redundancy
• Super-site build-out → Regionalize, all over-provisioned
• Page on failure, fix based on page → Email report in the morning
• Planned deployments → Automatic load handling
• Single master datacenter → Many POPs, all closer to users
• DR is a passive, manual failover → DR and failover blended together
New Tools (new to some)
• Automatic failover
• Global server load balancing
• CDN balancing/managing
• Opex relative to actual usage
• Avoid capex step functions
Failover
• Two active components, traffic switch
• Implies external monitoring
• Hide outages
[Diagrams: standard operation vs. on failover]
Failover Use Cases
• Two servers for www.domain.com
– On failure, redirect from one to the other
– Works via DNS
– Redirect to a static page
• Requirements (a minimal sketch follows)
– External monitoring point
– External DNS
– Low DNS caching TTL values
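A minimal sketch of that failover loop, assuming an external vantage point; the /healthz endpoint, the RFC 5737 example addresses, and the update_dns_record() helper are hypothetical stand-ins for a managed-DNS provider's API:

```python
# External monitor + DNS repoint: probe the primary from outside; if it
# fails, repoint the A record at the backup. Addresses (RFC 5737 examples),
# the /healthz endpoint, and update_dns_record() are hypothetical stand-ins
# for a managed-DNS provider's API.
import time
import urllib.request

PRIMARY = "203.0.113.10"
BACKUP = "203.0.113.20"
TTL = 30  # low TTL so resolvers pick up the switch quickly

def healthy(ip: str) -> bool:
    """Probe a server's health endpoint from this external vantage point."""
    try:
        with urllib.request.urlopen(f"http://{ip}/healthz", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def update_dns_record(name: str, ip: str, ttl: int) -> None:
    """Placeholder for your DNS provider's update API."""
    print(f"point {name} -> {ip} (ttl={ttl})")

while True:
    update_dns_record("www.domain.com", PRIMARY if healthy(PRIMARY) else BACKUP, TTL)
    time.sleep(TTL)
```

The low TTL is what makes the repoint take effect quickly; resolvers that cached the old answer keep using it until the TTL expires.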
Global Server Load Balancing (GSLB)
• More than two active components
• Traffic management
– Targeting (geo, network)
– Weighting (percent)
• Failover plus RTT optimization
• Hostname to A record mapping
Global Server Load Balancing Use Cases
• Regionalize eyeballs/end-users
• Avoid Internet outages/subpar speeds
• Weight based on load, percentages
• Requirements:
– Same as failover
– A bit of math/algorithms to balance traffic (see the sketch below)
– Many-to-many mappings
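The "bit of math" can be as simple as region targeting plus weighted-random selection over healthy endpoints. A sketch; the regions, addresses (RFC 5737 examples), and weights are illustrative assumptions:

```python
# Weighted-random GSLB answer selection: target by region, then balance by
# weight among healthy endpoints. Health data comes from the same external
# monitoring used for plain failover.
import random

POOLS = {
    "us": [("192.0.2.10", 70), ("192.0.2.11", 30)],   # (address, weight %)
    "eu": [("198.51.100.10", 100)],
}
HEALTHY = {"192.0.2.10", "192.0.2.11", "198.51.100.10"}

def answer(region: str) -> str:
    """Pick one A record for a resolver located in `region`."""
    pool = [(ip, w) for ip, w in POOLS.get(region, POOLS["us"]) if ip in HEALTHY]
    if not pool:  # whole region down: fail over across all healthy endpoints
        pool = [(ip, w) for p in POOLS.values() for ip, w in p if ip in HEALTHY]
    ips, weights = zip(*pool)
    return random.choices(ips, weights=weights, k=1)[0]

print(answer("us"))  # e.g. 192.0.2.10 about 70% of the time
```

This is failover and optimization blended together: unhealthy endpoints drop out of the pool, and the remaining weights shift traffic automatically.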
CDN Management
• Two complete systems
• Balance between CDNs
– Bandwidth commits
– Regional advantages
• Works on CNAMEs
CDN Manager
• Try out a mix of networks
– CDNs, infrastructure providers
• Better manage traffic
– Cost/performance reasons
• Requirements
– Same as GSLB, but with DNS alias CNAMEs
• The Internet doesn't care about domain.com
• twitter.com → 128.121.146.228
• Lots of tricks you can do here (see the sketch below)
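One of those tricks rests on the alias chain itself: the resolver follows www.domain.com to customer.cdn.com, and the CDN picks the final address. A quick way to inspect a chain, assuming the third-party dnspython package; the hostname is just an example, and what you see depends on the live zone:

```python
# Inspect the alias chain for a hostname; assumes the third-party
# `dnspython` package (pip install dnspython).
import dns.resolver

name = "www.twitter.com"
try:
    for rr in dns.resolver.resolve(name, "CNAME"):
        print(f"{name} is an alias for {rr.target}")
        name = str(rr.target)
except dns.resolver.NoAnswer:
    pass  # no alias; the name carries A records directly
for rr in dns.resolver.resolve(name, "A"):
    print(f"{name} -> {rr.address}")
```

Swapping the CNAME target moves traffic between CDNs without touching the customer's own zone apex.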
Traffic Cop: DNS
Lenses and Options
• Evaluation criteria
– Soft/hard costs, capital/operating costs
• Outcome based
– Determine your metrics, test those
• Potential outcomes
– Roll it in house
– CDN network
– Hardware appliances
– SaaS-based
Which one is better?
• Roll it in house
– Mid-high capex, higher-than-you-think opex
– Lots of soft costs, though application specific
• CDN network
– Little capex, high opex
– Some have more knobs than others
• Hardware appliances
– High capex, low opex
– Need to make a full investment in the architecture
• SaaS-based
– Little capex, low-mid opex
– Let others worry about this for you
Case Study 1: Roll your own CDN in...quick
Wikia and regionalizing CDNs for better delivery
CDN Choice and Transparency
• Lots of CDNs
– Two great public ones
– 30 (more?) private providers
– Telco/ISP options
• Currently give the customer a hostname
– (customer.cdn.com)
• Can only test with live traffic
CDN Manager: Enabling Testing
• Segment traffic and test
• Try 2 or 10 CDNs
• Low-risk method to collect data
• Data collection has to be from end points
– Your office computer is not the Internet
• Can better rate cost/performance (see the sketch below)
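A sketch of one way to segment: hash the querying resolver so a fixed, small slice of traffic consistently lands on the CDN under test, keeping measurements comparable over time. The hostnames and the 5% split are illustrative assumptions:

```python
# Deterministic traffic split for CDN testing: hash the querying resolver's
# address so the same network always sees the same CDN.
import hashlib

CONTROL = "customer.cdn-a.example."
TRIAL = "customer.cdn-b.example."
TRIAL_PCT = 5  # percent of resolvers sent to the CDN under test

def cname_for(resolver_ip: str) -> str:
    """Return the CNAME target to answer with for this resolver."""
    bucket = int(hashlib.sha1(resolver_ip.encode()).hexdigest(), 16) % 100
    return TRIAL if bucket < TRIAL_PCT else CONTROL

print(cname_for("198.51.100.7"))
```

Because the split is deterministic rather than random per query, performance data collected from those end points stays attributable to one CDN.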
CDN Manager: Wikia
• Wikia runs several niche wikis (audience)
• Optimize traffic delivery for those niches
• Wanted to determine the best CDN based on actual data
CDN Manager: Wikia
• In America, use a CDN
• In Europe, use their own
• Why? Who knows, but it’s the best for their traffic
Discussion
• Not all CDNs are the same
• Multiple relationships to manage
• Cost control/performance of CDNs
• Audience and economies drive decisions
Case Study 2: Speed and Stability
Twitter and keeping up
Speed and Stability
• All Internet sites have DNS
– Ranges from good to bad to ugly
• Online services must be fast and accurate
– Latency and uptime are what matter
• Things fail all the time; send users to what works
Speed and Stability: Twitter
• Spiky and growing traffic (like, a lot)
• Things change too fast to keep up
• Load balance a lot
• Easier to scale core competencies
• One less thing to worry about
Speed and Stability: Twitter
• DNS is part of the system that makes the site work
• Desire not to be an expert in it
• Huge, widespread audience
• Online-only service
Discussion
• When infrastructure changes rapidly, external monitoring is good
• A failover message is better than timeouts
• Keep traffic regionalized through targeting
• Outsource non-core competencies
• Latency affects page views and ad revenue
Case Study 3: Disaster Recovery You Can Sleep With
37 Signals and doing what needs to get done
Disaster Recovery Implementation
• Requirements
– One good facility (A)
– One backup facility (B)
– Ability to recognize facility A is out
– Ability to direct traffic from A to B
Authorize.net Interlude
• DR implementation timeline
– Late July: move to new DR facility and plan
– July 2: fire at Fisher Plaza (unplanned)
– July 3: …
• Only missing a traffic engineering switch
• TTLs (DNS record caching) make a big difference (see the sketch below)
– Still a problem today
– secure.authorize.net. 86400 IN A 64.94.118.32
• Full discussion: http://bit.ly/23mayf
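That 86400-second TTL means resolvers can keep pointing at the old address for up to a day after a switch. A quick TTL check, assuming the third-party dnspython package:

```python
# Look up the record's TTL and translate it into worst-case failover lag;
# assumes the third-party `dnspython` package (pip install dnspython).
import dns.resolver

answer = dns.resolver.resolve("secure.authorize.net", "A")
ttl = answer.rrset.ttl
print(f"TTL {ttl}s -> resolvers may hold the old address ~{ttl / 3600:.1f} hours")
```

Dropping the TTL ahead of a planned move, or keeping it low permanently, is what makes DNS-based DR fast enough to matter.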
DR: 37 Signals
• Cloud-based SaaS tools; have to be up
• External DNS is important for controlling traffic
• What if facility A is down and DNS is only at A?
• External DNS means failover/DR is possible
Discussion
• Ensuring full replication is usually easy
• Traffic management is usually the problem
• Easy to confuse cold assets, warm spares, and hot active
• People wait until they have an outage to implement DR
Overall Notes
• Networked services need to be rock solid
• Failover, GSLB, and CDN management are within reach
• Wikia, Twitter, and 37 Signals use external traffic management for their applications
• Audience matters; so do testing and benchmarking
• DynTini
• twitter.com/dyntini
Copy of the presentation?
Leave a business card in the back (or talk to me afterwards) and I’ll send it to you
Contact Us
Dynamic Network Services, Inc.
1230 Elm St., Fifth Floor, Manchester, NH 03101
+1 888.840.3258 | [email protected] | dyn.com
Join us for drinks: dyntini.com
Follow us on Twitter: @DynInc
Uptime Is the Bottom Line.