Post on 17-Aug-2015
Thinking Like An Architect: Understanding Failure Domains
By: Eric Wright
Let’s Talk Architecture
Image by telmo32
Failure Domains
There are tons of considerations when building any type of architecture.
It doesn’t matter if you’re building a software application or the underlying infrastructure, failure domains are critical.
Quick recap: what are failure domains?
They’re regions or components of the infrastructure with the potential for failure.
The regions can be physical or logical, and each region has unique risks and challenges.
Let’s Keep It Simple
Scenario: you’re running a web application with a single Apache web server and a single MySQL database, each on its own server.
Here are your risks:
Web server: a single instance means one failure takes the whole application offline
Database server: a single instance risks loss of service whenever the application can’t attach to the database
Network: we were smart enough to separate the web and database servers. But… that introduces another point of failure.
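The risk list above can be sketched as a quick inventory check. This is only an illustration: the `inventory` dictionary and the `single_points_of_failure` helper are hypothetical names, and the instance counts simply mirror the scenario.

```python
# Hypothetical inventory of the scenario: tier name -> number of instances.
inventory = {
    "web server": 1,
    "database server": 1,
    "network path": 1,
}


def single_points_of_failure(tiers):
    """Return the tiers with no redundancy (one instance or fewer)."""
    return sorted(name for name, count in tiers.items() if count <= 1)


# Every tier in the simple scenario is a single point of failure.
print(single_points_of_failure(inventory))
```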
Simple to see, but what should we do?
Photo by JD Hancock
Don’t Hesitate, Mitigate
Mitigation is the reduction of risk by some action or design.
Let’s walk through the top strategies for web servers, database servers, and networks.
Web Server Mitigation
Adding more web servers to handle the requests provides:
• Redundancy
• Resiliency
Add a load balancer to the application infrastructure.
It accepts inbound connections and distributes the requests across the server farm.
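To make the idea concrete, here is a minimal round-robin sketch of what a load balancer does when one server in the farm is down. The `RoundRobinBalancer` class and the server names are hypothetical, and a real load balancer would discover failures through health checks rather than a manual `mark_down` call.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy balancer: rotates through servers and skips unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        # Stand-in for a failed health check.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Look at each server at most once per request.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy web servers available")


lb = RoundRobinBalancer(["web1", "web2", "web3"])
lb.mark_down("web2")  # simulate one web server failing
picks = [lb.next_server() for _ in range(4)]
print(picks)  # requests keep flowing to web1 and web3
```

Note that the balancer itself is now a single instance, so it becomes a failure domain of its own.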
Database Server Mitigation
To tolerate the failure of individual nodes, you need a horizontally scalable database architecture.
It also ensures data availability during localized outages.
Luckily, MySQL can be deployed this way, for example with MariaDB, a MySQL-compatible fork whose Galera Cluster option allows for multi-node installations.
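From the application's side, a multi-node database only helps if the client can move on when a node is unreachable. The sketch below simulates that with plain Python; `connect`, `connect_with_failover`, and the node names are hypothetical stand-ins, not the API of any real MySQL or MariaDB driver.

```python
class NodeDown(Exception):
    """Raised by the simulated driver when a node is unreachable."""


def connect(node, down_nodes):
    # Stand-in for a real database driver call (assumption for illustration).
    if node in down_nodes:
        raise NodeDown(node)
    return f"connected to {node}"


def connect_with_failover(nodes, down_nodes=frozenset()):
    """Try each database node in order and return the first connection."""
    errors = []
    for node in nodes:
        try:
            return connect(node, down_nodes)
        except NodeDown as exc:
            errors.append(str(exc))
    raise ConnectionError(f"all nodes unavailable: {errors}")


# db1 is down, so the application attaches to db2 instead.
print(connect_with_failover(["db1", "db2", "db3"], down_nodes={"db1"}))
```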
Network Mitigation
First, we can add multiple network cards and attach the uplink ports to multiple switches.
This lets us withstand a rack switch outage, a single port outage, or even a simple cable failure.
At the networking layer, make sure the right failsafe designs are in place.
They should guard against routing issues and switch issues, and include multiple uplinks to the external network provider for better resiliency.
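A bit of arithmetic shows why redundant paths help. The sketch assumes each uplink fails independently and uses an illustrative 99% availability figure; neither assumption comes from a real measurement, and shared failure causes (the same power feed, the same provider) would make the real number worse.

```python
def parallel_availability(availabilities):
    """Availability of redundant paths: down only if every path is down.

    Assumes the paths fail independently of each other.
    """
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail


single = 0.99  # illustrative availability of one uplink
dual = parallel_availability([single, single])
print(f"one uplink: {single:.4f}, two uplinks: {dual:.4f}")
```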
But… what is the impact of all these solutions?
Have You Ever Heard This Joke?
I had a problem that I decided to use Regex statements to fix.
Now I have 2 problems.
Hilarious, right?
What Can Happen
Adding some more web servers looks easy.
But web farms assume you have a queuing system for your database when you’re performing writes.
We fixed a single point of failure, but introduced complexity.
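Here is a rough sketch of that extra moving part: several web servers push writes onto a queue, and a single writer drains it into the database. It uses Python's in-process `queue.Queue` and threads as a stand-in for real message queuing infrastructure, and the `database` list, `web_server`, and `database_writer` names are hypothetical.

```python
import queue
import threading

write_queue = queue.Queue()
database = []  # stand-in for the real database table


def web_server(name, n_writes):
    """Each web server enqueues its writes instead of hitting the database."""
    for i in range(n_writes):
        write_queue.put(f"{name}-write-{i}")


def database_writer():
    """A single consumer applies writes in the order they were queued."""
    while True:
        item = write_queue.get()
        if item is None:  # sentinel: no more writes are coming
            break
        database.append(item)


writer = threading.Thread(target=database_writer)
writer.start()

servers = [
    threading.Thread(target=web_server, args=(f"web{i}", 3))
    for i in range(1, 4)
]
for t in servers:
    t.start()
for t in servers:
    t.join()

write_queue.put(None)  # tell the writer to stop once the queue is drained
writer.join()
print(len(database))  # all writes from all three web servers arrived
```

The queue solves the write-ordering problem, but it is also one more component that can fail.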
This is a key reason why we focus on DevOps concepts.
It’s also why we have the infrastructure and application teams both fully engaged in architecture decisions.
We’re Finally Finished… Right?
You’ve added:
• New servers
• Load balancers
• Message queuing infrastructure
But what happens if there’s a regional power outage or network outage?
We’re not covered.
Don’t Fall Victim To Analysis Paralysis
There will never be one ultimate solution.
Hopefully, your team loves agile and lean processes, so you can try something and iterate as you uncover deficiencies and new failure domains to mitigate.
And for our latest power outage problem?
You can use servers outside your geographical region, or the cloud, or multiple clouds!
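One way to reason about a regional outage is to ask, tier by tier, whether anything is left running outside the failed region. The sketch below does exactly that; the `placement` map, the region names, and the `survives_outage` helper are hypothetical.

```python
# Hypothetical placement: which regions host an instance of each tier.
placement = {
    "web": ["region-a", "region-b"],
    "database": ["region-a", "region-b"],
    "load balancer": ["region-a"],
}


def survives_outage(placement, failed_region):
    """Map each tier to True if an instance remains outside the failed region."""
    return {
        tier: any(region != failed_region for region in regions)
        for tier, regions in placement.items()
    }


# The load balancer only lives in region-a, so it does not survive.
print(survives_outage(placement, "region-a"))
```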
Here’s The Point
Nobody wants to get caught when the outage occurs and say, “Oh, I didn’t think of that!”
Be acutely aware of failure domains and scenarios when architecting a solution.
Author
Eric Wright
Principal Solutions Architect
VMTurbo