Understanding Failure Domains


Thinking Like An Architect: Understanding Failure Domains

By: Eric Wright

Let’s Talk Architecture

Image by telmo32

Failure Domains

There are tons of considerations when building any type of architecture.

Whether you’re building a software application or the underlying infrastructure, failure domains are critical.

Quick recap: what are failure domains?

They’re regions or components of the infrastructure with the potential for failure.

The regions can be physical or logical, and each region has unique risks and challenges.

Let’s Keep It Simple

Scenario: you’re running a web application with a single Apache web server and a single MySQL database, each on its own server.

Here are your risks:

Web server: a single instance means one failure takes the whole application offline

Database server: a single instance risks loss whenever the application is unable to attach to the database

Network: we were smart enough to separate the web and database servers. But… that introduces another point of failure.
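To put a rough number on these risks, here is a minimal sketch of how availability compounds when every component must be up at once. The 99% figures are made-up assumptions for illustration, not measurements, and the calculation assumes the components fail independently.

```python
from math import prod

def serial_availability(availabilities):
    """Availability of a chain where every component must be up at once."""
    return prod(availabilities)

# Hypothetical figures: each component is assumed to be up 99% of the time.
components = {
    "web server": 0.99,
    "network link": 0.99,
    "database server": 0.99,
}

overall = serial_availability(components.values())
print(f"Overall availability: {overall:.4f}")  # prints 0.9703
```

Three components that each look fine on their own combine into a chain that is down roughly 3% of the time under these assumptions.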

Simple to see, but what should we do?

Photo by JD Hancock

Don’t Hesitate, Mitigate

Mitigation is the reduction of risk by some action or design.

Let’s walk through the top strategies for web servers, database servers, and networks.

Web Server Mitigation

Adding more web servers to handle the requests provides:

• Redundancy
• Resiliency

Add a load balancer into the application infrastructure.

The load balancer accepts inbound connections and distributes the requests across the server farm.
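Here is a minimal sketch of that idea: a round-robin picker that skips servers marked as down. The class name, the server names, and the way health is tracked are all hypothetical; a real load balancer would run its own health checks and handle live connections.

```python
class RoundRobinBalancer:
    """Hands out healthy backends in rotation (illustrative only)."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, starting where we left off.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy web servers available")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
first_three = [balancer.pick() for _ in range(3)]

balancer.mark_down("web-2")  # simulate a web server failure
after_failure = [balancer.pick() for _ in range(4)]
```

With one server down, requests keep flowing to the remaining two. Note that the balancer itself is now a component that can fail.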

Database Server Mitigation

To allow for failures of certain nodes:

You need a horizontally scalable database architecture.

It also ensures data availability during localized outages.

Luckily, a MySQL-based application can be deployed this way with MariaDB, a MySQL-compatible relational database that allows for multi-node installations (for example, through Galera Cluster).

Network Mitigation

First, we can add multiple network cards and attach the uplink ports to multiple switches.

This lets us withstand a rack switch outage, a single port outage, or even a simple cable failure.

At the networking layer: make sure the right failsafe designs are in place.

This guards against routing issues and switch issues. Add multiple uplinks to the external network provider for better resiliency.

But… what is the impact of all these solutions?

Have You Ever Heard This Joke?

I had a problem that I decided to use Regex statements to fix.

Now I have 2 problems.

Hilarious, right?

What Can Happen

Adding some more web servers looks easy.

But web farms assume you have a queuing system for your database when you’re doing write operations.

We fixed a single point of failure, but introduced complexity.
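One way to picture that queuing requirement is a minimal in-process sketch, in which several web servers enqueue their writes and a single writer applies them in arrival order. The dictionary standing in for the database, the server names, and the key format are all assumptions for illustration; a production setup would use a real message queue.

```python
import queue
import threading

write_queue = queue.Queue()
database = {}  # stand-in for the real database

def web_server(name, count):
    # Each web server enqueues its writes instead of touching the database.
    for i in range(count):
        write_queue.put((f"{name}-order-{i}", name))

def database_writer():
    # A single consumer applies writes one at a time, in arrival order.
    while True:
        item = write_queue.get()
        if item is None:  # sentinel: no more writes are coming
            break
        key, value = item
        database[key] = value

writer = threading.Thread(target=database_writer)
writer.start()

servers = [
    threading.Thread(target=web_server, args=(f"web-{n}", 5))
    for n in (1, 2, 3)
]
for server in servers:
    server.start()
for server in servers:
    server.join()

write_queue.put(None)  # tell the writer to stop once the queue drains
writer.join()
```

The queue is exactly the kind of added complexity described here: it is one more component that has to stay available.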

This is a key reason why we focus on DevOps concepts.

It’s also why we have the infrastructure and application teams both fully engaged in architecture decisions.

We’re Finally Finished… Right?

You’ve added:

• New servers
• Load balancers
• Message queuing infrastructure

But what happens if there’s a regional power outage or network outage?

We’re not covered.

Don’t Fall Victim To Analysis Paralysis

There will never be one ultimate solution.

Hopefully, your team loves agile and lean processes, so you can try something, find the deficiencies, and iterate on your failure domain mitigation.

And for our latest power outage problem?

You can use servers outside your geographical region, or the cloud, or multiple clouds!
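As a rough sketch of the multi-region idea, the function below picks the first region in a preference list that passes a health check. The region names and the health check are hypothetical placeholders; real failover would also have to deal with data replication and DNS or routing changes.

```python
def choose_region(regions, is_healthy):
    """Return the first region, in preference order, that passes the health check."""
    for region in regions:
        if is_healthy(region):
            return region
    raise RuntimeError("all regions are unavailable")

# Hypothetical regions, listed in order of preference.
REGIONS = ["on-prem-east", "cloud-a-west", "cloud-b-central"]

# Simulate a regional power outage at the primary site.
outage = {"on-prem-east"}
active = choose_region(REGIONS, lambda region: region not in outage)
print(active)  # prints cloud-a-west
```

Each extra region widens the failure domain you can survive, at the cost of more infrastructure to run and test.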

Here’s The Point

Nobody wants to get caught when the outage occurs and say, “Oh, I didn’t think of that!”

Be acutely aware of failure domains and scenarios when architecting a solution.

Author

Eric Wright
Principal Solutions Architect
VMTurbo
