5 Keys to Building High Availability Web Applications

for Service and Microservice Based Systems

Lee Atchison, Principal Cloud Architect and Advocate

Confidential ©2008–16 New Relic, Inc. All rights reserved.

You had power most of the time.

Why are you complaining?

How do you keep an

application operational?

5 Keys to High Availability Web Applications

Build applications keeping

availability in mind

Key 5

Develop for failure

Services will fail

… always.

Services will fail

As a Service Developer…

Your response to

a dependency

failure must be

Reasonable for the given

dependency failure


How should I

respond when a

dependency fails?

Don’t know something? Don’t show it!

§ Don’t show a drop down list of accounts if you can’t contact the account service

§ Don’t show an image (or show a placeholder) if you can’t determine which image to show

Provide a

graceful backoff
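A minimal sketch of "Don’t know something? Don’t show it!", assuming a hypothetical account service client that raises `AccountServiceUnavailable` when it cannot be reached. The function names and the HTML are illustrative only; the point is that the page still renders when the dependency fails.

```python
from typing import Callable, List, Optional


class AccountServiceUnavailable(Exception):
    """Raised by the (hypothetical) account service client on failure."""


def render_account_picker(fetch_accounts: Callable[[], List[str]]) -> Optional[str]:
    """Return HTML for the account drop-down, or None if it cannot be built."""
    try:
        accounts = fetch_accounts()
    except AccountServiceUnavailable:
        # We don't know the accounts, so we don't show the drop-down at all.
        return None
    options = "".join(f"<option>{name}</option>" for name in accounts)
    return f"<select>{options}</select>"


def render_page(fetch_accounts: Callable[[], List[str]]) -> str:
    """Render the page; the drop-down is optional, the page is not."""
    parts = ["<h1>Dashboard</h1>"]
    picker = render_account_picker(fetch_accounts)
    if picker is not None:
        parts.append(picker)
    return "\n".join(parts)
```

The dependency failure is contained in one place, so the rest of the page never sees it.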

Example (Real Life)

Our web application showing a page…

An avatar was representing the customer on each page

A 3rd party system generated the avatar

One day, that 3rd party system failed

The app didn’t know what to do, so it failed, too

Our application was completely down, all because of a minor icon missing...


Why did this cause your application to fail?

It didn’t know how to respond. It could have:

§ Recognized the failure of the 3rd party provider as soon as possible

§ Substituted a generic image (or removed it) when the service failure was detected

§ Used the Circuit Breaker pattern, which would help a lot here
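One way the Circuit Breaker pattern could look for the avatar case, as a rough sketch rather than a production implementation. The class, its thresholds, and the placeholder path are all assumptions made for illustration; the deck does not prescribe an implementation.

```python
import time
from typing import Callable, Optional

PLACEHOLDER_AVATAR = "/static/generic-avatar.png"  # hypothetical path


class CircuitBreaker:
    """Stops calling a failing dependency for a while and returns a fallback."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], str], fallback: str) -> str:
        # While the breaker is open, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            return fallback
        self.failures = 0  # a success closes the breaker again
        return result
```

A page handler would wrap the avatar lookup in `breaker.call(fetch_avatar, PLACEHOLDER_AVATAR)`, so a failing 3rd party costs a generic icon instead of the whole page.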

How should I

respond when a

dependency fails?

Fail as early as possible:

§ Don’t propagate bad data… once you determine a piece of data is invalid, discard it as soon as possible

§ Validate input given… reject bad input immediately

Provide a

graceful backoff

Example (Real Life)

Account service was having performance problems…

Customers felt a performance problem

Someone was sending bad requests


System had “browned out”


Service tried to process the request…

(And eventually failed)


So, what brought our

application to its knees?

§ Input to the service was obviously bad

§ Yet, we attempted to use the input

§ Result was a failed service
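A small sketch of failing early on bad input, assuming a hypothetical account lookup request with `account_id` and `page_size` fields. The field names and the `MAX_PAGE_SIZE` limit are invented for illustration; the idea is that obviously bad input is rejected before any expensive work starts.

```python
from dataclasses import dataclass


class InvalidRequest(ValueError):
    """Raised as soon as a request is known to be unusable."""


@dataclass(frozen=True)
class AccountLookup:
    account_id: int
    page_size: int


MAX_PAGE_SIZE = 100  # hypothetical limit


def _is_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def parse_account_lookup(raw: dict) -> AccountLookup:
    """Validate the raw request up front; never hand bad data to the service."""
    account_id = raw.get("account_id")
    page_size = raw.get("page_size", 20)
    if not _is_int(account_id) or account_id <= 0:
        raise InvalidRequest("account_id must be a positive integer")
    if not _is_int(page_size) or not 1 <= page_size <= MAX_PAGE_SIZE:
        raise InvalidRequest(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    return AccountLookup(account_id, page_size)
```

Only a validated `AccountLookup` ever reaches the code that does real work, so a stream of bad requests is turned away cheaply instead of tying up the service.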

The Lesson

Always think about scaling


Just because your application works now does not mean it will

work tomorrow…

Key 5

Build applications keeping

availability in mind

OR

Develop for failure

Just because your

application works

now does not mean

it will work

tomorrow… Why?

§ Most web applications have increasing traffic patterns

§ Traffic will increase, double, triple, 10x… sooner than you think

§ Don’t build it for today’s traffic; build it for tomorrow’s traffic

Build for


might mean:

§ Build in the ability to increase the size and capacity of your databases.

§ Determine what logical limits exist to your data scaling. What happens when your database tops out in its capabilities?

§ Build your application so that you can add additional application servers easily. This often involves being observant about where and how state is maintained, and how traffic is routed.

§ Think about caching. What information can be cached? What can't? Why can't it?

§ Redirect static traffic to offline providers.

§ Think about whether specific pieces of dynamic content can actually be generated statically.
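To make the caching question concrete, here is a minimal time-based cache sketch. It assumes the cached value may be up to `ttl_seconds` stale, which is exactly the "what can be cached?" decision each piece of data needs; `TTLCache` and `get_or_load` are illustrative names, not an existing library API.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Caches loader results for a fixed time-to-live, in process memory."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh: no call to the backing store
        value = loader()  # missing or expired: reload and remember
        self._entries[key] = (now, value)
        return value
```

Because this cache lives inside one application server, it is also a piece of state to keep in mind when adding servers: each server holds its own copy.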

Example: Is It Static or Dynamic?

Non-static content

Banner is now static

Personalized content can be added in browser

Always think about scaling


Just because your application works now does not mean it will

work tomorrow…

Mitigate risk

Key 5

Build applications keeping

availability in mind

OR

Develop for failure

All Systems Have Risk in Them

Risk is a measure of the likelihood of a surprise occurring

There is risk that a …

§ Server will crash

§ Database will get corrupted

§ Returned answer will be incorrect

§ Network connection will fail

§ Newly deployed piece of software will fail

§ Keeping a system available requires removing risk…

Hence, removing surprise

§ But as systems become more and more complicated… this becomes less and less possible

Risk Management

is at the heart of building highly available systems

Managing what your risk is

Managing how much risk is acceptable

Knowing what you can do to mitigate the risk

Knowing what you can do to mitigate

the risk

Risk mitigation

Risk Mitigation

Risk mitigation is part of risk management

Risk mitigation:

§ Knowing what to do when a problem occurs in order to reduce the impact of the problem

§ Making sure your application works as best and as completely as possible, even when services and resources fail

Risk Mitigation

Risk mitigation requires thinking about the things that can go wrong

… and putting a plan together, now…

to be able to handle the situation when it does happen.

Always think about scaling


Just because your application works now does not mean it will

work tomorrow…

Mitigate risk

Monitor availability


Yes, we can help you

Key 5

Build applications keeping

availability in mind

OR

Develop for failure

Monitor Availability

§ Understand how your application is performing

§ Use application monitoring:

  § Keep an eye on how your app is performing

  § Generate notifications when the application performs in abnormal ways

§ Make sure your app is properly instrumented

  § Internal as well as external to your app

Monitor Availability

§ Have your tools monitor continuously

§ Establish a baseline for how your application is performing

§ Look for trends and patterns

§ Look for outliers and deviations from the trends

  § Treat these as potential availability issues

§ As your system grows:

  § Examine how your baseline changes

  § Make sure your scalability plan will continue to work
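A rough sketch of "establish a baseline, then look for outliers", assuming response times in milliseconds and a simple rule: flag anything more than `threshold` standard deviations from the baseline mean. The rule and the default of 3.0 are assumptions for illustration; a real monitoring tool will use its own, usually more robust, methods.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_outliers(baseline: Sequence[float], recent: Sequence[float],
                  threshold: float = 3.0) -> List[float]:
    """Return the recent samples that deviate strongly from the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [x for x in recent if x != mu]
    return [x for x in recent if abs(x - mu) / sigma > threshold]
```

For example, with a baseline of `[100, 102, 98, 101, 99]` ms, a recent sample of 250 ms is flagged while 101 ms is not. As the system grows, the baseline should be recomputed so the check follows the new normal.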

Service Level Agreements

Establish Internal SLAs

Quick diagnoses

“Hot spots” to optimize


Critical to building scalable applications

The only way to scale an organization reliably is with reliable SLAs
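As an illustration of what checking an internal SLA could look like, the sketch below assumes an SLA expressed as a minimum availability and a maximum 95th-percentile latency. Those two measures, the `InternalSLA` type, and the nearest-rank percentile are assumptions chosen for the example, not something the deck specifies.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class InternalSLA:
    service: str
    min_availability: float    # e.g. 0.999
    max_p95_latency_ms: float  # e.g. 200.0


def p95(samples_ms: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) without floats
    return ordered[rank - 1]


def check_sla(sla: InternalSLA, successes: int, total: int,
              latencies_ms: Sequence[float]) -> List[str]:
    """Return a list of human-readable SLA violations (empty if none)."""
    if total <= 0:
        raise ValueError("total requests must be positive")
    violations = []
    if successes / total < sla.min_availability:
        violations.append(f"{sla.service}: availability below target")
    if p95(latencies_ms) > sla.max_p95_latency_ms:
        violations.append(f"{sla.service}: p95 latency above target")
    return violations
```

A per-service check like this points at which service is out of bounds, which supports both quick diagnoses and finding "hot spots" to optimize.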

Always think about scaling


Just because your application works now does not mean it will

work tomorrow…

Mitigate risk

Monitor availability


Yes, we can help you

Availability response


Yes, that was your pager that

went off

Key 5

Build applications keeping

availability in mind

OR

Develop for failure


When a problem occurs…

§ Do you know what to do to fix the problem?

§ Does everyone on your team know what to do?

§ Do you have playbooks?

§ Does your pager rotation and notification system work?

You must be prepared to act on issues.

This means:

§ Alerts that reach the needed individuals

§ Prepared processes and procedures for common failure modes (this is part of the risk mitigation process)

When an alert is triggered…

§ The owner of that service must be the first one alerted

§ Other teams may want to be alerted as well…

  § Services that are tightly dependent on triggered service

  § Early warning notification for upstream or downstream issues

  § May want a “second level” notification
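The alerting order above can be sketched as a small routing function. It assumes each service records its owning team and the services it depends on; `Service` and `notification_order` are hypothetical names used only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Service:
    name: str
    owner: str
    depends_on: List[str] = field(default_factory=list)


def notification_order(failing: str, services: Dict[str, Service]) -> List[str]:
    """Owner of the failing service first, then owners of dependent services."""
    recipients = [services[failing].owner]
    for svc in services.values():
        # "Second level": teams whose services depend on the failing one.
        if failing in svc.depends_on and svc.owner not in recipients:
            recipients.append(svc.owner)
    return recipients
```

The first entry is the team that gets paged; the remaining entries are candidates for an early-warning, second-level notification.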

BEFORE the problem occurs:

§ Well established plans

§ Documented processes and cheat sheets

§ Contact lists for critical consuming service owners

§ Clear, precise escalation plan:

  § Who to contact if problem becomes too big for responder to handle

  § If scope of problem extends significantly and critically beyond failing system

§ Know who to escalate to if first responder doesn’t

5 Keys to High Availability Web Apps

Build applications keeping

availability in mind

Always think about scaling

Mitigate risk

Monitor availability

Availability response

Key 5

Thank you for your time!

Questions?

Lee [email protected] @leeatchison leeatchison

Architecting for Scale

Published by: O’Reilly Media

Available: May 2016

www.architectingforscale.com

