Resilient Functional Service Design


Resilient Functional Service Design: The usually forgotten parts of resilient software design

Uwe Friedrichsen – codecentric AG – 2015-2017

@ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com

What’s that “resilience” thing?

Business

Production

Availability

(Almost) every system is a distributed system

Chas Emerick

http://www.infoq.com/presentations/problems-distributed-systems

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.

Leslie Lamport

Failures in today’s complex, distributed and interconnected systems are not the exception.

•  They are the normal case

•  They are not predictable

… and it’s getting “worse”

•  Cloud-based systems

•  Microservices

•  Zero Downtime

•  Mobile & IoT

•  Social Web

→ Ever-increasing complexity and connectivity

Do not try to avoid failures. Embrace them.

resilience (IT) the ability of a system to handle unexpected situations

-  without the user noticing it (best case)

-  with a graceful degradation of service (worst case)

Beware of the “100% available” trap!

Designing for resilience: The pitfall

First, you learn about resilience …

[Figure: resilience pattern map]

Area labels: Detect, Treat, Prevent, Recover, Mitigate, Complement

Other labels: Supporting patterns, System level, Either level, Node level, Statically, Dynamically

Patterns shown: Redundancy, Stateless, Idempotency, Escalation, Zero downtime deployment, Location transparency, Relaxed temporal constraints, Fallback, Shed load, Share load, Marked data, Queue for resources, Bounded queue, Finish work in progress, Fresh work before stale, Deferrable work, Communication paradigm, Isolation, Bulkhead, Monitor, Watchdog, Heartbeat, Acknowledgement, Voting, Synthetic transaction, Leaky bucket, Routine checks, Health check, Fail fast, Let sleeping dogs lie, Small releases, Hot deployments, Routine maintenance, Backup request, Anti-fragility, Diversity, Jitter, Error injection, Spread the news, Anti-entropy, Backpressure, Limit retries, Rollback, Roll-forward, Checkpoint, Safe point, Failover, Read repair, Error handler, Reset, Restart, Reconnect, Fail silently, Default value, Timeout, Circuit breaker, Complete parameter checking, Checksum, Confinement

... then, you digest the stuff just learned

[Figure: the same resilience pattern map, shown again]

Oh, my! Theoretical blah!

Uncool!

Known that for eons anyway. So, let’s move on to the cool parts …

Ah, now we’re talkin’! Here’s the cool stuff!

That’s practical, applicable. Don’t you have more code examples? Or even better: Can’t we turn that all into a live hacking session?

Offline activities?

Hmm, let’s focus on the other stuff.

Uh, sounds like one-off, tough stuff …

Better start with the easier stuff, best with library support

Yeah, more cool stuff!

Aren’t there more libs like Hystrix that we can drag into our projects with a line of configuration?

Well, neat …

I’ll come back to that stuff whenever I really need it

[Figure: the pattern areas Detect, Recover, Mitigate, Prevent and Complement, annotated with “Developer priority” and “Relevance for application robustness”]

Ye be warned!

If you don’t get this part right, nothing else matters

Here be dragons!

This is extremely hard and poorly understood

Let’s recap …

The core parts are

•  extremely important

•  poorly understood

•  massively underestimated

Houston, we have a problem!

Let’s have a closer look at the core parts

[Figure: pattern map with the areas Detect, Treat, Prevent, Recover, Mitigate and Complement]

Isolation

•  System must not fail as a whole

•  Split system into parts and isolate parts against each other

•  Avoid cascading failures

•  Foundation of resilient software design

[Figure: pattern map with Isolation and Bulkhead]

Bulkheads are not about thread pools!

Bulkheads

•  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)

•  Diverse implementation choices available, e.g., µservice, actor, SCS, ...

•  Implementation choice impacts system and resilience design a lot

•  Shaping good bulkheads is extremely hard (pure design issue)

Sounds easy. Where is the problem?

[Figure: Request → Service A → Service B]

Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design

How do we avoid this …

[Figure: Request, Service]

Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design
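The effect of a long call chain on availability can be estimated with simple arithmetic. The following is a minimal sketch, assuming that every service in the chain must answer for the request to succeed and that the services fail independently of each other. Both are simplifying assumptions, and the function name and the 99.9% figure are illustrative, not taken from the talk.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a request that needs ALL listed services to answer.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Illustrative numbers only: every service is assumed to be 99.9% available.
    for n in (1, 5, 10, 50):
        combined = chain_availability([0.999] * n)
        print(f"{n:2d} services at 99.9% each: {combined:.2%} combined")
```

Under these assumptions, ten services at 99.9% each already drop the combined availability to roughly 99.0%, and fifty services to roughly 95.1%.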

... and this ...

[Figure: Request, Mothership Service (a.k.a. Monolith), Services]

By trying to avoid the aforementioned issues we ended up cramming all required functionality into one big service

... without ending up with this?

Let’s use the well-known best practices

•  Divide & conquer a.k.a. functional decomposition

•  DRY (Don’t Repeat Yourself )

•  Design for reusability

•  Layered architecture

•  …

Unfortunately, …

... this usually leads to this ...

[Figure: Request, Service]

Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design

... and this ...

[Figure: Request, Mothership Service (a.k.a. Monolith), Services]

By trying to avoid the aforementioned issues we ended up cramming all required functionality into one big service

... and in the end also often to this.

Welcome to distributed hell!

Caches to the rescue!

Break tight service coupling by caching data/responses of downstream service

Caches to the rescue?

Do you really think that copying stale data all over your system is a suitable measure to fix an inherently broken design?

We have to re-learn design for distributed systems!

A works-out-of-the-box-in-all-contexts, just-add-water-and-stir, three-bullet-point panacea for designing perfect bulkheads

You need lots of those …

... maybe some of those

Then it is a lot of hard work …

... and there is no silver bullet

Yet, a few guiding thoughts about bulkhead design …

Foundations of design

•  High cohesion, low coupling

•  Separation of concerns

•  Crucial across process boundaries

•  Still poorly understood issue

•  Start with

   -  Understanding organizational boundaries

   -  Understanding use cases and flows

   -  Identifying functional domains (→ DDD)

   -  Finding areas that change independently

•  Do not start with a data model!

Short activation paths

•  Long activation paths affect availability

•  Increase latency and likelihood of failures

•  Minimize remote calls per request

•  Need to balance opposing forces

•  Avoid monolith → clear separation of concerns

•  Minimize requests → cluster functionality & data

•  Caches sometimes help, but stale data as trade-off
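The cache trade-off named in the last bullet can be made concrete. The following is a minimal sketch, assuming a hypothetical `fetch` callable that stands for the remote call to a downstream service. The class name, the TTL parameter and the serve-stale-on-error policy are illustrative choices, not something the talk prescribes.

```python
import time


class StaleTolerantCache:
    """Caches downstream responses; serves stale data if the call fails."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # hypothetical remote call: key -> value
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the behaviour can be tested
        self._entries = {}           # key -> (value, stored_at)

    def get(self, key):
        """Return (value, is_stale). Raises if nothing is cached and fetch fails."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0], False   # fresh hit, no remote call needed
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[0], True  # downstream failed: degrade to stale data
            raise
        self._entries[key] = (value, now)
        return value, False
```

The `is_stale` flag leaves it to the caller to decide whether stale data is acceptable for the use case at hand. The cache shortens the activation path only for data that has been requested before.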

Dismiss reusability

•  Reusability increases coupling

•  Reusability leads to bad service design

•  Reusability compromises availability

•  Reusability rarely pays

•  Do not strive for reuse

•  Strive for replaceability instead

Broadening the options ...

[Figure: pattern map with Isolation, Communication paradigm and Bulkhead]

Communication paradigm

•  Request-response <-> messaging <-> events <-> …

•  Heavily influences resilience patterns to be used

•  Also heavily influences functional bulkhead design

•  Very fundamental decision which is often underestimated

Request/Response: Horizontal slicing

[Figure: a flow/process and the µS involved]

Event-driven: Vertical slicing

[Figure: a flow/process and the µS involved]

Synchronous R/R vs. asynchronous events

•  Decomposition

   -  Vertically divide-and-conquer vs. horizontally go-with-the-flow

•  Coordination

   -  Coordination logic/services and orchestration vs. event chains and choreography

•  Transactions

   -  Built-in transaction handling vs. external supervision

•  Error handling

   -  Built into service vs. escalation/supervision strategy

•  Separation of concerns

   -  Multiple-responsibility services vs. single-responsibility services

•  Encapsulation

   -  Domain logic distributed across services vs. domain logic in one place

   -  Reusability vs. replaceability

•  Complexity

   -  A draw …

The communication paradigm influences the functional service design a lot and also the resilience patterns to be used
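The difference between orchestration and choreography can be sketched in a few lines. This is a minimal illustration under several assumptions: the `EventBus` is a tiny in-process, synchronous stand-in for a real pub/sub channel, the service functions are hypothetical, and the event name `shipment_scheduled` is invented for the example.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process stand-in for a pub/sub channel (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.log = []                # published event types, in order

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in list(self._subscribers[event_type]):
            handler(payload)


# Request/response style: a coordinator knows and calls every step itself.
def fulfill_order_orchestrated(order, authorize_payment, ship):
    payment = authorize_payment(order)
    return ship(order, payment)


# Event-driven style: each service only reacts to events and emits new ones.
def wire_choreography(bus):
    def payment_service(order):
        bus.publish("payment_authorized", order)

    def warehouse_service(order):
        bus.publish("shipment_scheduled", order)   # hypothetical event name

    bus.subscribe("order_confirmed", payment_service)
    bus.subscribe("payment_authorized", warehouse_service)
```

In the orchestrated variant, adding a step means changing the coordinator. In the choreographed variant, a new service subscribes to an existing event, and no existing service has to know about it.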

Example: order fulfillment

•  Simple order, credit card, non-digital items

•  Add coupons incl. validation

•  Add promotions incl. usage notification

•  Add bonus card incl. purchase notification

•  Customer accounts as payment type

•  PayPal as payment type

•  Integrate digital music library

•  Integrate digital video library

•  Integrate e-book library

Design exercise – Part 1

Create a bulkhead design for the case study

•  Use one communication paradigm

   -  Synchronous request/response (e.g., REST)

   -  Asynchronous messaging (e.g., Akka)

   -  Asynchronous events (e.g., Pub/Sub)

•  Assume incremental requirements

   -  How many services do you need to touch?

   -  What about the functional isolation of the services?

   -  How big/maintainable are the resulting services?

•  Take a few notes

[Figure: system context. The customer pressed “Buy now” in the Online Shop Checkout. Surrounding systems: Credit Card Provider, Warehouse System, Coupon Management, Campaign Management, PayPal, Loyalty Management, Accounts Receivables, Music Library, E-Book Library, Video Library, E-Mail Server]

[Figure: design with an Order Fulfillment Service. Labels: Online Shop, Order Fulfillment Service, Payment Service, Shipment Service, Promotion, Loyalty, Account Service, Warehouse System, Coupon Management, Campaign Management, Payment Provider, PayPal, Loyalty Management, Accounts Receivables, Music Library, E-Book Library, Video Library, E-Mail Server. Responsibilities noted in the services: Coordinate (three times), Coupon, Credit Card, PayPal, Warehouse, Assets, Notify Cust.]

[Figure: event-driven design. Events: Order confirmed, Payment authorized, Payment failed, Digital asset provisioned, <Event>. Services: Account Service, Credit Card Service, PayPal Service, Warehouse Service, Promotion Service, Bonus Card Service, Coupon Service, Music Library Service, Video Library Service, E-Book Library Service, Notification Service. Systems: Online Shop, Warehouse System, Coupon Management, Campaign Management, Loyalty Management, Accounts Receivables, Music Library, E-Book Library, Video Library, E-Mail Server, PayPal]

Order fulfillment supervisor

•  Track flow of events

•  Reschedule events in case of failure

Services are responsible for eventually succeeding or failing for good, usually incorporating a supervision/escalation hierarchy for that
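What such a supervisor might look like can be sketched as follows. This is a minimal sketch under stated assumptions: the `republish` callback, the timeout, the attempt limit and all method names are hypothetical, and the talk does not prescribe a particular implementation.

```python
import time


class OrderFulfillmentSupervisor:
    """Tracks expected follow-up events and republishes overdue ones."""

    def __init__(self, republish, timeout_seconds, max_attempts=3,
                 clock=time.monotonic):
        self._republish = republish      # hypothetical callback: (order_id, event)
        self._timeout = timeout_seconds
        self._max_attempts = max_attempts
        self._clock = clock              # injectable so the behaviour can be tested
        self._pending = {}               # order_id -> [event, sent_at, attempts]
        self.escalated = []              # orders that need attention beyond retries

    def on_event_sent(self, order_id, event):
        """Record an event whose completion we are waiting for."""
        self._pending[order_id] = [event, self._clock(), 1]

    def on_event_completed(self, order_id):
        """A service reported success or final failure: stop tracking."""
        self._pending.pop(order_id, None)

    def check(self):
        """Reschedule overdue events; escalate once the attempts are used up."""
        now = self._clock()
        for order_id, (event, sent_at, attempts) in list(self._pending.items()):
            if now - sent_at < self._timeout:
                continue                 # still within its time budget
            if attempts >= self._max_attempts:
                del self._pending[order_id]
                self.escalated.append(order_id)
            else:
                self._pending[order_id] = [event, now, attempts + 1]
                self._republish(order_id, event)
```

Rescheduling only works safely if the receiving services tolerate duplicate events, which is where the Idempotency pattern from the pattern map comes in.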

Do not limit your design options upfront without an important reason

Wrap-up

•  Today’s systems are distributed

•  Failures are neither avoidable nor predictable

•  Resilient software design needed

•  Bulkhead design is

   -  crucial for application robustness

   -  poorly understood

   -  massively underrated

   -  different from traditional design best practices

•  Communication paradigms broaden your bulkhead design options

We have to re-learn design for distributed systems
