Resilient Functional Service Design
The usually forgotten parts of resilient software design
Uwe Friedrichsen – codecentric AG – 2015-2017
@ufried Uwe Friedrichsen | [email protected] | http://slideshare.net/ufried | http://ufried.tumblr.com
What’s that “resilience” thing?
Business
Production
Availability
(Almost) every system is a distributed system
Chas Emerick
http://www.infoq.com/presentations/problems-distributed-systems
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
Leslie Lamport
Failures in today’s complex, distributed and interconnected systems are not the exception.
• They are the normal case
• They are not predictable
… and it’s getting “worse”
• Cloud-based systems
• Microservices
• Zero Downtime
• Mobile & IoT
• Social Web
→ Ever-increasing complexity and connectivity
Do not try to avoid failures. Embrace them.
resilience (IT): the ability of a system to handle unexpected situations
- without the user noticing it (best case)
- with a graceful degradation of service (worst case)
Beware of the “100% available” trap!
Designing for resilience
The pitfall
First, you learn about resilience …
[Figure: resilience pattern map]
Categories: Core, Detect, Treat, Prevent, Recover, Mitigate, Complement
Level labels: Node level, System level, Either level
Patterns and labels: Supporting patterns, Redundancy, Stateless, Idempotency, Escalation, Zero downtime deployment, Location transparency, Relaxed temporal constraints, Fallback, Shed load, Share load, Marked data, Queue for resources, Bounded queue, Finish work in progress, Fresh work before stale, Deferrable work, Communication paradigm, Isolation, Bulkhead, Monitor, Watchdog, Heartbeat, Acknowledgement, Voting, Synthetic transaction, Leaky bucket, Routine checks, Health check, Fail fast, Let sleeping dogs lie, Small releases, Hot deployments, Routine maintenance, Backup request, Anti-fragility, Diversity, Jitter, Error injection, Spread the news, Anti-entropy, Backpressure, Retry, Limit retries, Rollback, Roll-forward, Checkpoint, Safe point, Failover, Read repair, Error handler, Reset, Restart, Reconnect, Fail silently, Default value, Timeout, Circuit breaker, Complete parameter checking, Checksum, Statically, Dynamically, Confinement
... then, you digest the stuff just learned
[Figure: the same resilience pattern map, annotated with the reactions below]
Oh, my! Theoretical blah! Uncool! Know that anyway for eons. So, let’s move on to the cool parts …
Ah, now we’re talkin’! Here’s the cool stuff! That’s practical, applicable. Don’t you have more code examples? Or even better: Can’t we turn that all into a live hacking session?
Offline activities? Hmm, let’s focus on the other stuff.
Uh, sounds like one-off, tough stuff … Better start with the easier stuff, best with library support
Yeah, more cool stuff! Aren’t there more libs like Hystrix that we can drag into our projects with a line of configuration?
Well, neat … I’ll come back to that stuff whenever I really need it
[Figure: Core, Detect, Recover, Mitigate, Treat, Prevent and Complement compared by developer priority and by relevance for application robustness]
Ye be warned!
If you don’t get this part right, nothing else matters
Here be dragons!
This is extremely hard and poorly understood
Let’s recap …
The core parts are
• extremely important
• poorly understood
• massively underestimated
Houston, we have a problem!
Let’s have a closer look at the core parts
[Figure: resilience pattern map with Isolation highlighted]
Isolation
• System must not fail as a whole
• Split system in parts and isolate parts against each other
• Avoid cascading failures
• Foundation of resilient software design
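As a minimal sketch of what "isolate parts against each other" can mean, the following Python fragment composes a response from independent parts and confines a failure to the part that caused it. The function names and the placeholder text are assumptions made for illustration. In a real system the parts would usually be separate processes or services, not functions in one process.

```python
from typing import Callable, Dict


def render_page(parts: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Call every part independently; a failing part degrades only itself."""
    result: Dict[str, str] = {}
    for name, part in parts.items():
        try:
            result[name] = part()
        except Exception:
            # Confine the failure to this part instead of failing the whole page.
            result[name] = "(temporarily unavailable)"
    return result


def product_details() -> str:
    # Hypothetical part that works.
    return "Blue widget, 9.99"


def recommendations() -> str:
    # Hypothetical part whose backing service is down.
    raise ConnectionError("recommendation service is down")


page = render_page({"details": product_details, "recommendations": recommendations})
print(page)
```

The page is still answered with the product details, which is the graceful degradation case: one part fails, the system as a whole does not.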
[Figure: resilience pattern map with Isolation and Bulkhead highlighted]
Bulkheads are not about thread pools!
Bulkheads
• Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)
• Diverse implementation choices available, e.g., µservice, actor, SCS, ...
• Implementation choice impacts system and resilience design a lot
• Shaping good bulkheads is extremely hard (pure design issue)
Sounds easy. Where is the problem?
[Figure: Request → Service A → Service B]
Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design
How do we avoid this …
[Figure: Request → Service]
Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design
... and this ...
[Figure: many Service boxes and a Mothership Service (a.k.a. Monolith); Request → Mothership Service]
By trying to avoid the aforementioned issues we ended up cramming all required functionality into one big service, i.e. the isolation is broken by design
... without ending up with this?
Let’s use the well-known best practices
• Divide & conquer a.k.a. functional decomposition
• DRY (Don’t Repeat Yourself)
• Design for reusability
• Layered architecture
• …
Unfortunately, …
[Figure: Request → Service A → Service B]
Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design
... this usually leads to this ...
[Figure: Request → Service]
Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design
... and this ...
[Figure: many Service boxes and a Mothership Service (a.k.a. Monolith); Request → Mothership Service]
By trying to avoid the aforementioned issues we ended up cramming all required functionality into one big service, i.e. the isolation is broken by design
... and in the end also often to this.
Welcome to distributed hell!
Caches to the rescue!
[Figure: Request → Service A → Service B, with a cache of B]
Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design
Break tight service coupling by caching data/responses of downstream service
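A rough sketch of such a cache, assuming Service A keeps the last response of Service B per key and falls back to it when B cannot be reached. The class name `StaleTolerantCache`, the `ttl_seconds` parameter and the `call_service_b` stand-in are hypothetical. The returned flag makes the trade-off explicit: the caller learns when the answer is a stale copy.

```python
import time
from typing import Callable, Dict, Tuple


class StaleTolerantCache:
    """Cache of Service B's responses, kept by Service A (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, is_stale)."""
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0], False      # fresh hit, B is not called
        try:
            value = self._fetch(key)     # remote call to B
        except Exception:
            if cached is not None:
                return cached[0], True   # B is down: serve the stale copy
            raise                        # nothing cached: the failure still propagates
        self._entries[key] = (value, now)
        return value, False


# Simulated Service B and a controllable clock (assumptions for the demo).
state = {"up": True, "now": 0.0}


def call_service_b(key: str) -> str:
    if not state["up"]:
        raise ConnectionError("Service B is unavailable")
    return f"price-of-{key}"


cache = StaleTolerantCache(call_service_b, ttl_seconds=30.0, clock=lambda: state["now"])
first = cache.get("sku-1")    # fetched from B
state["up"] = False
state["now"] = 60.0           # the entry is older than the TTL now
second = cache.get("sku-1")   # B is down, the stale copy is served
print(first, second)
```

The cache only hides the dependency for keys that were requested before. A key that was never cached still fails when B is down, so the coupling is reduced, not removed.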
Caches to the rescue?
Do you really think that copying stale data all over your system is a suitable measure to fix an inherently broken design?
We have to re-learn design for distributed systems!
A works-out-of-the-box-in-all-contexts, just-add-water-and-stir, three-bullet-point panacea for designing perfect bulkheads
You need lots of those …
... maybe some of those
Then it is a lot of hard work …
... and there is no silver bullet
Yet, a few guiding thoughts about bulkhead design …
Foundations of design
• High cohesion, low coupling
• Separation of concerns
• Crucial across process boundaries
• Still poorly understood issue
• Start with
  • Understanding organizational boundaries
  • Understanding use cases and flows
  • Identifying functional domains (→ DDD)
  • Finding areas that change independently
• Do not start with a data model!
Short activation paths
• Long activation paths affect availability
• Increase latency and likelihood of failures
• Minimize remote calls per request
• Need to balance opposing forces
  • Avoid monolith → clear separation of concerns
  • Minimize requests → cluster functionality & data
• Caches sometimes help, but stale data as trade-off
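A small worked calculation may help to see why long activation paths hurt. It assumes that every service in the path is required for the answer and that the services fail independently of each other, so the availabilities multiply. The 99.9% figure per service is an arbitrary example value, not a measurement.

```python
def path_availability(per_service: float, services_in_path: int) -> float:
    """Availability of a request that needs every service in its activation path.

    Assumes independent failures and that each call is mandatory.
    """
    return per_service ** services_in_path


for n in (1, 5, 10, 20):
    print(n, round(path_availability(0.999, n), 4))
```

Under these assumptions a path of 20 services, each available 99.9% of the time, answers only about 98% of the requests. Real failures are often correlated, so this is a rough guide, not a prediction.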
Dismiss reusability
• Reusability increases coupling
• Reusability leads to bad service design
• Reusability compromises availability
• Reusability rarely pays
• Do not strive for reuse
• Strive for replaceability instead
Broadening the options ...
[Figure: resilience pattern map with Isolation, Bulkhead and Communication paradigm highlighted]
Communication paradigm
• Request-response <-> messaging <-> events <-> …
• Heavily influences resilience patterns to be used
• Also heavily influences functional bulkhead design
• Very fundamental decision which is often underestimated
[Figure: Request/Response: horizontal slicing of a flow/process across µS]
[Figure: Event-driven: vertical slicing of a flow/process across µS]
Synchronous R/R vs. asynchronous events
• Decomposition
  • Vertically divide-and-conquer vs. horizontally go-with-the-flow
• Coordination
  • Coordination logic/services and orchestration vs. event chains and choreography
• Transactions
  • Built-in transaction handling vs. external supervision
• Error handling
  • Built into service vs. escalation/supervision strategy
• Separation of concerns
  • Multiple responsibilities service vs. single responsibility services
• Encapsulation
  • Domain logic distributed across services vs. domain logic in one place
  • Reusability vs. Replaceability
• Complexity
  • A draw …
The communication paradigm influences the functional service design a lot and also the resilience patterns to be used
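To make the contrast a bit more tangible, here is a toy sketch of both styles for a tiny order flow. It is not taken from the talk: the `EventBus` class, the service functions and the "items shipped" event are assumptions, and the in-process, synchronous bus only stands in for real asynchronous messaging.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List


# --- Request/response: a coordinator calls each service and waits for the answer ---
def authorize_payment(order: Dict) -> bool:
    return order["amount"] <= order["credit_limit"]


def ship_items(order: Dict) -> str:
    return f"shipment for {order['id']}"


def fulfill_order(order: Dict) -> str:
    """Orchestration: the flow logic lives in the coordinating service."""
    if not authorize_payment(order):
        return "payment failed"
    return ship_items(order)


# --- Events: services subscribe to events and emit new ones (choreography) ---
class EventBus:
    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[Dict], None]]] = defaultdict(list)
        self.log: List[str] = []

    def subscribe(self, event: str, handler: Callable[[Dict], None]) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, order: Dict) -> None:
        self.log.append(event)
        for handler in list(self._handlers[event]):
            handler(order)


bus = EventBus()


def payment_service(order: Dict) -> None:
    ok = order["amount"] <= order["credit_limit"]
    bus.publish("payment authorized" if ok else "payment failed", order)


def warehouse_service(order: Dict) -> None:
    bus.publish("items shipped", order)  # hypothetical event name


bus.subscribe("order confirmed", payment_service)
bus.subscribe("payment authorized", warehouse_service)

order = {"id": "o-1", "amount": 50, "credit_limit": 100}
print(fulfill_order(order))
bus.publish("order confirmed", order)
print(bus.log)
```

In the first variant a new requirement usually means touching the coordinator as well. In the second one a new service can subscribe to an existing event, but nobody owns the overall flow unless some form of supervision is added.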
Example: order fulfillment
• Simple order, credit card, non-digital items
• Add coupons incl. validation
• Add promotions incl. usage notification
• Add bonus card incl. purchase notification
• Customer accounts as payment type
• PayPal as payment type
• Integrate digital music library
• Integrate digital video library
• Integrate e-book library
Design exercise – Part 1
Create a bulkhead design for the case study
• Use one communication paradigm
  • Synchronous request/response (e.g., REST)
  • Asynchronous messaging (e.g., Akka)
  • Asynchronous events (e.g., Pub/Sub)
• Assume incremental requirements
  • How many services do you need to touch?
  • What about the functional isolation of the services?
  • How big/maintainable are the resulting services?
• Take a few notes
[Figure: starting point of the case study. Customer pressed “Buy now”. Labels: Online Shop, Checkout, Credit Card Provider, Warehouse System, Coupon Management, Campaign Management, PayPal, Loyalty Management, Accounts Receivables, Music Library, E-Book Library, Video Library, E-Mail Server, and a “?” in between]
[Figure: bulkhead design with coordinating services. Legend: <Foreign Service>, <Own Service>. Labels: Order Fulfillment Service, Online Shop, Payment Service, Credit Card Provider, Shipment Service, Warehouse System, Coupon Management, Promotion, Campaign Management, Loyalty, Account Service, Payment Provider, PayPal, Loyalty Management, Accounts Receivables, Music Library, E-Book Library, Video Library, E-Mail Server. Call labels: Coupon, Credit Card, Coordinate, Warehouse, Coordinate, Assets, Notify Cust., PayPal, Coordinate]
[Figure: event-driven bulkhead design. Legend: <Foreign Service>, <Own Service>, <Event>. Labels: Online Shop, Credit Card Provider, Warehouse System, Coupon Management, Campaign Management, Account service, Credit Card Service, Loyalty Management, Accounts Receivables, Music Library, E-Book Library, Video Library, E-Mail Server, PayPal, PayPal Service, Warehouse Service, Promotion Service, Bonus Card Service, Coupon Service, Music Library Service, Video Library Service, E-Book Library Service, Notification Service. Events: Order confirmed, Payment authorized, Digital asset provisioned, Payment failed]
Order fulfillment supervisor
• Track flow of events
• Reschedule events in case of failure
Services are responsible for eventually succeeding or failing for good, usually incorporating a supervision/escalation hierarchy for that
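One possible reading of "track flow of events, reschedule events in case of failure" is sketched below. It is an assumption, not the design from the talk: the supervisor remembers every confirmed order, expects a "payment authorized" event before a deadline, re-publishes "order confirmed" if the deadline passes, and escalates after a fixed number of attempts. The class interface, the timeout and the attempt limit are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class OrderFulfillmentSupervisor:
    """Tracks pending orders and re-publishes events that got no follow-up."""

    def __init__(self, republish: Callable[[str, str], None],
                 timeout: float, max_attempts: int = 3) -> None:
        self._republish = republish
        self._timeout = timeout
        self._max_attempts = max_attempts
        self._pending: Dict[str, Tuple[float, int]] = {}  # order_id -> (deadline, attempts)
        self.escalated: List[str] = []

    def order_confirmed(self, order_id: str, now: float) -> None:
        self._pending[order_id] = (now + self._timeout, 1)

    def payment_authorized(self, order_id: str) -> None:
        self._pending.pop(order_id, None)  # the flow progressed, stop tracking

    def check(self, now: float) -> None:
        """Called periodically; reschedules or escalates overdue orders."""
        for order_id, (deadline, attempts) in list(self._pending.items()):
            if now < deadline:
                continue
            if attempts >= self._max_attempts:
                del self._pending[order_id]
                self.escalated.append(order_id)  # fail for good, hand over to a human or parent
            else:
                self._pending[order_id] = (now + self._timeout, attempts + 1)
                self._republish("order confirmed", order_id)


republished: List[Tuple[str, str]] = []
supervisor = OrderFulfillmentSupervisor(
    lambda event, order_id: republished.append((event, order_id)),
    timeout=10.0, max_attempts=3)

supervisor.order_confirmed("o-1", now=0.0)
supervisor.order_confirmed("o-2", now=0.0)
supervisor.payment_authorized("o-1")  # o-1 progressed in time
supervisor.check(now=10.0)            # o-2 overdue: second attempt
supervisor.check(now=20.0)            # o-2 overdue again: third attempt
supervisor.check(now=30.0)            # attempts exhausted: escalate
print(republished, supervisor.escalated)
```

Re-publishing an event means a service may see it more than once, so this approach relies on the services handling events idempotently.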
Do not limit your design options upfront without an important reason
Wrap-up
• Today’s systems are distributed
  • Failures are neither avoidable nor predictable
  • Resilient software design needed
• Bulkhead design is
  • crucial for application robustness
  • poorly understood
  • massively underrated
  • different from traditional design best practices
• Communication paradigms broaden your bulkhead design options
We have to re-learn design for distributed systems