Best Practices in building scalable cloud-ready Service based systems
Discussion
Igor Moochnick, IgorShare ([email protected])
Blog: www.igorshare.com/blog
Is this system scalable?
What is scalability?
• Increase resources -> performance increases proportionally to the amount of added resources
• Increase performance -> more units of work processed
[Diagram: Scale Up vs. Scale Out (axes: Volume vs. # Machines). Scale Up: handle more volume with one bigger machine ($500 -> $1,000 -> $10,000). Scale Out: handle more volume by adding more $500 machines behind DNS/WWW load balancing.]
Is this system a scalable MADNESS?
Here is the gun. Go kill yourself!
Some Useful Definitions

Consistency Levels

Consistency Level | Changes are Visible | Example
Strong            | Now                 | Missile Launch
Eventual          | In the Future       | Address Change
Optimistic        | Maybe in the Future | Stock Ticker

Message Assurances

Assurance     | Message Delivery       | Example
Exactly Once  | No loss, no duplicates | Bank Transfer
At Least Once | No loss, duplicates    |
At Most Once  | Loss, no duplicates    | Streaming Video
Best Effort   | Loss, duplicates       | Stock Ticker
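A common way to bridge these assurance levels (not from the talk, but a standard technique): run the transport at-least-once and make the consumer idempotent by deduplicating on a message id, which yields effectively-exactly-once processing. A minimal sketch, with illustrative names:

```python
# Hypothetical sketch: an at-least-once channel plus an idempotent
# consumer gives effectively-exactly-once processing.

class IdempotentConsumer:
    def __init__(self):
        self.seen_ids = set()   # ids of messages already processed
        self.balance = 0        # example state: a bank account

    def handle(self, msg_id, amount):
        """Apply a transfer once, even if the message is redelivered."""
        if msg_id in self.seen_ids:
            return False        # duplicate delivery: ignore
        self.seen_ids.add(msg_id)
        self.balance += amount
        return True

consumer = IdempotentConsumer()
# The transport redelivers "tx-1" (the at-least-once assurance allows this).
for msg_id, amount in [("tx-1", 100), ("tx-2", 50), ("tx-1", 100)]:
    consumer.handle(msg_id, amount)

print(consumer.balance)  # 150, not 250: the duplicate was dropped
```

In a real system the `seen_ids` set would itself need to be durable and bounded (e.g. expired after a retention window).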
Where did you start? Where did you end up?

Developer's Experience:
  Shared State            -> Partitioned, Replicated State
  ACID Transactions       -> Eventual Consistency
  Exactly Once Messaging  -> Best Effort Messaging

The Infrastructure:
  Machine Loss is a Catastrophe -> Machine Loss is Business As Usual
  Keep Processes Running        -> Recovery-Oriented Computing
The law
• The least scalable component of your system becomes the bottleneck for the whole system
Recipe ingredients (Amazon guidelines)
• Autonomy
• Asynchrony
• Controlled concurrency
• Controlled parallelism
• Decentralize
• Decompose into small well-understood building blocks
• Failure tolerant
• Local responsibility
• Recovery built-in
• Simplicity
• Symmetry
Key principles
• Things fail all the time!
• Machines
  – Disposable
  – Nameless
  – Self-assembled
• State management
  – Caching
  – Loose consistency
  – Relaxed isolation
• Redundancy
• Partitioning
• Loosely coupled messaging
  – Best effort
  – Message loss
  – Retries
• Self monitors
• Self heals
• Designed to expect failures
• Continues to work seamlessly during the failure
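"Retries" and "designed to expect failures" usually imply bounded retries with backoff so a transient failure is absorbed without hammering the failing dependency. A minimal sketch, with illustrative names (the talk names the principle, not this implementation):

```python
import random
import time

def call_with_retries(op, max_attempts=4, base_delay=0.01):
    """Retry a failing operation with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise               # give up: the caller handles the failure
            delay = min(1.0, base_delay * (2 ** attempt))
            time.sleep(delay * random.random())  # jitter avoids thundering herds

calls = {"n": 0}
def flaky():
    """Simulated dependency that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_retries(flaky))  # "ok" after two transient failures
```

The jitter matters when many nodes retry at once: without it, synchronized retries re-create the very load spike that caused the failure.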
Application Development Patterns

• Architecture
  – Choose a high-level framework
  – Keep service and hosting code separate
  – Partition
• Design
  – Use loose coupling
  – Use caches and stale data
  – Have just a few simple recovery paths
  – Be topology-independent
  – Be hardware-independent
Challenges Of Scalability
• How do I ensure incoming requests are processed at the right location?
  – Partition on service-specific input
  – Dynamically route to the correct node
  – Fail over seamlessly
• How do I manage state inside my service?
  – Take a hard look at consistency requirements
  – Aggressively cache and use transient data
  – Partition the storage tier
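One concrete way (an assumption of mine, not prescribed by the talk) to partition on service-specific input with seamless failover is hash-based routing: hash the partition key to pick a node, and walk to the next node deterministically when the primary is down. A sketch with illustrative node names:

```python
import hashlib

# Hypothetical partition router: hash a service-specific key to pick a
# node; fail over deterministically to the next live node.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def route(key, down=frozenset()):
    """Return the node responsible for `key`, skipping failed nodes."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    for i in range(len(NODES)):
        node = NODES[(h + i) % len(NODES)]
        if node not in down:
            return node
    raise RuntimeError("no nodes available")

primary = route("customer-42")
backup = route("customer-42", down={primary})
print(primary, backup)   # the same key always maps to the same node pair
```

Every router in the fleet computes the same answer from the same inputs, so no central routing table is needed; production systems typically use consistent hashing so that adding a node remaps only a fraction of the keys.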
ACID vs. BASE
• ACID
  – Atomic
  – Consistent
  – Isolated
  – Durable
• Modern BASE-based systems
  – Basically Available
  – Soft-state (or scalable)
  – Eventually consistent
What is the problem?
• Only two of the three:
  – Strong Consistency
    • All clients see the same view during updates
  – High Availability
    • Some data replica is always available despite failures
  – Partition Tolerance
    • All the properties hold even if the network partitions
Techniques
• Expiration-based caching: AP
• Quorum / majority algorithms: CP
• Two-phase commit: CA
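Expiration-based caching lands in the AP corner because it stays available by serving possibly stale data until a TTL forces a re-read. A minimal sketch of the technique (class and field names are mine, not from the talk):

```python
import time

class TTLCache:
    """Expiration-based cache: availability over consistency (AP).
    Reads may return stale data, but never block on the source of truth."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}                      # key -> (value, expiry time)

    def put(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.monotonic() > expiry:
            del self.store[key]              # expired: force a re-read
            return None
        return value                         # may be stale within the TTL

cache = TTLCache(ttl_seconds=0.05)
cache.put("price", 100)
print(cache.get("price"))   # 100 (fresh)
time.sleep(0.06)
print(cache.get("price"))   # None (expired, must re-fetch)
```

The TTL is the knob for "loose consistency" mentioned earlier: a longer TTL means fewer reads hit the backing store, at the cost of staler data.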
Scaling data in 3 steps
• Partitioning
• Routing
• State management
Solving the data congestion
• Throttling (especially on startup after a failure)
• Denormalization
• Scale vs. Performance
• Fault tolerance and recoverability
• Geo-distribution
• Content distribution providers (like Akamai)
Fault tolerance
• Throttling incoming traffic
• Limit retries
• Server failover
• Data center failover
• Consider using queues
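Throttling incoming traffic is often implemented as a token bucket: admit requests while tokens remain, shed the rest, and refill over time. A sketch of that technique (the talk names throttling but not this implementation; all names are illustrative):

```python
class TokenBucket:
    """Token-bucket throttle: allows short bursts up to `capacity`,
    then sheds load instead of overwhelming the backend."""
    def __init__(self, capacity, refill_per_tick):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_tick

    def tick(self):
        """Called periodically (e.g. once per second) to add tokens back."""
        self.tokens = min(self.capacity, self.tokens + self.refill)

    def allow(self):
        """Admit one request if a token is available; otherwise reject."""
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_tick=1)
results = [bucket.allow() for _ in range(5)]   # a burst of 5 requests
print(results)   # [True, True, True, False, False]
bucket.tick()
print(bucket.allow())   # True again after refill
```

This is also the natural place to implement the "throttling on startup after a failure" point: start the bucket small and grow `capacity` as the recovering node warms up.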
Monitoring
• Monitor data about what the user sees – this is what matters most
• Make sure not to overdo it – excessive monitoring kills the components you rely on
• Be frugal
  – Build in counters and monitor the trends – they can help you predict spikes and allocate extra resources on demand
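The "frugal built-in counters" idea can be as small as a counter that keeps a short window of per-interval totals, enough to notice a spike against the recent trend. A hypothetical sketch (names and the spike rule are my assumptions):

```python
from collections import deque

class TrendCounter:
    """Cheap built-in counter: keeps a short window of per-interval
    counts so a traffic spike can be detected against the trend."""
    def __init__(self, window=5):
        self.current = 0
        self.history = deque(maxlen=window)  # oldest intervals fall off

    def hit(self):
        self.current += 1

    def roll_interval(self):
        """Close the current interval (e.g. called once a minute)."""
        self.history.append(self.current)
        self.current = 0

    def is_spiking(self, factor=2.0):
        """True if the latest interval exceeds `factor` times the average."""
        if len(self.history) < 2:
            return False
        *past, latest = self.history
        avg = sum(past) / len(past)
        return latest > factor * max(avg, 1)

counter = TrendCounter()
for count in [10, 11, 9, 10, 40]:       # requests per interval
    for _ in range(count):
        counter.hit()
    counter.roll_interval()

print(counter.is_spiking())   # True: the last interval is far above trend
```

The cost is a handful of integers per counter, which is what makes it frugal enough to build into every component.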
Monitoring
• Availability
• Performance
• Alerts
• Auto throttling
• Capacity thresholds
• Load
• Transactions
• Should measure realistic/relevant actions and behavior!
Diagnosing & Logging
• Non-blocking
• Asynchronous
• Size can become too big (there is "too much of a good thing")
  – Have control over "what" and "how much"
• Performance hit ("do no harm")
  – Should not become a bottleneck
• Be careful what you log
  – Horizontally
  – Vertically
• Should be able to replay logs and correlate the requests
  – <time><correlate-id><node-id><action><data><result>
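A non-blocking, asynchronous logger typically means the request path only enqueues a record and a background thread does the formatting and writing. The sketch below follows the slide's <time><correlate-id><node-id><action><data><result> record layout; everything else (function names, the in-memory sink) is illustrative:

```python
import queue
import threading
import time

log_queue = queue.Ueue() if False else queue.Queue()  # FIFO handoff queue
log_lines = []                                        # stand-in for a log file

def writer():
    """Background thread: drains the queue and writes formatted records."""
    while True:
        record = log_queue.get()
        if record is None:
            break                      # shutdown sentinel
        log_lines.append("|".join(str(field) for field in record))

def log(correlate_id, node_id, action, data, result):
    """Request-path call: never blocks, just enqueues the record."""
    log_queue.put((time.time(), correlate_id, node_id, action, data, result))

t = threading.Thread(target=writer)
t.start()
log("req-7", "node-a", "charge", "amount=10", "ok")
log("req-7", "node-b", "ship", "sku=42", "ok")
log_queue.put(None)
t.join()

# Replay/correlate: pull every step of request "req-7", in order.
print([line for line in log_lines if "|req-7|" in line])
```

Because every record carries the correlate-id, grepping one id reconstructs a single request's path across nodes, which is exactly what makes the logs replayable. A bounded queue (dropping records under pressure) would enforce the "do no harm" rule above.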
Troubleshooting the distributed systems

• Decoupling
• Role isolation
  – Allows separating the functionality from the rest of the system
• Single box
  – Allows running everything from a single box
• Have stubs and simulators
• Be able to "replay" the logs
Deployments
• Deployment packaging
• Rolling out gradually or atomically
• Automatic deployments
• Staging environment
• Building confidence with real customer data
• Rolling back
• Security trumps features
• Load balancing
• Consider linear scale
• Keep IT in mind
• Upgradability
Deployment
• It’s hard
• It’s hard to get right
• Automate everything – it makes the process repeatable
• Version forward/backward compatibility
• Rolling upgrade and rollback
• Be nice to your friends
• Know and manage your environments
• Compensate for gradual system recovery
• Clean the queues
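Forward/backward version compatibility during a rolling upgrade usually comes down to two rules: ignore unknown fields (forward compatible) and supply defaults for missing ones (backward compatible). A sketch under those assumptions, with illustrative field names:

```python
# Hypothetical versioned-message reader for a rolling upgrade:
# v1 messages lack the "currency" field; v3 messages add fields we
# don't know about yet. Both must be readable by a v2 node.

V2_DEFAULTS = {"amount": 0, "currency": "USD"}   # "currency" is new in v2

def read_order(msg):
    """Accept both older (v1) and newer (v3) messages."""
    order = dict(V2_DEFAULTS)                    # defaults cover old senders
    order.update({k: v for k, v in msg.items() if k in V2_DEFAULTS})
    return order     # unknown future fields are ignored, not rejected

old_msg = {"amount": 10}                                   # from a v1 node
new_msg = {"amount": 10, "currency": "EUR", "memo": "hi"}  # from a v3 node

print(read_order(old_msg))  # {'amount': 10, 'currency': 'USD'}
print(read_order(new_msg))  # {'amount': 10, 'currency': 'EUR'}
```

With this discipline, old and new nodes can coexist mid-rollout and a rollback never strands messages written by the newer version.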
Resources
• Availability & Consistency, a presentation by Amazon CTO Dr. Werner Vogels: http://www.infoq.com/presentations/availability-consistency
• Microsoft PDC’08 presentations: https://sessions.microsoftpdc.com/timeline.aspx
Q&A
Thank you!