
The Case for Drill-Ready Cloud Computing Vision Paper Tanakorn Leesatapornwongsa and Haryadi S....

Date post: 14-Dec-2015
Transcript
  • Slide 1

The Case for Drill-Ready Cloud Computing
Vision Paper
Tanakorn Leesatapornwongsa and Haryadi S. Gunawi

  • Slide 2

Cloud Services
- Cheap
- Convenient
- Reliable

  • Slide 3

Yahoo Mail Disruption
- Hardware failures
- Wrong failover
- Disruptions
- Some users could not access the service
- Some users saw wrong notifications
- Several days to recover

  • Slide 4

Outlook Disruption
- Hardware failures
- Caching server
- Failover to backend servers worked correctly
- Requests flooded the servers
- Service went down
- Microsoft needed to change its software infrastructure

  • Slide 5

Cloud Outages

Outage        Root event         Supposedly tolerable failure  Incorrect recovery       Major outage
Amazon EBS    Network misconfig  Network partition             Re-mirroring storm       Clusters collapsed
Gmail         Upgrade event      Servers offline               Bad request routing      All routing servers down
App Engine    Power failure      25% machines offline          Bad failover             All user apps were degraded
Skype         Overload           30% nodes failed              Positive feedback loop   Almost all nodes failed
Google Drive  Network bug        Network offline               Timeout during failover  33% requests affected
Outlook       Caching failure    Failover to backend           Request flooding         7-hour outage
Yahoo Mail    Hardware failures  Servers offline               Buggy failover           1% of users affected

  • Slide 6

Journey of Cloud Dependability Research

  • Slide 7

Fault-Tolerant Systems
- Complex failures
- Hard to handle and implement correctly
- Recovery protocols are very complex
- Recovery code is one of the buggiest parts

  • Slide 8

Offline Testing
- Thoroughly verify the recovery mechanism

  • Slide 9

Offline Testing
- Thoroughly verify the recovery mechanism
- Fault injection, model checking, stress testing, etc.
- Mini cluster that represents production runs
- Testing and production environments differ
- Cluster, workload, failure
- Diagram: mini cluster with test workload vs. production run with real workload

  • Slide 10

Offline Testing
- Thoroughly verify the recovery mechanism
- Fault injection, model checking, stress testing, etc.
- Mini cluster that represents production runs
- Testing and production environments differ
- Cluster, workload, failure
- Orders of magnitude different in scale
- Facebook used 100 machines to mimic a 3000-machine production run [2011]
- Small start-ups forgo the luxury
- Many tests are much smaller than this

  • Slide 11

Diagnosis
- Helps administrators pinpoint and reproduce the causes of outages
- BUT it is post-mortem: it does not prevent disruptions
- Passive approach: waits for outages to happen before diagnosing them

  • Slide 12

Online Testing and Failure Drills
- Diagram: customers send requests; administrators run tests
- Inject failures online
- Users outnumber testers
- Real, deep scenarios

  • Slide 13

A Missing Piece
- Employee: "Boss, let's inject failures online using Chaos Monkey."
- Boss: "Hmm..."
- "Dear beloved customers, thank you for trusting our services, but we accidentally lost your data because of the failure drills that we run..."
  • Slide 14

Future of Failure Drill
- Current drill: a team of engineers standing by
- Drill-ready clouds

  • Slide 15

Drill-Ready Cloud Computing
- Automatic failure drill and automatic cancellation
- Safe, efficient, easy manner
- Ideally, no engineering effort required

  • Slide 16

Drill-Ready Cloud Computing
- Diagram: the administrator hands a drill spec to the drill-ready system, which enters drill mode
- Drill spec: Kill 25%. If it disrupts, revert back.
- Drill-ready cloud computing: systems take care of failure injection and cancellation

  • Slide 17

Outline
- Safety
- Efficiency
- Usability
- Generality

  • Slide 18

Safety
- Learn about failure implications without suffering through them
- Learn whether data can be lost, but do not lose the data
- Learn whether the SLA can be violated, but do not violate it for a long time

  • Slide 19

Safety Solutions
- Normal and drill states
- Not drill aware

  • Slide 20

Safety Solutions
- Normal and drill states
- Normal topology and drill topology
- Maintaining 2 states
- Revert back to the normal state easily
- Normal and drill states: the first and most important thing for drill-ready clouds

  • Slide 21

Safety Solutions
- Drill state isolation
- Self cancellation
- Real failures during the drill
- Drill master and drill agents
- The drill master commands the agents
- What if the network partitions?
- Agents are in a limbo state
- Self cancellation when agents cannot contact the master

  • Slide 22

Safety Solutions
- Drill state isolation
- Self cancellation
- Safe drill specification
- Drill specification
- Drill Spec: What failures? How long? Cancellation conditions. Etc.
- Example: Kill 25%. If the SLA is violated, revert back.
- Safe drill specification: check whether the specification can run safely

  • Slide 23

Efficiency
- Failures trigger data migration
- Monetary cost
- Bandwidth
- Storage space
- System performance
- Affects users

  • Slide 24

Efficiency Solutions
- Low-overhead drill setup and cleanup
- Do we need to do a real key re-balance?
- Depends on the objective of the test
- Diagram key ranges: [11-20] [21-30] [1-10] [31-40] [41-50] [51-60] [41-45] [46-50] [11-15] [16-20]
- Yes, if we want to see the impact of background re-balancing
- Read / Write data
- SLA okay?

  • Slide 25

Efficiency Solutions
- Low-overhead drill setup and cleanup
- Do we need to do a real key re-balance?
- Depends on the objective of the test
- Diagram key ranges: [16-30] [31-45] [1-15] [46-60]
- No, if we want to measure performance when we lose 2 nodes
- Read / Write [46-60]
- SLA okay?
- No key [11]
- Low-overhead setup and cleanup: the cost depends on the drill objectives, and drill objectives must be part of drill specifications

  • Slide 26

Efficiency Solutions
- Low-overhead drill setup and cleanup
- Cheap drill specification
- Smarter and cheaper drill specification
- If replication is 50% correct, assume that the rest is correct
- Stop halfway and report success
- Replicating progress status

  • Slide 27

Usability Solutions
- Declarative drill specification language
- Need a declarative language
- Describe results
- Easy to read and write
- Drill Specification:
    During peak load
    Kill 5% machines
    If SLA violated > 1 min
      Cancel the drill
    If recovery is 50% good
      Stop the drill
      Report success

  • Slide 28

Generality Solutions
- Elasticity drill
- Configuration change drill
- Software upgrade drill
- Security attack drill

  • Slide 29

Conclusion
- Drill-ready cloud computing
- New reliability paradigm
- Sketching a first draft
- We want your FEEDBACK

  • Slide 30

Thank You
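The drill specification on slide 27, together with the self-cancellation rule on slide 21, can be read as a small decision procedure. The Python sketch below is only an illustration under stated assumptions: the slides do not define a concrete language or API, so every name here (DrillSpec, DrillStatus, Decision, decide, validate) is hypothetical, and the validity check covers only basic value ranges, not the full safety check the slides call for.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    """Hypothetical outcomes of one drill monitoring step."""
    CONTINUE = "continue"
    CANCEL = "cancel"              # revert to the normal state
    STOP_SUCCESS = "stop_success"  # stop early and report success


@dataclass(frozen=True)
class DrillSpec:
    """Assumed fields mirroring the slide 27 example specification."""
    kill_fraction: float              # 0.05 for "Kill 5% machines"
    max_sla_violation_secs: float     # 60.0 for "SLA violated > 1 min"
    recovery_success_fraction: float  # 0.5 for "recovery is 50% good"

    def validate(self) -> None:
        # Basic range checks only; what makes a spec "safe" in the
        # slides' sense is not specified, so this is an assumption.
        if not 0.0 < self.kill_fraction <= 1.0:
            raise ValueError("kill_fraction must be in (0, 1]")
        if self.max_sla_violation_secs < 0.0:
            raise ValueError("max_sla_violation_secs must be >= 0")
        if not 0.0 < self.recovery_success_fraction <= 1.0:
            raise ValueError("recovery_success_fraction must be in (0, 1]")


@dataclass(frozen=True)
class DrillStatus:
    """Assumed measurements an agent could observe during a drill."""
    sla_violation_secs: float      # how long the SLA has been violated
    recovery_good_fraction: float  # share of recovery verified correct
    master_reachable: bool = True  # False when the agent lost the master


def decide(spec: DrillSpec, status: DrillStatus) -> Decision:
    """Apply the cancellation conditions before the success condition."""
    # Self cancellation: an agent that cannot contact the master cancels.
    if not status.master_reachable:
        return Decision.CANCEL
    # Cancel once the SLA has been violated longer than the spec allows.
    if status.sla_violation_secs > spec.max_sla_violation_secs:
        return Decision.CANCEL
    # Cheap drill: stop early once enough recovery is verified good.
    if status.recovery_good_fraction >= spec.recovery_success_fraction:
        return Decision.STOP_SUCCESS
    return Decision.CONTINUE
```

Checking the cancellation conditions first is a design choice made for this sketch: it favors safety over finishing the drill, so a drill that has both violated the SLA for too long and reached the recovery threshold is cancelled rather than reported as a success.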
http://ucare.cs.uchicago.edu

