
DevSecCon Asia 2017 Sergiu Bodiu: From resilient to antifragile

Date post: 19-Mar-2017
Upload: devseccon-limited
Join the conversation #devseccon From Resilient to Antifragile - Chaos Engineering Primer By @Sergiu_Bodiu @Pivotal Platform Architect Asia Pacific & Japan
Transcript

Join the conversation #devseccon

From Resilient to Antifragile - Chaos Engineering Primer

By @Sergiu_Bodiu, @Pivotal Platform Architect, Asia Pacific & Japan

Singapore Spring User Group

DevOpsDays Singapore Conference

Imagine having an idea in the morning and shipping in the evening.

@Sergiu_Bodiu

A new way to look at organizations

Fragile: At risk of total failure / financial ruin

Resilient: Takes damage, avoids total failure, recovers

Robust: Absorbs uncertainty, repels blows, avoids damage

Antifragile: Responds to stress by mutating, maintains fitness for purpose. Identity change.

Risk management

The new normal:

FROM RESILIENT TO ANTIFRAGILE

Distributed Systems Complexity

Complexity is like addiction… It comes on slowly, forming weak bonds that you can barely feel.

But as it continues, the bonds strengthen quietly until they calcify and become hard to break.

Software is a Single Point of Failure

• Instapaper outage cause & recovery: after 31 hours
• gitlab.com was down for about 18 hours and also lost production data: roughly 5,000 projects, 5,000 comments and 700 new user accounts.

Root Cause Analysis: Component failures such as network, storage, server, hardware, and power failures are anticipated and thus guarded with extra redundancies.

Dejirafication

Alexey Krivitsky https://www.slideshare.net/krivitsky/dejirafication-clean-your-process

Chaos Engineering

Chaos Engineering: the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

http://principlesofchaos.org

Backups

"Backups always succeed. It's the restores that fail.

Test your backups by practicing restores!"

Using Chaos Monkey
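The advice to practice restores can be sketched as a small check: write a backup, restore it into a scratch directory, and compare checksums against the source. This is a minimal sketch using only the Python standard library. The function names (`backup`, `restore_matches`, `checksum_tree`) and the tar-archive format are assumptions made for illustration, not something the talk prescribes.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksum_tree(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source: Path, archive: Path) -> None:
    """Write the contents of source into a gzip-compressed tar archive.

    The archive must live outside the source directory.
    """
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_matches(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return checksum_tree(Path(scratch)) == checksum_tree(source)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        data = Path(work) / "data"
        data.mkdir()
        (data / "users.csv").write_text("id,name\n1,alice\n")
        archive = Path(work) / "backup.tar.gz"
        backup(data, archive)
        print("restore ok:", restore_matches(data, archive))
```

Running a check like this on a schedule turns "the backup job succeeded" into "the backup can actually be restored", which is the claim that matters during an outage.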

Some outages in the Region

SingTel fined a record $6m for Bukit Panjang exchange fire;

Telstra goes down again, people can't drink beer or catch Ubers

Amazon Web Services outage causes Australian website chaos

Netflix Simian Army

The Simian Army is a suite of tools for keeping your cloud operating in top form.

https://github.com/Netflix/SimianArmy

Chaos Monkey

• Active during normal working hours
• Break things in production
• Design better software services
• Embracing failure

http://techblog.netflix.com/2016/10/netflix-chaos-monkey-upgraded.html
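The "active during normal working hours" rule can be illustrated with a few lines of scheduling logic. The sketch below is not Chaos Monkey's implementation or API. The 09:00 to 17:00 window, the function names, and the instance names are all assumptions chosen for the example. It only decides which instance would be terminated; the actual termination call is left out.

```python
import random
from datetime import datetime, time
from typing import Optional, Sequence

# Assumed office hours for this sketch, not a Chaos Monkey default.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def in_working_hours(now: datetime) -> bool:
    """True on Monday to Friday between WORK_START and WORK_END."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def pick_victim(instances: Sequence[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Choose one instance to terminate, or None outside working hours."""
    if not instances or not in_working_hours(now):
        return None
    return rng.choice(list(instances))


if __name__ == "__main__":
    group = ["web-1", "web-2", "web-3"]  # hypothetical instance names
    print(pick_victim(group, datetime.now(), random.Random()))
```

Restricting the experiment to working hours means engineers are at their desks to observe and respond when something breaks.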

Other Monkeys

• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Security Monkey
• Doctor Monkey

Principles of Chaos

1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously

TIP: Intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.

Chaos Engineering Whitepaper 2016

Hypothesize

Build a hypothesis around steady state behavior. Use steady state characterizations that are visible at the boundary of the system and directly capture an interaction between the users and the system.

TIP: Utilisation is virtually useless as a metric!
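One way to make a steady state hypothesis concrete is to compare a boundary metric between a control group and an experiment group. The sketch below assumes request success rate as that metric and a one percentage point tolerance. Both choices, along with the `GroupStats` and `hypothesis_holds` names, are illustrative assumptions rather than anything the talk specifies.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Request counts observed at the system boundary for one group."""
    requests: int
    successes: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.requests if self.requests else 0.0


def hypothesis_holds(control: GroupStats, experiment: GroupStats,
                     tolerance: float = 0.01) -> bool:
    """True if the experiment group's success rate is at most `tolerance`
    below the control group's."""
    return control.success_rate - experiment.success_rate <= tolerance


if __name__ == "__main__":
    control = GroupStats(requests=1000, successes=995)
    experiment = GroupStats(requests=1000, successes=990)
    print("steady state holds:", hypothesis_holds(control, experiment))
```

A metric such as success rate reflects what users experience, which is why it serves better here than CPU or memory utilisation.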

Vary Events

• terminate virtual machine instances
• inject latency into requests between services
• fail requests between services
• fail an internal microservice
• make an entire region unavailable

TIP: Select only a subset of users
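Two of the events above, injected latency and failed requests, combined with the tip about selecting a subset of users, can be sketched as a wrapper around a service call. Everything here is hypothetical: `with_chaos`, `in_experiment`, `InjectedFailure`, and the hash-based user selection are assumptions for illustration, not the API of any tool mentioned in the talk.

```python
import hashlib
import time
from typing import Callable


class InjectedFailure(Exception):
    """Raised in place of a real downstream error."""


def in_experiment(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the experiment."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return digest[0] * 100 // 256 < percent


def with_chaos(call: Callable[[], str], user_id: str, *, percent: int = 5,
               latency_s: float = 0.0, fail: bool = False,
               sleep: Callable[[float], None] = time.sleep) -> str:
    """Run `call`, adding latency or a failure for users in the experiment."""
    if in_experiment(user_id, percent):
        if latency_s > 0:
            sleep(latency_s)  # simulate a slow dependency
        if fail:
            raise InjectedFailure(f"injected failure for {user_id}")
    return call()


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        try:
            print(user, with_chaos(lambda: "ok", user, percent=50, fail=True))
        except InjectedFailure as exc:
            print(user, "->", exc)
```

Hashing the user ID keeps each user consistently inside or outside the experiment, so the blast radius stays bounded and the affected group can be compared with everyone else.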

Experiment

92% of catastrophic system failures were the result of incorrect handling of non-fatal errors. It is simply not possible to fully reproduce the entire architecture and run an end-to-end test.

TIP: Customers don't behave like your JMeter script.

https://www.usenix.org/system/files/conference/osdi14/osdi14paperyuan.pdf

Automate

• Distributed systems change continuously over time.
• Engineers modify the behavior of existing services and add new services.
• Engineers change runtime configuration parameters, and upgrade and patch systems.

TIP: Depending on the context, change the rate of each experiment.

Chaos Engineering Whitepaper 2016

Build a hypothesis around steady state behavior. Vary real-world events. Run experiments in production. Automate experiments to run continuously.

The importance of reliability

Don't trust claims systems make about themselves & their dependencies. Verify by breaking.

Locust demo

Locust is an open-source Python load testing framework.
• Define user behaviour in code
• Can execute end-to-end user test with sessions and cookies.
• Expands to multiple slaves to increase load capacity
• Allows for distributed user paths based on percentages

Gatling is an open-source Scala load testing framework
• High performance
• Ready-to-present HTML reports
• Scenario recorder and developer-friendly DSL
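The idea of distributing user paths by percentage can be shown without any load testing framework. The sketch below deliberately does not use Locust's or Gatling's API. It is a standard-library illustration in which the path names, their percentages, and the `simulate_users` helper are all made up for the example.

```python
import random
from collections import Counter
from typing import Dict

# Hypothetical user paths and their share of traffic, in percent.
USER_PATHS: Dict[str, int] = {
    "browse_catalogue": 70,
    "search": 20,
    "checkout": 10,
}


def simulate_users(paths: Dict[str, int], users: int,
                   rng: random.Random) -> Counter:
    """Assign each simulated user one path, weighted by its percentage."""
    names = list(paths)
    weights = [paths[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=users))


if __name__ == "__main__":
    counts = simulate_users(USER_PATHS, users=1000, rng=random.Random())
    for name, count in counts.most_common():
        print(f"{name}: {count} users")
```

In a real load test each path would issue requests against the system under test; here the paths are only counted so the weighting itself is visible.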

Lessons Learned

• Don’t wait so long to start load testing.
• The conversations drive new requirements.
• Changing architecture at the last minute is extremely dangerous.
• This is incredibly hard under pressure.
• Build relationships with the Networking Team, Database Team, Third Party Partners, Vendors, etc.
• Make everything Asynchronous (Embrace Failure, Background Tasks, Retry, Idempotence)
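Retry and idempotence from the last lesson work as a pair: a retry is only safe if repeating the request cannot apply its effect twice. The sketch below assumes a hypothetical `PaymentService` that stores results by idempotency key and a `retry` helper with exponential backoff. Neither comes from the talk. The service fails its first call on purpose, after recording the charge, to mimic a lost response.

```python
import time
from typing import Callable, Dict


class TransientError(Exception):
    """A failure that is worth retrying, such as a timeout."""


def retry(op: Callable[[], str], attempts: int = 3, base_delay_s: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `op` until it succeeds, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


class PaymentService:
    """Hypothetical service that remembers results by idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, str] = {}
        self.calls = 0

    def charge(self, key: str, amount: int) -> str:
        self.calls += 1
        if key in self.charges:
            # Repeated request: return the earlier result, charge nothing new.
            return self.charges[key]
        self.charges[key] = f"charged {amount}"
        if self.calls == 1:
            # Simulate a response lost after the charge was recorded.
            raise TransientError("response lost")
        return self.charges[key]


if __name__ == "__main__":
    service = PaymentService()
    result = retry(lambda: service.charge("order-42", 100))
    print(result, "| calls:", service.calls, "| charges:", len(service.charges))
```

Without the idempotency key, the retry after the lost response would charge the customer a second time.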

Demo

Chaos Lemur demo

Chaos Lemur is an alternative to Chaos Monkey (which was designed for AWS) that was designed with PCF in mind.

Clean your process

From a lean thinking perspective: managing the inventory is a non-value-adding activity

Culture > Principles > Tools

Testing Pyramid

https://watirmelon.blog/2012/01/31/introducing-the-software-testing-ice-cream-cone/

Principles

Any developer building applications which run as a service. Ops engineers who deploy or manage such applications. https://12factor.net

Anyone working in software that writes tests or maintains continuous integration pipelines. http://www.10factor.ci

Agile Manifesto

The Agile movement is not anti-methodology, in fact, many of us want to restore credibility to the word methodology.

Now, a bigger gathering of organizational anarchists would be hard to find, so what emerged from this meeting was symbolic.

Further Reading

https://www.infoq.com/br/presentations/exercising-failure-at-netflix
https://www.infoq.com/podcasts/failure-as-a-service

Peter Alvaro: Orchestrated Chaos: Applying Failure Testing Research at Scale
Mathias Lafeldt: Writing your first postmortem
Adrian Colyer: Simple Testing Can Prevent Most Critical Failures

Thank You @sergiu_bodiu

Let's Build Something MEANINGFUL

Blueprint for living in a Black Swan world.

Antifragile, and only the antifragile, will make it.

The Three R’s of Enterprise Security: Rotate, Repave, and Repair
