How to address operational aspects effectively with Agile practices - Matthew Skelton - Agile In The...

Post on 15-Jan-2017

How to address operational aspects effectively with Agile practices

Agile in the City – 20th November 2015 #agileinthecity

Matthew Skelton, Skelton Thatcher Consulting, @matthewpskelton

“Operational Features”

how to develop and test

prioritisation techniques

collaboration approaches

availability is the best feature

transforming technology and teams

Cloud, Agile, DevOps

high impact expertise

transaction reporting

credit reference

FOREX

online payments

Operational Features

“the properties of a system which make it work well in Production”

Not PIMP MY RIDE, MORE Greasy Mechanic

Terminology

what happened to NFRs? (non-functional requirements)

Non-Functional Functional

language impact

non-starter

non compos mentis

non-compete

nonsense!

holistic product view

How did we get to this?

admission: IT folk have been guilty of making operational features quite scary & mysterious

long lists of requirements

crazy test plans

poor explanation of needs

failure to engage stakeholders

gold-plating

de-mystify operational features

better approach

pragmatic and effective

rapid, safe, valuable

“the properties of a system which make it work well in Production”

Why value Operational Features?

downtime: $$$

reputation ($$)

non-linear increase in complexity and problems

Internet of Things

we can no longer deal manually with the scale/volume of potential problems

agility and response to incidents

remote car hacking:

security as an operational feature

“We have ‘cloud’ now”

(HA + DR + Backup + Metrics + Diagnostics + …)

think: “when it fails, how will we recover?”

it will fail

How do we develop and test Operational Features?

defined features

testable and measurable

ahead lie the ‘ilities’...

1. What

2. How to test

Operational Hooks

Deployment Pipeline

Configurability

re-read config (SIGHUP)

text files in version control

inject settings – no ‘black boxes’

toggle features via config

“Postcode lookup unavailable”

better UX
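A minimal sketch of the configurability points above: settings live in a text file, the process re-reads them on SIGHUP, and a toggle switches a feature off gracefully. The JSON layout, the `Config` class, and the `postcode_lookup` toggle name are assumptions for illustration, not taken from the talk.

```python
import json
import signal


class Config:
    """Settings loaded from a plain-text (JSON) file kept in version control."""

    def __init__(self, path):
        self.path = path
        self.settings = {}
        self.reload()

    def reload(self, signum=None, frame=None):
        # The (signum, frame) parameters let this method double as a signal handler.
        with open(self.path, encoding="utf-8") as f:
            self.settings = json.load(f)

    def enabled(self, feature):
        # Features that are not listed default to off.
        return bool(self.settings.get("features", {}).get(feature, False))


def postcode_widget(config):
    # Degrade gracefully when the feature is toggled off, rather than erroring.
    if config.enabled("postcode_lookup"):
        return "<postcode lookup form>"
    return "Postcode lookup unavailable"


def install_reload_handler(config):
    # SIGHUP only exists on Unix-like systems, so guard the registration.
    if hasattr(signal, "SIGHUP"):
        signal.signal(signal.SIGHUP, config.reload)
```

With the handler installed, `kill -HUP <pid>` would make the running process pick up an edited file without a restart.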

Deployability

immutable artefacts

concurrent releases (SxS)

symlinks

rapid

scriptable

simple failure modes
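One common way to combine side-by-side releases with symlinks is to keep every release in its own directory and repoint a `current` link. The sketch below assumes a POSIX filesystem and a hypothetical `releases/<version>` layout; the function name is illustrative.

```python
import os


def activate_release(releases_dir, version, current_link):
    """Point `current_link` at releases_dir/version and return the resolved path."""
    target = os.path.join(releases_dir, version)
    if not os.path.isdir(target):
        # Simple failure mode: refuse to switch to a release that is not there.
        raise FileNotFoundError(target)

    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted switch
    os.symlink(target, tmp_link)

    # Renaming the temporary link over the old one swaps releases in one step
    # on POSIX, so readers see either the old release or the new one.
    os.replace(tmp_link, current_link)
    return os.path.realpath(current_link)
```

Rolling back is then the same call with the previous version, because the older release directory is still on disk and unchanged.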

Maintainability

holding page as MVP!

live system component diagrams

modularity

ability to upgrade

version numbering (SemVer?)

BasketItemAdded

grep BasketItem

logging for insights
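The `BasketItemAdded` / `grep BasketItem` pair above suggests named, searchable log events. A minimal sketch of that idea follows, assuming an enumeration of event names; the numeric IDs, the extra event names, and the `log_event` helper are hypothetical.

```python
import enum
import logging


class Event(enum.Enum):
    # Hypothetical IDs: a shared prefix keeps related events grep-able.
    BasketItemAdded = 1001
    BasketItemRemoved = 1002
    BasketCheckedOut = 1003


def log_event(logger, event, **fields):
    """Log one named event with key=value details and return the message."""
    details = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    message = f"{event.name} id={event.value} {details}".rstrip()
    logger.info(message)
    return message
```

A call such as `log_event(logging.getLogger("basket"), Event.BasketItemAdded, sku="A1", qty=2)` yields a line that `grep BasketItem` would find alongside the other basket events.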

Testability

every component has a /health endpoint

stubbed/mocked/faked endpoints

test things individually
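A `/health` endpoint can be very small. The sketch below uses only the Python standard library; the JSON body (`{"status": "ok"}`) and the helper names are assumptions, since the talk does not specify a response format.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example quiet; a real component would log requests


def start_server(port=0):
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def check_health(host, port, timeout=2.0):
    """Return (HTTP status, decoded JSON body) for the component's /health."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/health")
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()
```

Because each component answers for itself, a pipeline stage can call `check_health` against one component at a time, with its neighbours stubbed or faked.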

Recoverability

asynchronous service start

expect services to be erroring

logs are not wiped (rotated: okay)

avoid flooding logs

no nasty zombies after failures

MTTR is more important than MTBF (for most kinds of F)
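One way to see why recovery time matters is the standard steady-state availability formula, MTBF / (MTBF + MTTR). The figures below are made up for illustration and are not from the talk.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures only: a failure every 100 hours, one hour to recover.
baseline = availability(100.0, 1.0)

# Option A: fail half as often. Option B: recover twice as fast.
doubled_mtbf = availability(200.0, 1.0)
halved_mttr = availability(100.0, 0.5)

print(f"baseline:     {baseline:.4%}")
print(f"doubled MTBF: {doubled_mtbf:.4%}")
print(f"halved MTTR:  {halved_mttr:.4%}")
```

Under this formula both options give the same availability, because only the ratio of the two values matters. Recovering faster is often the more practical of the two to achieve, which fits the recoverability points above.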

Performance

run key 'hotspot' areas early

use a deployment pipeline

‘critical path’

early pipeline tests act as a barometer for later performance problems

derive transit time metrics

Monitorability

stream of metrics

transaction tracing

Resilience

assume missing or failing

Chaos Monkey

don’t crash on HTTP 503

Saboteur + deployment pipeline
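"Assume missing or failing" and "don't crash on HTTP 503" can be sketched as a caller that turns a dependency outage into a degraded answer. In this sketch `fetch` stands for any hypothetical client function that returns address text or raises `urllib.error.HTTPError`; the result shape is an assumption.

```python
import urllib.error


def lookup_postcode(fetch, postcode):
    """Call a dependency through `fetch`, degrading instead of crashing."""
    try:
        return {"ok": True, "address": fetch(postcode)}
    except urllib.error.HTTPError as err:
        if err.code == 503:
            # The dependency is temporarily unavailable: carry on without it.
            return {"ok": False, "message": "Postcode lookup unavailable"}
        raise  # other HTTP errors may be genuine bugs, so let them surface
    except urllib.error.URLError:
        # The dependency is missing entirely (DNS failure, connection refused).
        return {"ok": False, "message": "Postcode lookup unavailable"}
```

A fault-injection run in the pipeline could then assert that the page still renders, with the fallback message, while the dependency is returning 503.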

Scalability

concurrent workers

queues and bottlenecks

throttling is your friend
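The three points above fit together in a bounded queue drained by concurrent workers: when the queue is full, new work is rejected rather than piling up. A minimal sketch follows; the class name and the reject-when-full policy are assumptions, and a real system might instead delay or shed load differently.

```python
import queue
import threading


class ThrottledWorkerPool:
    """Concurrent workers fed by a bounded queue that rejects work when full."""

    def __init__(self, worker_count, max_pending, handler):
        self._jobs = queue.Queue(maxsize=max_pending)
        self._handler = handler
        self._worker_count = worker_count
        self.rejected = 0  # counter worth exposing as a metric

    def submit(self, job):
        # Throttle: refuse immediately instead of blocking or growing unbounded.
        try:
            self._jobs.put_nowait(job)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def start(self):
        for _ in range(self._worker_count):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            job = self._jobs.get()
            try:
                self._handler(job)
            finally:
                self._jobs.task_done()

    def wait(self):
        # Block until every accepted job has been handled.
        self._jobs.join()
```

The queue depth and the `rejected` counter make the bottleneck visible, which is usually more useful than letting latency grow silently.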

Security and ‘securability’

securability by practice

SSL certs & HEARTBLEED

Gauntlt + deployment pipeline

Availability

“available but unusable”

synthetic transactions

special HTTP header: trigger additional metrics/reporting
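A sketch of the special-header idea: a synthetic transaction takes the normal code path, and a marker header switches on extra reporting. The header name `X-Synthetic-Transaction`, the response shape, and the function name are all hypothetical.

```python
import time

SYNTHETIC_HEADER = "X-Synthetic-Transaction"  # hypothetical header name


def handle_request(headers, process):
    """Run `process` and attach diagnostics only for synthetic transactions.

    `headers` is a plain dict here; real HTTP header names are
    case-insensitive, so a web framework's header object would be used instead.
    """
    started = time.perf_counter()
    result = process()  # same code path as a real user request
    response = {"status": 200, "body": result}

    if headers.get(SYNTHETIC_HEADER) == "true":
        response["diagnostics"] = {
            "synthetic": True,
            "elapsed_ms": (time.perf_counter() - started) * 1000.0,
        }
    return response
```

Because the probe exercises the real path, it can catch the “available but unusable” case that a plain ping would miss.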

How the organisation affects Operational Features

Budgets

bonuses:

story points delivered

tickets closed

Capex vs Opex

tax breaks

avoiding the Capex/Opex evil

Developers seen as more valuable than Ops people

3x hiring bonus for Devs (!)

improved awareness in product teams

share ownership and decision making

features

end-user

operational

end-user

single product backlog

Product Owner on call for incidents

tricky!

high degree of maturity

honesty about the product

Product Owner and Tech Lead are both on the hook for outages

15-30% ‘tax’ on product budget for operational aspects

AVOID ‘user features’ always taking precedence over ‘operational features’

How to evaluate Operational Features vs User Features

treat Ops team folk as another user persona

alternatives to User Stories?

NOT:

"as a logging subsystem, I want..."

Metrics

Live: downtime, A/B for operational aspects (speed)

Pre-live: time spent re-deploying

Metrics for better conversations

metric-ify your delivery and test infrastructure

99.99% uptime, but 20 redeployments every time

Heuristics for operational features

30% of total product budget

30% of dev team time

Improving operational awareness

Run Book Collaboration

Run Book
• Detailed description of how the system operates
• Maintenance
• Repair
• Error recovery

Run Book / Ops Manual
• 1 Table of Contents
• 2 System Overview
  • 2.1 Service Overview
  • 2.2 Contributing Applications, Daemons, and Windows Services
  • 2.3 Hours of Operation
  • 2.4 Execution Design
  • 2.5 Infrastructure and Network Design
  • 2.6 Resilience, Fault Tolerance and High-Availability
  • 2.7 Throttling and Partial Shutdown
  • 2.8 Required Resources
  • 2.9 Expected Traffic and Load
    • 2.9.1 Hot or Peak Periods
    • 2.9.2 Warm Periods
    • 2.9.3 Cool or Quiet Periods
  • 2.10 Environmental Differences
  • 2.11 Tools
• 3 Security and Access Control
• 4 System Configuration
  • 4.1 Configuration Management
• 5 System Backup and Restore
  • 5.1 Backup Requirements
    • 5.1.1 Special Files
  • 5.2 Backup Procedures
  • 5.3 Restore Procedures
• 6 Monitoring and Alerting
  • 6.1 Error Messages
  • 6.2 Events
  • 6.3 Health Checks
  • 6.4 Other Messages
• 7 Operational Tasks
  • 7.1 Deployment
  • 7.2 Batch Processing
  • 7.3 Power Procedures
  • 7.4 Routine Checks
    • 7.4.1 System Rebuilds
  • 7.5 Troubleshooting
• 8 Maintenance Tasks
  • 8.1 Maintenance Procedures
    • 8.1.1 Patching
      • 8.1.1.1 Normal Cycle
      • 8.1.1.2 Zero-Day Vulnerabilities
    • 8.1.2 GMT/BST time changes
    • 8.1.3 Cleardown Activities
      • 8.1.3.1 Log Rotation
  • 8.2 Testing
    • 8.2.1 Technical Testing
    • 8.2.2 Post-Deployment
• 9 Failure and Recovery Procedures
  • 9.1 Failover
  • 9.2 Recovery
  • 9.3 Troubleshooting Failover and Recovery
• 10 Contact Details

Run Book / Ops Manual

2.1 Service Overview

2.2 Contributing Applications, Daemons, and Windows Services

2.3 Hours of Operation

2.4 Execution Design

2.5 Infrastructure and Network Design

2.6 Resilience, Fault Tolerance and High-Availability

2.7 Throttling and Partial Shutdown

2.8 Required Resources

2.9 Expected Traffic and Load

Run Book collaboration

Dev team is responsible for the first draft

“But I know nothing about Production!”

Encourages collaboration with Ops team

Will Gray

not documentation

build trust and understanding

automate more over time

http://runbookcollab.info/

choose tools that encourage collaboration

http://rashidkpc.github.io/Kibana/images/screenshots/searchss.png

“How does [the use of] this tool help people to collaborate*?”

* Work together, at the same keyboard/screen

‘How to choose tools for DevOps and Continuous Delivery’

http://bit.ly/ChooseDevOpsTools

test early and often for operational readiness

operational readiness

network testing

security testing

performance testing

auxiliary infrastructure testing: monitoring, log aggregation

small set of rapid ‘weathervane’ tests for early warning

Network testing

iTrinegy network emulators
• Scripted setup and automated test runs
• http://www.itrinegy.com/

Saboteur:
• Network fault injection tool
• https://github.com/tomakehurst/saboteur

Security testing

Gauntlt: http://gauntlt.org/

SSL certs

HTTP

SQL injection

…

    # nmap-simple.attack
    Feature: simple nmap attack to check for open ports

      Background:
        Given "nmap" is installed
        And the following profile:
          | name     | value       |
          | hostname | example.com |

      Scenario: Check standard web ports
        When I launch an "nmap" attack with:
          """
          nmap -F <hostname>
          """
        Then the output should match /80.tcp\s+open/
        Then the output should not match:
          """
          25\/tcp\s+open
          """

        When I launch an "nmap" attack with:
          """
          nmap -F <hostname>
          """
        Then the output should match /80.tcp\s+open/
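The "SSL certs" item in the security testing list could also be covered by a small pipeline check that fails well before a certificate expires. The sketch below assumes the `notAfter` string format returned by Python's `ssl.SSLSocket.getpeercert()`; the function names and the 30-day threshold are illustrative choices, not from the talk.

```python
import ssl


def days_until_expiry(not_after, now_seconds):
    """Days between `now_seconds` (epoch) and a certificate's notAfter time.

    `not_after` looks like 'Jan  5 09:34:43 2018 GMT', the format used in
    the dict returned by ssl.SSLSocket.getpeercert().
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now_seconds) / 86400.0


def certificate_is_safe(not_after, now_seconds, warn_days=30):
    # Fail the check while there is still time to renew calmly.
    return days_until_expiry(not_after, now_seconds) >= warn_days
```

Passing the current time in as a parameter keeps the check deterministic, so the same function can be tested with fixed dates and run against live certificates.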

Deployment pipeline

Make operational testing visible

holistic product view

MVP: ‘service unavailable’ page

test early for operational features

using a deployment pipeline

single product backlog: (user) features + (operational) features

availability is the best feature

further reading

operabilitybook.com

operationalfeatures.com

thank you

http://skeltonthatcher.com/

enquiries@skeltonthatcher.com

@SkeltonThatcher

+44 (0)20 8242 4103