Capacity Management for Web Operations John Allspaw Operations Engineering.

Date post: 29-Mar-2015
Transcript
Page 1: Capacity Management for Web Operations John Allspaw Operations Engineering.

Capacity Management

for Web Operations

John Allspaw
Operations Engineering

Page 2: Capacity Management for Web Operations John Allspaw Operations Engineering.

the book I’m writing

Page 3: Capacity Management for Web Operations John Allspaw Operations Engineering.

???

Page 4: Capacity Management for Web Operations John Allspaw Operations Engineering.

Rules of Thumb

Planning/Forecasting

Stupid Capacity Tricks

(with some Flickr statistics sprinkled in)

Page 5: Capacity Management for Web Operations John Allspaw Operations Engineering.

bugs (disguised as capacity problems)

edge cases (disguised as capacity problems)

security incidents

real capacity problems*

* (should be the last thing you need to worry about)

Things that can cause downtime

Page 6: Capacity Management for Web Operations John Allspaw Operations Engineering.

Capacity != Performance

Forget about performance for right now

Measure what you have right NOW

Don’t count on it getting any better

Page 7: Capacity Management for Web Operations John Allspaw Operations Engineering.

Thank You HPC Industry!

Automated Stuff

Scalable Metric Collection/Display

a lot of great deployment and management tricks come from them, adopted by web ops

Page 8: Capacity Management for Web Operations John Allspaw Operations Engineering.

Good Measurement Tools

record and store
metrics in/out
custom metrics
easily compare
lightweight-ish

Page 9: Capacity Management for Web Operations John Allspaw Operations Engineering.

Clouds need planning too

Makes deployment and procurement easy and quick

But clouds are still resources with costs and limits, just like your own stuff

Black-boxes: you may need to pay even more attention than before

Page 10: Capacity Management for Web Operations John Allspaw Operations Engineering.

Metrics

System Statistics

Page 11: Capacity Management for Web Operations John Allspaw Operations Engineering.

Metrics
“Application” Level

(photos processed per minute)

(average processing time per photo)

(apache requests)

(concurrent busy apache procs)

Page 12: Capacity Management for Web Operations John Allspaw Operations Engineering.

Metrics
App-level meets system-level

here, total CPU = ~1.12 * # busy apache procs (ymmv)

Page 13: Capacity Management for Web Operations John Allspaw Operations Engineering.

2400

photos per minute being uploaded right NOW (Tuesday afternoon)

Page 14: Capacity Management for Web Operations John Allspaw Operations Engineering.

Ceilings

the largest amount of “work” your resources will allow before degradation or failure

Page 15: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forget Benchmarking

Page 16: Capacity Management for Web Operations John Allspaw Operations Engineering.

Find your ceilings

The End

what you have left

Page 17: Capacity Management for Web Operations John Allspaw Operations Engineering.

Use real live production data

to find ceilings

Production: “it’s like a lab, but bigger!”

Page 18: Capacity Management for Web Operations John Allspaw Operations Engineering.

Like: database ceilings

replication lag: bad!

Page 19: Capacity Management for Web Operations John Allspaw Operations Engineering.

Ceilings

waiting on disk too much

sustained disk I/O wait above 40% creates slave lag*

* for us, YMMV
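
A ceiling like this only matters when the condition is sustained, not when a single sample spikes. A minimal sketch of that check follows. The 40% threshold comes from the slide; the run length of 5 consecutive samples and the sample values are assumptions made up for illustration.

```python
def sustained_above(samples, threshold=40.0, min_run=5):
    """Return True if `samples` has at least `min_run` consecutive
    readings strictly above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it on any reading at/below threshold.
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            return True
    return False


# Hypothetical I/O-wait percentages, one reading per polling interval.
brief_spike = [12, 55, 14, 18, 61, 20, 15]
sustained = [22, 41, 44, 47, 52, 49, 45]

print(sustained_above(brief_spike))
print(sustained_above(sustained))
```

How long counts as "sustained" depends on your polling interval and on how quickly replication falls behind on your own hardware.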

Page 20: Capacity Management for Web Operations John Allspaw Operations Engineering.

35,000 photo requests per second on a Tuesday peak

Page 21: Capacity Management for Web Operations John Allspaw Operations Engineering.

Safety Factors

Page 22: Capacity Management for Web Operations John Allspaw Operations Engineering.

Safety Factors

Ceiling * Factor of Safety = UR LIMITZ

Page 23: Capacity Management for Web Operations John Allspaw Operations Engineering.

Safety Factors

webserver!

Page 24: Capacity Management for Web Operations John Allspaw Operations Engineering.

what you have left

“safe” ceiling

@85% CPU

Safety Factors

85% total CPU = ~76 busy apache procs
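
The ~76 figure follows from the earlier relationship (total CPU = ~1.12 * # busy apache procs). A small sketch of the arithmetic is below. Treating 100% CPU as the raw ceiling and 0.85 as the factor of safety is an assumed reading of the slides, and the constant names are invented here.

```python
CPU_PER_BUSY_PROC = 1.12   # percent of total CPU per busy apache proc (slide value, ymmv)
RAW_CPU_CEILING = 100.0    # assumption: the box is out of headroom at 100% CPU
FACTOR_OF_SAFETY = 0.85    # assumption: "safe" ceiling at 85% CPU


def busy_procs_at(cpu_percent, cpu_per_proc=CPU_PER_BUSY_PROC):
    """Invert the linear fit: how many busy procs produce this much CPU."""
    return cpu_percent / cpu_per_proc


# Ceiling * Factor of Safety, expressed in busy apache procs.
ceiling_procs = busy_procs_at(RAW_CPU_CEILING)
safe_procs = ceiling_procs * FACTOR_OF_SAFETY

print(round(safe_procs))  # 76
```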

Page 25: Capacity Management for Web Operations John Allspaw Operations Engineering.

Safety Factors

Yahoo Front Page link to Chinese New Year photos

(photo requests/second)

(8% spike)

Page 26: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forecasting

Page 27: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forecasting

Fictional example: webservers

Page 28: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forecasting

Fictional example: 15 webservers. 1 week.

peak of the week

Page 29: Capacity Management for Web Operations John Allspaw Operations Engineering.

...bigger sample, 6 weeks... isolate the peaks...

Forecasting

Page 30: Capacity Management for Web Operations John Allspaw Operations Engineering.

...“Add a Trendline” with some decent correlation...

Forecasting

not too shabby

now

Page 31: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forecasting

15 servers @76 busy apache proc limit = 1140 total procs

when is this?

this will tell you when it is

ceiling

what you have left

Page 32: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forecasting

(week #10, duh)

(1140-726) / 42.751 = 9.68
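
The same arithmetic as a sketch in Python. It assumes the trendline has the form y = 42.751x + 726, with x in weeks and y in total busy apache procs at the weekly peak. That reading is inferred from the numbers on the slide; the variable names are invented here.

```python
import math

SERVERS = 15
LIMIT_PER_SERVER = 76   # busy apache procs per box (the "safe" ceiling)
SLOPE = 42.751          # procs added per week, from the trendline
INTERCEPT = 726         # trendline value at week 0 (assumed reading of the slide)


def week_limit_reached(limit, slope, intercept):
    """Solve slope * x + intercept = limit for x."""
    if slope <= 0:
        raise ValueError("trend is flat or falling; the limit is never reached")
    return (limit - intercept) / slope


total_limit = SERVERS * LIMIT_PER_SERVER
week = week_limit_reached(total_limit, SLOPE, INTERCEPT)

print(total_limit)       # 1140
print(round(week, 2))    # 9.68
print(math.ceil(week))   # 10, i.e. the limit is crossed during week #10
```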

Page 33: Capacity Management for Web Operations John Allspaw Operations Engineering.

Writing excel macros is boring

All we want is “days remaining”, so all we need is the curve-fit

Forecasting Automation

Use http://fityk.sf.net to automate the curve-fit

Page 34: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forecasting

Fictional example: storage consumption

Page 35: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forecasting Automation

actual flickr storage consumption from early 2005, in GB

(ceiling is fictional)

this will tell you when this is

Page 36: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forecasting Automation

cmd line script

output

[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...

Page 37: Capacity Management for Web Operations John Allspaw Operations Engineering.

Forecasting Automation

(SAME)

fityk gave:

y = 0.786854x^2 + 146.657x + 14147.4

(R^2 = 99.84)

Excel gave:

y = 0.7675x^2 + 146.96x + 14147.3

(R^2 = 99.84)
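
Once the curve-fit is in hand, "when do we hit the ceiling" is just solving the quadratic for x. A sketch using the fityk coefficients follows. The ceiling value is hypothetical (the slide does not give one), and x is in whatever units the data file uses.

```python
import math

# Coefficients from the fityk fit: y = A*x^2 + B*x + C, with y in GB.
A, B, C = 0.786854, 146.657, 14147.4


def storage_gb(x):
    """Fitted storage consumption at x."""
    return A * x * x + B * x + C


def x_at_ceiling(ceiling_gb):
    """Solve A*x^2 + B*x + C = ceiling_gb for the later (larger) root."""
    discriminant = B * B - 4 * A * (C - ceiling_gb)
    if discriminant < 0:
        raise ValueError("the fitted curve never reaches this ceiling")
    return (-B + math.sqrt(discriminant)) / (2 * A)


HYPOTHETICAL_CEILING_GB = 20000.0  # made up for illustration only
x_hit = x_at_ceiling(HYPOTHETICAL_CEILING_GB)

print(f"ceiling reached at x = {x_hit:.1f}")
```

Subtracting the current x from `x_hit` gives the "remaining" figure in the same units as the input data.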

Page 38: Capacity Management for Web Operations John Allspaw Operations Engineering.

Capacity Health

12,629 nagios checks

1314 hosts

6 datacenters

4 photo “farms”

farm = 2 DCs (east/west)

Page 39: Capacity Management for Web Operations John Allspaw Operations Engineering.

High and Low Water Marks

alert if higher

alert if lower

Per server, squid requests per second
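
A sketch of a two-sided check. The thresholds are invented for illustration. Presumably the low mark is there to catch a box that is receiving much less traffic than its peers, which can be just as much a problem as one receiving too much.

```python
def check_water_marks(value, low, high):
    """Classify a reading against low and high water marks."""
    if low > high:
        raise ValueError("low water mark must not exceed high water mark")
    if value > high:
        return "alert: above high water mark"
    if value < low:
        return "alert: below low water mark"
    return "ok"


# Hypothetical per-server squid thresholds, in requests per second.
LOW_REQ_PER_SEC = 100
HIGH_REQ_PER_SEC = 900

for reading in (40, 450, 975):
    print(reading, check_water_marks(reading, LOW_REQ_PER_SEC, HIGH_REQ_PER_SEC))
```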

Page 40: Capacity Management for Web Operations John Allspaw Operations Engineering.

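A good dashboard looks something like...

| type     | #  | limit/box | ceiling units | limit (total) | current (peak) | % peak | Est days left |
|----------|----|-----------|---------------|---------------|----------------|--------|---------------|
| www      | 20 | 80        | busy procs    | 1600          | 1000           | 62.50% | 36            |
| shard db | 20 | 40        | I/O wait      | 800           | 220            | 27.50% | 120           |
| squid    | 18 | 950       | req/sec       | 17,100        | 11,400         | 66.67% | 48            |

(yes, fictional numbers)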
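
The derived columns are simple arithmetic on the per-box limit, the box count, and the current peak. A sketch using the same fictional numbers is below. The class and field names are invented here, and the "Est days left" column is left out because it depends on a growth forecast that the slide does not show.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    boxes: int
    limit_per_box: float
    units: str
    current_peak: float

    @property
    def total_limit(self):
        # Cluster-wide limit: per-box limit times number of boxes.
        return self.boxes * self.limit_per_box

    @property
    def percent_of_limit(self):
        # How much of the cluster-wide limit the current peak uses.
        return 100.0 * self.current_peak / self.total_limit


rows = [
    Resource("www", 20, 80, "busy procs", 1000),
    Resource("shard db", 20, 40, "I/O wait", 220),
    Resource("squid", 18, 950, "req/sec", 11400),
]

for r in rows:
    print(f"{r.name:<9}{r.total_limit:>8,.0f} {r.units:<11}{r.percent_of_limit:6.2f}%")
```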

Page 41: Capacity Management for Web Operations John Allspaw Operations Engineering.

Diagonal Scaling

Image processing machines

Replace Dell PE860s with HP DL140 G3s

vertically scaling your already horizontal nodes

Page 42: Capacity Management for Web Operations John Allspaw Operations Engineering.

Diagonal Scaling
example: image processing

4 cores

8 cores

(about the same CPU “usage” per box)

Page 43: Capacity Management for Web Operations John Allspaw Operations Engineering.

~45 images/min @ peak

~140 images/min @ peak

(same CPU usage, but ~3x more work)

“processing” means making 4 sizes from originals

Diagonal Scaling
example: image processing

throughput

Page 44: Capacity Management for Web Operations John Allspaw Operations Engineering.

Diagonal Scaling
example: image processing

went from:

23 Dell PE860s
1035 photos/min
3008.4 Watts
23U rack

to:

8 HP DL140 G3s
1120 photos/min
(75% faster, even)
1036.8 Watts
8U rack

!!!
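
The before/after totals can be broken down per box and per unit of work. A sketch using only the figures quoted on the slide is below; the dictionary keys and helper names are invented here.

```python
old = {"boxes": 23, "photos_per_min": 1035, "watts": 3008.4, "rack_units": 23}
new = {"boxes": 8, "photos_per_min": 1120, "watts": 1036.8, "rack_units": 8}


def per_box(fleet, key):
    """Average of a fleet-wide total across its boxes."""
    return fleet[key] / fleet["boxes"]


def percent_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


for label, fleet in (("old", old), ("new", new)):
    throughput = per_box(fleet, "photos_per_min")
    watts_per_photo = fleet["watts"] / fleet["photos_per_min"]
    print(f"{label}: {throughput:.0f} photos/min per box, "
          f"{watts_per_photo:.2f} W per photo/min")

print(f"power: {percent_change(old['watts'], new['watts']):.1f}%")
print(f"rack space: {percent_change(old['rack_units'], new['rack_units']):.1f}%")
```

The per-box figures match the ~45 and ~140 images/min quoted on the throughput slide.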

Page 45: Capacity Management for Web Operations John Allspaw Operations Engineering.

3.52

terabytes will be consumed today (on a Tuesday)

Page 46: Capacity Management for Web Operations John Allspaw Operations Engineering.

2nd Order Effects
(beware the wandering bottleneck)

running hot, so add more

Page 47: Capacity Management for Web Operations John Allspaw Operations Engineering.

2nd Order Effects
(beware the wandering bottleneck)

running great now, so more traffic!

now these run hot

Page 48: Capacity Management for Web Operations John Allspaw Operations Engineering.

Stupid Capacity Tricks

Page 49: Capacity Management for Web Operations John Allspaw Operations Engineering.

Stupid Capacity Tricks
quick and dirty management

DSH
http://freshmeat.net/projects/dsh

[root@netmon101 ~]# cat group.of.servers

www100

www118

dbcontacts3

admin1

admin2

Page 50: Capacity Management for Web Operations John Allspaw Operations Engineering.

Stupid Capacity Tricks
quick and dirty management

[root@netmon101 ~]# dsh -N group.of.servers

dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
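
Output in this "host: result" shape is easy to scan for the odd one out; here dbcontacts3 reports PDT while the others report UTC. A sketch that does the comparison automatically is below. It only parses text already captured from dsh (it does not run dsh itself), and the function names are invented here.

```python
from collections import Counter

# The per-host lines from the dsh session above.
DSH_OUTPUT = """\
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
"""


def timezone_by_host(output):
    """Map each host to the timezone field of its `date` output."""
    zones = {}
    for line in output.splitlines():
        host, sep, stamp = line.partition(": ")
        if not sep:
            continue  # not a "host: result" line
        fields = stamp.split()
        if len(fields) >= 2:
            zones[host] = fields[-2]  # zone sits just before the year
    return zones


def odd_hosts(zones):
    """Hosts whose value differs from the most common one."""
    if not zones:
        return []
    common, _ = Counter(zones.values()).most_common(1)[0]
    return sorted(host for host, zone in zones.items() if zone != common)


print(odd_hosts(timezone_by_host(DSH_OUTPUT)))
```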

Page 51: Capacity Management for Web Operations John Allspaw Operations Engineering.

Stupid Capacity Tricks
Turn Stuff OFF

Disable heavy-ish features of the site (on/off switches)

We have 195 different things to disable in case of emergency.

Page 52: Capacity Management for Web Operations John Allspaw Operations Engineering.

Stupid Capacity Tricks
Turn Stuff OFF

uploads (photo)

uploads (video)

uploads by email

various API things

various mobile things

various search things

etc., etc.
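
A minimal sketch of what an on/off switch registry could look like. This is an illustration of the idea only, not how Flickr implements it; the class and the feature names are hypothetical.

```python
class FeatureSwitches:
    """In-memory on/off switches; every feature starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def disable(self, name):
        self._set(name, False)

    def enable(self, name):
        self._set(name, True)

    def is_enabled(self, name):
        # Unknown names raise KeyError so typos fail loudly.
        return self._enabled[name]

    def _set(self, name, state):
        if name not in self._enabled:
            raise KeyError(f"unknown feature: {name}")
        self._enabled[name] = state


switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])

# During an emergency, shed the heavier feature and keep the rest running.
switches.disable("video_uploads")

if switches.is_enabled("video_uploads"):
    print("accepting video uploads")
else:
    print("video uploads are temporarily off")
```

In practice the switch state would live somewhere every webserver can read it quickly, so flipping a switch does not require a deploy.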

Page 53: Capacity Management for Web Operations John Allspaw Operations Engineering.

Host your outage/status/blog page in more than one datacenter.

Tell your users WTF is going on, they’ll appreciate it.

Stupid Capacity Tricks
Outages Happen

Page 54: Capacity Management for Web Operations John Allspaw Operations Engineering.

Stupid Capacity Tricks
Hit the Pause Button

Bake the dynamic into static

Some Y! properties have a big red button to instantly bake (and un-bake) at will
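
A sketch of the bake/un-bake idea: render the dynamic page once, serve the stored copy while baked, and go back to rendering per request when un-baked. The class and function names are hypothetical, and a real setup would more likely write static files for the webserver or cache to serve.

```python
class BakeablePage:
    """Serve a page dynamically, or from a baked snapshot when one exists."""

    def __init__(self, render):
        self._render = render   # callable that produces the page body
        self._baked = None      # snapshot, or None when serving dynamically

    def bake(self):
        self._baked = self._render()

    def unbake(self):
        self._baked = None

    def serve(self):
        if self._baked is not None:
            return self._baked
        return self._render()


calls = {"n": 0}


def render_home():
    # Stand-in for an expensive dynamic render; counts how often it runs.
    calls["n"] += 1
    return f"<html>render #{calls['n']}</html>"


page = BakeablePage(render_home)
first = page.serve()     # dynamic render
page.bake()              # one more render, stored as the snapshot
baked_a = page.serve()   # served from the snapshot, no render
baked_b = page.serve()
page.unbake()
after = page.serve()     # dynamic again

print(calls["n"])
```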

Page 55: Capacity Management for Web Operations John Allspaw Operations Engineering.

thanks

http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/

Page 56: Capacity Management for Web Operations John Allspaw Operations Engineering.

We’re Hiring!
flickr.com/jobs

Come see me!

Page 57: Capacity Management for Web Operations John Allspaw Operations Engineering.

questions?

