Capacity Management for Web Operations

Date post: 16-May-2015
Upload: john-allspaw
Transcript
Page 1: Capacity Management for Web Operations

Capacity Management for Web Operations

John Allspaw
Operations Engineering

Page 2: Capacity Management for Web Operations

the book I’m writing

Page 3: Capacity Management for Web Operations

???

Page 4: Capacity Management for Web Operations

Rules of Thumb

Planning/Forecasting

Stupid Capacity Tricks

(with some Flickr statistics sprinkled in)

Page 5: Capacity Management for Web Operations

bugs (disguised as capacity problems)

edge cases (disguised as capacity problems)

security incidents

real capacity problems*

* (should be the last thing you need to worry about)

Things that can cause downtime

Page 6: Capacity Management for Web Operations

Capacity != Performance

Forget about performance for right now

Measure what you have right NOW

Don’t count on it getting any better

Page 7: Capacity Management for Web Operations

Thank You HPC Industry!

Automated Stuff

Scalable Metric Collection/Display

a lot of great deployment and management tricks come from them, adopted by web ops

Page 8: Capacity Management for Web Operations

Good Measurement

Tools

record and store
metrics in/out
custom metrics
easily compare
lightweight-ish

I

Page 9: Capacity Management for Web Operations

Clouds need planning too

Makes deployment and procurement easy and quick

But clouds are still resources with costs and limits, just like your own stuff

Black-boxes: you may need to pay even more attention than before

Page 10: Capacity Management for Web Operations

Metrics

System Statistics

Page 11: Capacity Management for Web Operations

Metrics: “Application” Level

(photos processed per minute)

(average processing time per photo)

(apache requests)

(concurrent busy apache procs)

Page 12: Capacity Management for Web Operations

Metrics: App-level meets system-level

here, total CPU = ~1.12 * # busy apache procs (ymmv)
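
One way to arrive at a ratio like this is a least-squares fit through the origin over paired samples of busy Apache processes and total CPU. The sketch below is only an illustration: the sample pairs are made up, the name `fit_ratio` is hypothetical, and the slide does not say how the ~1.12 figure was actually derived.

```python
def fit_ratio(samples):
    """Least-squares slope of a line through the origin: cpu ~= ratio * procs."""
    num = sum(procs * cpu for procs, cpu in samples)
    den = sum(procs * procs for procs, _ in samples)
    if den == 0:
        raise ValueError("need at least one sample with busy procs > 0")
    return num / den


# Hypothetical (busy apache procs, total CPU %) pairs, not Flickr data.
samples = [(10, 11.0), (25, 28.5), (40, 44.6), (60, 67.5)]
ratio = fit_ratio(samples)
print(f"total CPU ~= {ratio:.2f} * busy apache procs")
```

Forcing the line through the origin matches the form on the slide (no constant term); a fit with an intercept may describe a box with significant idle-time CPU better.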

Page 13: Capacity Management for Web Operations

2400

photos per minute being uploaded right NOW (Tuesday afternoon)

Page 14: Capacity Management for Web Operations

Ceilings

the maximum amount of “work” your resources will allow before degradation or failure

Page 15: Capacity Management for Web Operations

Forget Benchmarking

Page 16: Capacity Management for Web Operations

Find your ceilings

The End

what you have left

Page 17: Capacity Management for Web Operations

Use real live production data to find ceilings

Production: “it’s like a lab, but bigger!”

Page 18: Capacity Management for Web Operations

Like: database ceilings

replication lag: bad!

Page 19: Capacity Management for Web Operations

Ceilings

waiting on disk too much

sustained disk I/O wait >40% creates slave lag*

* for us, YMMV
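
A check for "sustained" I/O wait could be sketched as below. The 40% threshold comes from the slide; the sample values, the run length of 4 samples, and the function name are assumptions made for illustration.

```python
def sustained_above(samples, threshold, min_run):
    """Return True if at least `min_run` consecutive samples exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it when a sample drops back down.
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            return True
    return False


# Hypothetical disk I/O wait percentages, one per collection interval.
iowait = [12, 35, 42, 44, 47, 51, 38]
print(sustained_above(iowait, threshold=40, min_run=4))
```

Requiring a run of consecutive samples keeps a single short spike from being treated as a ceiling.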

Page 20: Capacity Management for Web Operations

35,000 photo requests per second on a Tuesday peak

Page 21: Capacity Management for Web Operations

Safety Factors

Page 22: Capacity Management for Web Operations

Safety Factors

Ceiling * Factor of Safety = UR LIMITZ

Page 23: Capacity Management for Web Operations

Safety Factors

webserver!

Page 24: Capacity Management for Web Operations

what you have left

“safe” ceiling

@85% CPU

Safety Factors

85% total CPU = ~76 busy apache procs
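
The arithmetic behind this slide can be sketched as follows, combining the ~1.12 CPU-per-busy-proc ratio from the earlier Metrics slide with the 85% CPU safe ceiling. Treating the raw ceiling as 100% CPU and the factor of safety as 0.85 is an assumption, as is the function name.

```python
def safe_proc_limit(ceiling_cpu_pct, factor_of_safety, cpu_per_proc):
    """Busy-process count at which total CPU reaches the safe ceiling."""
    safe_cpu_pct = ceiling_cpu_pct * factor_of_safety
    return safe_cpu_pct / cpu_per_proc


# Assumed: raw ceiling of 100% CPU, factor of safety 0.85, ratio 1.12.
limit = safe_proc_limit(100.0, 0.85, 1.12)
print(round(limit))  # 76
```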

Page 25: Capacity Management for Web Operations

Safety Factors

Yahoo Front Page link to Chinese New Year photos

(photo requests/second)

(8% spike)

Page 26: Capacity Management for Web Operations

Forecasting

Page 27: Capacity Management for Web Operations

Forecasting

Fictional Example: webservers

Page 28: Capacity Management for Web Operations

Forecasting

Fictional example: 15 webservers. 1 week.

peak of the week

Page 29: Capacity Management for Web Operations

...bigger sample, 6 weeks....isolate the peaks...

Forecasting

Page 30: Capacity Management for Web Operations

...“Add a Trendline” with some decent correlation...

Forecasting

not too shabby

now

Page 31: Capacity Management for Web Operations

Forecasting

15 servers @76 busy apache proc limit = 1140 total procs

when is this?

this will tell you when it is

ceiling

what you have left

Page 32: Capacity Management for Web Operations

Forecasting

(week #10, duh)

(1140-726) / 42.751 = 9.68
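
The same calculation can be sketched in code. The slope (42.751 busy procs per week), the current peak (726), and the limit (1140) are the slide's fictional numbers; the helper names and the ordinary least-squares fit standing in for the spreadsheet trendline are assumptions.

```python
def linear_fit(ys):
    """Least-squares line through (1, ys[0]), (2, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_remaining(limit, current, slope):
    """Weeks until `current` grows to `limit` at `slope` units per week."""
    return (limit - current) / slope


print(round(weeks_remaining(1140, 726, 42.751), 2))  # 9.68
```

In practice `linear_fit` would be fed the isolated weekly peaks, and its slope passed to `weeks_remaining`.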

Page 33: Capacity Management for Web Operations

Writing excel macros is boring

All we want is “days remaining”, so all we need is the curve-fit

Forecasting Automation

Use http://fityk.sf.net to automate the curve-fit

Page 34: Capacity Management for Web Operations

Forecasting

Fictional Example: storage consumption

Page 35: Capacity Management for Web Operations

Forecasting Automation

actual flickr storage consumption from early 2005, in GB (ceiling is fictional)

this will tell you when this is

Page 36: Capacity Management for Web Operations

Forecasting Automation: cmd line script output

[jallspaw:~]$ cfityk ./fit-storage.fit

1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...

Page 37: Capacity Management for Web Operations

Forecasting Automation

(SAME)

fityk gave:

y = 0.786854x^2 + 146.657x + 14147.4

( R^2 = 99.84)

Excel gave:

y = 0.7675x^2 + 146.96x + 14147.3

( R^2 = 99.84)
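
Once the curve-fit is in hand, finding where it crosses the ceiling is a matter of solving the quadratic. The sketch below uses the fityk coefficients from this slide; the 20,000 GB ceiling and the function name are hypothetical, and x is in whatever units the data file uses.

```python
import math


def x_at_ceiling(a, b, c, ceiling):
    """Positive x where a*x^2 + b*x + c reaches `ceiling`.

    Assumes a > 0 and c <= ceiling (the curve starts below the ceiling).
    """
    disc = b * b - 4 * a * (c - ceiling)
    if disc < 0:
        raise ValueError("curve never reaches the ceiling")
    return (-b + math.sqrt(disc)) / (2 * a)


a, b, c = 0.786854, 146.657, 14147.4  # fityk fit from the slide
ceiling_gb = 20000.0                  # hypothetical ceiling, in GB
x = x_at_ceiling(a, b, c, ceiling_gb)
print(f"ceiling reached at x = {x:.1f}")
```

Subtracting the x value of the most recent data point from the result gives the "remaining" figure the earlier slide asks for.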

Page 38: Capacity Management for Web Operations

Capacity Health

12,629 nagios checks

1314 hosts

6 datacenters

4 photo “farms”

farm = 2 DCs (east/west)

Page 39: Capacity Management for Web Operations

High and Low Water Marks

alert if higher

alert if lower

Per server, squid requests per second
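
A high/low water mark check can be sketched in a few lines. The thresholds below are hypothetical, not Flickr's values, and the function name is an assumption.

```python
def check_water_marks(value, low, high):
    """Classify a metric against low and high water marks."""
    if value > high:
        return "HIGH"  # approaching the ceiling
    if value < low:
        return "LOW"   # suspiciously idle: possibly out of rotation or broken
    return "OK"


# Hypothetical per-server squid requests per second.
for req_per_sec in (40, 600, 1020):
    print(req_per_sec, check_water_marks(req_per_sec, low=100, high=950))
```

The low mark matters as much as the high one: a server doing far less work than its peers is often a sign of a problem that a ceiling-only alert would miss.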

Page 40: Capacity Management for Web Operations

A good dashboard looks something like...

type      #    limit/box  ceiling units  limit (total)  current (peak)  % peak   Est days left
www       20   80         busy procs     1600           1000            62.50%   36
shard db  20   40         I/O wait       800            220             27.50%   120
squid     18   950        req/sec        17,100         11,400          66.67%   48

(yes, fictional numbers)
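
The derived columns of such a dashboard follow directly from the per-box limit and the node count. This sketch recomputes "limit (total)" and "% peak" from the slide's fictional rows; the "Est days left" column would come from a forecast and is not computed here. The function and variable names are assumptions.

```python
def summarize(count, limit_per_box, current_peak):
    """Return (total limit, percent of limit used at peak)."""
    total = count * limit_per_box
    return total, 100.0 * current_peak / total


# (type, # of boxes, limit per box, ceiling units, current peak)
rows = [
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "I/O wait", 220),
    ("squid", 18, 950, "req/sec", 11400),
]

for kind, count, limit_per_box, units, peak in rows:
    total, pct = summarize(count, limit_per_box, peak)
    print(f"{kind:<9} {total:>6} {units:<10} {pct:6.2f}%")
```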

Page 41: Capacity Management for Web Operations

Diagonal Scaling

Image processing machines

Replace Dell PE860s with HP DL140G3s

vertically scaling your already horizontal nodes

Page 42: Capacity Management for Web Operations

Diagonal Scaling example: image processing

4 cores

8 cores

(about the same CPU “usage” per box)

Page 43: Capacity Management for Web Operations

~45 images/min @ peak

~140 images/min @ peak

(same CPU usage, but ~3x more work)
“processing” means making 4 sizes from originals

Diagonal Scaling example: image processing throughput

Page 44: Capacity Management for Web Operations

Diagonal Scaling example: image processing

went from: 23 Dell PE860s, 1035 photos/min, 23U rack, 3008.4 Watts
to: 8 HP DL140 G3s, 1120 photos/min (75% faster, even), 8U rack, 1036.8 Watts

!!!
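
The per-box and per-watt comparison implied by these numbers can be worked out as below. All inputs are taken from the slide; the dictionary layout and function names are assumptions.

```python
before = {"boxes": 23, "photos_per_min": 1035, "watts": 3008.4, "rack_units": 23}
after = {"boxes": 8, "photos_per_min": 1120, "watts": 1036.8, "rack_units": 8}


def per_box(fleet):
    """Peak photos processed per minute by a single box."""
    return fleet["photos_per_min"] / fleet["boxes"]


def watts_per_photo_per_min(fleet):
    """Power drawn for each photo/min of processing capacity."""
    return fleet["watts"] / fleet["photos_per_min"]


for name, fleet in (("before", before), ("after", after)):
    print(
        f"{name}: {per_box(fleet):.0f} photos/min per box, "
        f"{watts_per_photo_per_min(fleet):.2f} W per photo/min"
    )
```

The per-box figures line up with the ~45 and ~140 images/min peaks quoted on the throughput slide.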

Page 45: Capacity Management for Web Operations

3.52

terabytes will be consumed today (on a Tuesday)

Page 46: Capacity Management for Web Operations

2nd Order Effects (beware the wandering bottleneck)

www

LB

www

memcached db search

running hot, so add more

Page 47: Capacity Management for Web Operations

2nd Order Effects (beware the wandering bottleneck)

www

LB

www www www

memcached db search

running great now, so more traffic!

now these run hot

Page 48: Capacity Management for Web Operations

Stupid Capacity Tricks

Page 49: Capacity Management for Web Operations

Stupid Capacity Tricks: quick and dirty management

DSH: http://freshmeat.net/projects/dsh

[root@netmon101 ~]# cat group.of.servers

www100
www118
dbcontacts3
admin1
admin2

Page 50: Capacity Management for Web Operations

Stupid Capacity Tricks: quick and dirty management

[root@netmon101 ~]# dsh -N group.of.servers

dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
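
For readers without dsh, the same fan-out idea can be sketched in Python. This is not how dsh is implemented; it assumes a working local `ssh` client with key-based access to each host, and the function names are hypothetical. The runner is pluggable so the fan-out logic can be exercised without touching real servers.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command):
    """Run `command` on `host` via the local ssh client; return its stdout."""
    result = subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, timeout=30
    )
    return result.stdout.strip()


def run_on_group(hosts, command, runner=ssh_run):
    """Run `command` on every host in parallel; return {host: output}."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        outputs = list(pool.map(lambda host: runner(host, command), hosts))
    return dict(zip(hosts, outputs))


# Dry run with a stand-in runner, so nothing is actually contacted.
hosts = ["www100", "www118", "dbcontacts3", "admin1", "admin2"]
for host, output in run_on_group(hosts, "date", runner=lambda h, c: f"would run {c!r}").items():
    print(f"{host}: {output}")
```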

Page 51: Capacity Management for Web Operations

Stupid Capacity Tricks: Turn Stuff OFF

Disable heavy-ish features of the site (on/off switches)

We have 195 different things to disable in case of emergency.

Page 52: Capacity Management for Web Operations

Stupid Capacity Tricks: Turn Stuff OFF

uploads (photo)

uploads (video)

uploads by email

various API things

various mobile things

various search things

etc., etc.
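
A minimal sketch of such on/off switches, assuming a simple in-memory registry. The class, the feature names, and the handler are hypothetical; the slides do not describe how Flickr's switches are implemented, and a real deployment would need the switch state shared across all servers.

```python
class FeatureSwitches:
    """In-memory on/off switches; every feature starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def set(self, name, enabled):
        if name not in self._enabled:
            raise KeyError(f"unknown feature: {name}")
        self._enabled[name] = enabled

    def is_enabled(self, name):
        return self._enabled[name]


switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
switches.set("video_uploads", False)  # emergency: shed the heavy feature


def handle_video_upload(switches):
    """Request handler that degrades gracefully when the switch is off."""
    if not switches.is_enabled("video_uploads"):
        return "Video uploads are temporarily disabled."
    return "upload accepted"


print(handle_video_upload(switches))
```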

Page 53: Capacity Management for Web Operations

Host your outage/status/blog page in more than one datacenter.

Tell your users WTF is going on, they’ll appreciate it.

Stupid Capacity Tricks: Outages Happen

Page 54: Capacity Management for Web Operations

Stupid Capacity Tricks: Hit the Pause Button

Bake the dynamic into static

Some Y! properties have a big red button to instantly bake (and un-bake) at will

Page 55: Capacity Management for Web Operations

thanks

http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/

Page 56: Capacity Management for Web Operations

We’re Hiring!
flickr.com/jobs

Come see me!

Page 57: Capacity Management for Web Operations

questions?
