
Cloud fail scaling to infinity but not beyond

Date post: 07-Nov-2014
Upload: kunal-johar
CLOUDFAIL SCALING TO INFINITY – BUT NOT BEYOND Kunal Johar
Transcript
Page 1: Cloud fail   scaling to infinity but not beyond

CLOUDFAIL
SCALING TO INFINITY – BUT NOT BEYOND

Kunal Johar

Page 2: Cloud fail   scaling to infinity but not beyond

MARCH 14, 2013
π Day

Page 3: Cloud fail   scaling to infinity but not beyond

What would you do?

• You take your senior design project to the next level

• You have some traction – 10-15 people a week using it

• A game-changing opportunity hits you in the face

• You need to scale to tens of thousands of users per week

Page 4: Cloud fail   scaling to infinity but not beyond

Act as If

• Scaling is no big deal, right?

• Amazon’s Elastic Cloud; Rackspace’s Infinite Capacity

• 50,000 is a small number even in O(N^2)

• I’m sure I can figure it out

Page 5: Cloud fail   scaling to infinity but not beyond

“We are counting on you”

• Our organization depends on this software for our annual operating budget

• This year was a total disaster. Multi-week outages.

• We need you to tell us that this will work, that the system won’t go down, no matter how much traffic we send to it

Page 6: Cloud fail   scaling to infinity but not beyond

No Problem

• “The old vendor was amateur hour”

• We’ll distribute the load across multiple servers

• We’ll load test

• We’ll scale up

• DON’T WORRY

Page 7: Cloud fail   scaling to infinity but not beyond

MAY 20, 2013
Paperwork Signed – Now the Challenge Begins

Page 8: Cloud fail   scaling to infinity but not beyond

Our Software Does it all (soon)

• It was a Brutal Summer

• We had 12 weeks to learn, architect, and build what ended up being 1800 hours worth of features

• The margin for error was Zero

• We also had to make sure our system would scale to meet the super-surge of traffic in January

Page 9: Cloud fail   scaling to infinity but not beyond

Full Team Buy-In

• The stakes were known to everyone.

• If we succeeded, we’d pivot ourselves to the top of the market.

• If we failed, half the team would be out of work

• Our client called failure “Mutually Assured Destruction”

Page 11: Cloud fail   scaling to infinity but not beyond

SEPTEMBER 2, 2013
Lots of Overtime, Heat, Stress, Anxiety. But we did it.

Page 12: Cloud fail   scaling to infinity but not beyond

Memo to Developers

Page 13: Cloud fail   scaling to infinity but not beyond

Load Test or Beta Test?

• From the September 1 launch date until today, we have been hit with new feature requests

• “Oh! I forgot about that – but it’s really important”

• How do you balance engineering priorities vs feature priorities?

Page 14: Cloud fail   scaling to infinity but not beyond

How to Construct a Load Test

• Write custom scripts that simulate real users using your app
  • Selenium WebDriver + Sauce Labs
  • BrowserMob (Neustar)
  • Load Impact

• Write a custom handler that simulates the user payload
  • Loader.io
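
The first option, scripting simulated users, can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the tooling named on the slide: `run_load_test` and `user_session` are hypothetical names, and `user_session` stands for whatever steps a real user would perform against your app.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(user_session, num_users, concurrency):
    """Run user_session(user_id) once per virtual user and summarize the run.

    user_session is a hypothetical callable supplied by the tester; it should
    raise an exception when the simulated user hits an error or a timeout.
    """

    def timed(user_id):
        start = time.perf_counter()
        try:
            user_session(user_id)
            ok = True
        except Exception:
            ok = False  # count failures instead of hiding them
        return ok, time.perf_counter() - start

    started = time.perf_counter()
    # At most `concurrency` sessions run at the same time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(num_users)))
    elapsed = time.perf_counter() - started

    successes = sum(1 for ok, _ in results if ok)
    return {
        "users": num_users,
        "successes": successes,
        "failures": num_users - successes,
        "users_per_second": num_users / elapsed if elapsed > 0 else float("inf"),
        "slowest_seconds": max((d for _, d in results), default=0.0),
    }
```

The failure count and the slowest session are reported next to the throughput on purpose: a single users-per-second figure says nothing about the sessions that did not survive.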

Page 15: Cloud fail   scaling to infinity but not beyond

Our Loader.io Script Payload

• POST 100 KB of data

• Simulate Save to Database

• GET 100 KB of data from Database
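
A handler of this shape might look like the Python sketch below. It is an assumption-heavy stand-in, not the handler we actually used: the in-memory `fake_db` dictionary replaces the real database, and `PayloadHandler`, `request`, and `run_demo` are hypothetical names. It only shows the three steps: accept a 100 KB POST, "save" it, and serve it back on GET.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PAYLOAD_SIZE = 100 * 1024  # 100 KB, as on the slide
fake_db = {}               # hypothetical in-memory stand-in for the database
db_lock = threading.Lock()


class PayloadHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        with db_lock:
            fake_db[self.path] = body  # "simulate save to database"
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        with db_lock:
            body = fake_db.get(self.path)
        if body is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def request(port, method, path, body=None):
    """Send one request on a fresh connection; return (status, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    try:
        conn.request(method, path, body=body)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def run_demo():
    """Start the handler on a free local port and do one POST + GET."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), PayloadHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        post_status, _ = request(port, "POST", "/user-1", b"x" * PAYLOAD_SIZE)
        get_status, data = request(port, "GET", "/user-1")
        return post_status, get_status, len(data)
    finally:
        server.shutdown()
        server.server_close()
```

Note what a payload like this does not exercise: every request costs the same, so it cannot reveal an operation that is far more expensive than the average one.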

Page 16: Cloud fail   scaling to infinity but not beyond

The Actual Load Test

Page 17: Cloud fail   scaling to infinity but not beyond

300+ Users Per Second!

• Whoo hoo!

• 300 users per second must mean what? Thousands of users per minute!

• I report a very successful load test to the client and put the matter to rest with some wishful thinking
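
The extrapolation behind that excitement is simple arithmetic, sketched here in Python. The `cost_multiplier` parameter is an assumption added for illustration, not something measured in the load test: it stands for how many times more expensive a real user action is than the simulated request.

```python
def users_per_minute(users_per_second):
    """Naive extrapolation: assume every second looks like the measured one."""
    return users_per_second * 60


def effective_users_per_minute(users_per_second, cost_multiplier=1.0):
    """Scale the naive figure down when real actions cost more than test ones.

    cost_multiplier is a hypothetical knob: 1.0 means real traffic is exactly
    as cheap as the test request; 100.0 means it is a hundred times heavier.
    """
    if cost_multiplier <= 0:
        raise ValueError("cost_multiplier must be positive")
    return users_per_minute(users_per_second) / cost_multiplier


print(users_per_minute(300))                   # 18000
print(effective_users_per_minute(300, 100.0))  # 180.0
```

With a multiplier of 1 the "thousands of users per minute" claim holds. The number only means something if real traffic costs about what the test request costs.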

Page 18: Cloud fail   scaling to infinity but not beyond

SURVIVORSHIP BIAS
http://youarenotsosmart.com/2013/05/23/survivorship-bias/

Page 19: Cloud fail   scaling to infinity but not beyond

Survivorship Bias

The misconception

You should focus on the successful if you wish to be successful

The truth

When failure becomes invisible, the difference between failure and success may also become invisible

Page 20: Cloud fail   scaling to infinity but not beyond

Survivorship Bias

• “A Cabal of Geniuses” assembled at the request of the White House

• Top women mathematicians (human computers), Nobel Prize Winners, researchers formed the Statistical Research Group

Page 21: Cloud fail   scaling to infinity but not beyond

Keeping Airplanes in the Sky

• At its lowest, the survivability of a WWII bomber was 50% on a mission

• “Ghosts already” is how airmen were known

• “How, the Army Air Force asked, could they improve the odds of a bomber making it home?”

Page 22: Cloud fail   scaling to infinity but not beyond

Armor

• Military commanders inspected the planes that made it back

• Ideally they could put armor on the whole plane, but then it wouldn’t fly

• Tons of bullet holes in key areas of the fuselage, wings, near the gunners

• The army was about to add plating to these parts of the bombers

Page 23: Cloud fail   scaling to infinity but not beyond

Armor

• The scientists successfully argued “Survivorship Bias”

• Stop looking at the survivors – it is the other parts of the plane that need more armor!

Page 24: Cloud fail   scaling to infinity but not beyond

WHAT IS “CLOUDSCALE”

Page 25: Cloud fail   scaling to infinity but not beyond
Page 26: Cloud fail   scaling to infinity but not beyond
Page 27: Cloud fail   scaling to infinity but not beyond
Page 28: Cloud fail   scaling to infinity but not beyond
Page 29: Cloud fail   scaling to infinity but not beyond
Page 30: Cloud fail   scaling to infinity but not beyond
Page 31: Cloud fail   scaling to infinity but not beyond
Page 32: Cloud fail   scaling to infinity but not beyond
Page 33: Cloud fail   scaling to infinity but not beyond
Page 34: Cloud fail   scaling to infinity but not beyond
Page 35: Cloud fail   scaling to infinity but not beyond
Page 36: Cloud fail   scaling to infinity but not beyond
Page 37: Cloud fail   scaling to infinity but not beyond
Page 38: Cloud fail   scaling to infinity but not beyond

LOL
WE DON’T DO THAT

Zack’s first comment as I concluded that presentation

Page 39: Cloud fail   scaling to infinity but not beyond

Our Architecture

Page 40: Cloud fail   scaling to infinity but not beyond

PaaS / IaaS

Page 41: Cloud fail   scaling to infinity but not beyond

WEEK OF JANUARY 6
Every Day is a Record Traffic Day

Page 42: Cloud fail   scaling to infinity but not beyond

Scale up on IaaS

• Someone trying to generate a 150 page PDF

• The norm is 10-15 pages…

• “OutOfMemoryException”

Page 43: Cloud fail   scaling to infinity but not beyond

Thursday, January 9, 2014

Page 44: Cloud fail   scaling to infinity but not beyond

Whoo Hoo!

• No Issues on our highest traffic day ever!

• “Can’t wait till that number hits 250 per minute!”

• “Tomorrow will be our biggest day yet!”

Page 45: Cloud fail   scaling to infinity but not beyond

Friday, January 10, 2014

• Approximately 12:00 Noon
  • Site traffic is around 185 people per minute, 50 fewer than the previous day’s high
  • 1 out of every 12 hits times out
  • According to Rackspace, a node is failing on CloudSites and will be taken out of rotation
  • About 10 complaints so far, but I email “Everything is under control”

• Approximately 12:30 PM
  • Traffic falls to about 150 people per minute
  • Things are fine
  • Phew

Page 46: Cloud fail   scaling to infinity but not beyond

Friday, January 10, 2014

• At 1:00 PM we have a job interview for a new support person

• I have live chat open with Rackspace and am hopping back and forth between it and the interview. Not the best way to hire someone

• 1:45 PM: the interview is over, and I learn traffic is at 220+ people per minute.

• The site is pretty much dead

• While I work on the issue, my phone is ringing with a frightened customer. Our help desk is filling up with complaints non-stop

• With a stone-cold face, I walk to my teammates. “This is bad. I need help”

Page 47: Cloud fail   scaling to infinity but not beyond

Backup Plan

• I knew CloudSites had some limit, but I had a plan to shift traffic at a moment’s notice in a worst case situation

Page 48: Cloud fail   scaling to infinity but not beyond

Backup Plan Now in Play

• Using CloudFlare, a service that lets us rapidly change DNS records, traffic was redirected to the super server

• 1 second later

Page 49: Cloud fail   scaling to infinity but not beyond

Backup Plan Part II (Scale Up)

• OK – I’ll spin up the most powerful server I can buy.

• 64 GB RAM

• 32 vCPU

Page 50: Cloud fail   scaling to infinity but not beyond

Backup Plan Part II

• 19 seconds later

Page 51: Cloud fail   scaling to infinity but not beyond

3:25 PM

• Rackspace gives me a one time “boost” to capacity

• Lets me know about “HTE” for the future…
  • “If you are having a high traffic event, let us know in advance”

• I kiss the floor. My company is saved by the whim of my hosting company

Page 52: Cloud fail   scaling to infinity but not beyond

9:00 PM

• Zack and I finish responding to customer complaints

• It would be weeks before I could sleep normally again

Page 53: Cloud fail   scaling to infinity but not beyond

What the heck happened?

• The initial load test was testing people submitting one application at a time

• The PDF issue was actually a harbinger of things to come

• Thursday had record traffic, but Friday had people doing “Finalization” (commits)

• Our commit code was very slow and used a lot of RAM. When a server got overloaded, its app pool would restart, which added load to the other servers

• Demand > Supply caused a chain reaction, making servers fail continually until more supply was added
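
The chain reaction can be illustrated with a toy Python model. Everything in it is a simplifying assumption rather than a description of the real hosting setup: demand is split evenly across healthy servers, a server whose share exceeds its capacity drops out (its app pool "restarts"), and it does not come back within the simulated window. `simulate_cascade` and the capacity numbers are hypothetical.

```python
def simulate_cascade(capacities, demand, max_rounds=10):
    """Return the number of healthy servers after each round.

    Toy model (assumption): demand is shared evenly by the healthy servers;
    any server whose share exceeds its capacity drops out, and its share
    lands on the survivors in the next round.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    for _ in range(max_rounds):
        if not healthy:
            break  # nothing left to serve traffic
        share = demand / len(healthy)
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # stable: every server can carry its share
        healthy = survivors
        history.append(len(healthy))
    return history


# Total capacity is 340 and demand is 330, yet the pool still collapses:
print(simulate_cascade([100, 90, 80, 70], demand=330))       # [4, 2, 0]
# Lower demand is stable:
print(simulate_cascade([100, 90, 80, 70], demand=270))       # [4]
# So is the same demand once more supply is added:
print(simulate_cascade([100, 90, 80, 70, 100], demand=330))  # [5]
```

In the first run total capacity exceeds demand on paper. The two weakest servers still fail first, their load lands on the other two, and the whole pool is gone one round later.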

Page 54: Cloud fail   scaling to infinity but not beyond

Our Future Plans

• I’m too scared of PaaS for a complex use case!

• Not enough data to know when things fail.

Page 55: Cloud fail   scaling to infinity but not beyond

Thanks!

Kunal Johar

[email protected]

