Cloud Scale Lessons Learned

Post on 16-Aug-2015

Cloud Scale: AWS and Azure Lessons Learned

October 15th, 2014

Nick Stephens

Cloud Scale Challenge

• Pariveda held an internal competition to build a highly scalable cloud application

• The application had to be built on 2 of the most popular clouds: AWS and Azure

• It was a great learning experience

Competition - Rules

• Build simple E-commerce site
  – Search for Products
  – Add to Cart
  – Submit Order

• Build on both AWS and Azure
  – Must use 3 services each cloud offers

• Best performance for price wins

Competition - SLAs

• Search for Product
  – 600,000 requests/min with response within 1 sec

• Add to Cart
  – 30,000 requests/min with response within 500 ms
  – Request must be persisted within 10 sec

• Submit Order
  – 3,000 requests/min with response within 500 ms
  – Request must be persisted within 10 sec
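For sizing purposes it can help to restate the per-minute targets as per-second rates. The following is a small sketch of that arithmetic in Python; the dictionary and function names are illustrative and do not come from the competition.

```python
# Per-minute request targets taken from the SLAs above.
SLA_REQUESTS_PER_MIN = {
    "search": 600_000,
    "add_to_cart": 30_000,
    "submit_order": 3_000,
}


def per_second(requests_per_min):
    """Convert a requests/min target into requests/sec."""
    return requests_per_min / 60


rates = {name: per_second(rpm) for name, rpm in SLA_REQUESTS_PER_MIN.items()}
total = sum(rates.values())

for name, rps in rates.items():
    print(f"{name}: {rps:,.0f} requests/sec")
print(f"combined: {total:,.0f} requests/sec")
```

Search dominates the load at 10,000 requests/sec, against 500 for Add to Cart and 50 for Submit Order.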

Competition - Deliverables

• Teams pick their most cost-effective solution

• Demo chosen solution to judges

• Must prove SLAs were met by generating load on system

My Team’s Solution

• Strategy
  – Re-use as much as possible
    • Chose IaaS over PaaS for portability
  – Pick the right technology for the problem
    • Chose NodeJS because of high networking and low CPU need
  – Handle Add to Cart and Submit Order requests asynchronously
    • Queue request to scale more easily
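The asynchronous strategy can be sketched as a handler that only enqueues the request and returns, with a separate worker doing the persistence. This is a minimal illustration in Python, not the team's NodeJS code: an in-memory `queue.Queue` stands in for Redis, a list stands in for cloud storage, and all function names are hypothetical.

```python
import queue
import threading

# In-memory stand-ins (assumptions): the team used Redis as the queue and
# cloud storage for persistence; both are replaced so the sketch runs anywhere.
order_queue = queue.Queue()
persisted = []


def handle_add_to_cart(request):
    """Web-tier handler: enqueue the request and respond right away."""
    order_queue.put(request)
    return {"status": "accepted"}


def worker():
    """Worker: drain the queue and persist each request."""
    while True:
        item = order_queue.get()
        if item is None:  # sentinel value: stop the worker
            order_queue.task_done()
            break
        persisted.append(item)  # stand-in for a slower storage write
        order_queue.task_done()


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()

responses = [handle_add_to_cart({"cart_id": i, "sku": "demo"}) for i in range(5)]

order_queue.put(None)
order_queue.join()  # wait until every queued item has been processed
worker_thread.join()
print(len(persisted), "requests persisted")
```

The handler's response time depends only on the enqueue, which is what lets the 500 ms response SLA and the 10 sec persistence SLA be met by separately scaled tiers.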

My Team’s Solution

• Development
  – Coded to an interface to abstract cloud-specific storage logic
    • Separate implementations for each cloud
  – Used Redis as a queue with Redisq library
    • VM with Redis on AWS
    • Redis Cache on Azure
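Coding to an interface with one implementation per cloud might look roughly like the sketch below. It is written in Python for illustration (the team used NodeJS); the class and function names are hypothetical, and plain dictionaries replace the real storage SDK calls.

```python
from abc import ABC, abstractmethod


class OrderStore(ABC):
    """Cloud-neutral storage interface (hypothetical name)."""

    @abstractmethod
    def save_order(self, order_id, order):
        """Persist one order."""

    @abstractmethod
    def get_order(self, order_id):
        """Return a previously saved order."""


class AwsOrderStore(OrderStore):
    """Stand-in for an AWS-backed store; a dict replaces real SDK calls."""

    def __init__(self):
        self._table = {}

    def save_order(self, order_id, order):
        self._table[order_id] = dict(order)

    def get_order(self, order_id):
        return self._table[order_id]


class AzureOrderStore(OrderStore):
    """Stand-in for an Azure-backed store; a dict replaces real SDK calls."""

    def __init__(self):
        self._entities = {}

    def save_order(self, order_id, order):
        self._entities[order_id] = dict(order)

    def get_order(self, order_id):
        return self._entities[order_id]


def make_store(cloud):
    """Pick the implementation once, at startup."""
    stores = {"aws": AwsOrderStore, "azure": AzureOrderStore}
    return stores[cloud]()


def submit_order(store, order_id, order):
    """Application code depends only on the OrderStore interface."""
    store.save_order(order_id, order)
    return store.get_order(order_id)
```

Only `make_store` knows which cloud is in use, so the rest of the application can be shared between the two deployments.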

My Team’s Solution

• AWS Architecture
  – NodeJS Web Server
  – Redis Server (Queue)
  – NodeJS Worker

• Services Used
  – EC2
  – DynamoDB
  – CloudSearch

My Team’s Solution

• Testing
  – Needed to generate heavy load on the system to prove SLAs
    • Built a custom load test rig to capture client response times and request persistence times
  – Response times were captured in SQL database for easy reporting
  – Used Remote Desktop to monitor servers
    • Watched CPU and network traffic to gauge performance
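A load test rig of this kind has two parts: timing each client call, and storing the timings somewhere that is easy to query. The sketch below is an assumption-laden illustration in Python, not the team's rig: `fake_search` stands in for an HTTP request to the site, the table layout is invented, and an in-memory SQLite database stands in for the SQL database.

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor


def fake_search(term):
    """Stand-in for an HTTP call to the site under test (assumption)."""
    time.sleep(0.001)
    return "ok"


def timed_call(term):
    """Return the client-side response time of one call, in milliseconds."""
    start = time.perf_counter()
    fake_search(term)
    return (time.perf_counter() - start) * 1000.0


# Generate concurrent load from a pool of client threads.
terms = [f"product-{i}" for i in range(200)]
with ThreadPoolExecutor(max_workers=8) as pool:
    timings_ms = list(pool.map(timed_call, terms))

# Capture the timings in a SQL table so reporting is a simple query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE response_times (operation TEXT, elapsed_ms REAL)")
db.executemany(
    "INSERT INTO response_times VALUES (?, ?)",
    [("search", ms) for ms in timings_ms],
)

SLA_MS = 1000.0  # Search SLA: response within 1 sec
count, worst_ms, within_sla = db.execute(
    "SELECT COUNT(*), MAX(elapsed_ms), SUM(elapsed_ms <= ?) FROM response_times",
    (SLA_MS,),
).fetchone()
print(f"{count} requests, worst {worst_ms:.1f} ms, {within_sla} within SLA")
```

A real rig would also need to record when each request became visible in storage in order to check the 10 sec persistence SLA; that part is omitted here.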

My Team’s Solution

• Competition Results
  – We demoed our solution but didn’t meet all SLAs
    • Only achieved approximately 300,000 searches/min
  – We hadn’t tested our system at that scale
    • We discovered a bottleneck during the demo
  – We didn’t have all of the deployment automated
    • We couldn’t quickly redeploy, scale out, and retest

Winning Team’s Solution

• Development
  – Developed AWS and Azure solution separately
    • Both started out using .NET on Windows
  – AWS solution switched to NodeJS on Linux
    • Linux servers are much cheaper than Windows
  – Azure solution ended up being cheaper
    • Higher SQS vs Azure storage transaction costs added up

Winning Team’s Solution

• Azure Architecture
  – .NET Web API
  – PaaS
  – Azure Storage

• Services Used
  – Web Roles
  – Worker Roles
  – Azure Storage

Winning Team’s Solution

• Testing
  – Wrote custom test harness
    • Could view aggregate results from test runners
  – Increased application servers until SLAs were met
  – Tried different sizes of instances
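Comparing instance sizes comes down to asking, for each size, how many servers are needed to hit the target and what that fleet costs. The Python sketch below shows the calculation only; the instance names, throughput figures, and hourly prices are invented for illustration and are not measurements or quoted prices from either cloud.

```python
import math

TARGET_RPS = 10_000  # 600,000 searches/min from the Search SLA

# Hypothetical instance sizes: (name, sustained requests/sec, USD per hour).
# Throughput is assumed to grow more slowly than price, as it might for a
# network-bound workload.
INSTANCE_TYPES = [
    ("small", 400, 0.05),
    ("medium", 700, 0.10),
    ("large", 1_200, 0.20),
]


def plan(target_rps, instance_types):
    """Return (name, instance_count, hourly_cost) options, cheapest first."""
    options = []
    for name, rps, price in instance_types:
        count = math.ceil(target_rps / rps)  # servers needed to meet the target
        options.append((name, count, round(count * price, 2)))
    return sorted(options, key=lambda option: option[2])


options = plan(TARGET_RPS, INSTANCE_TYPES)
for name, count, cost in options:
    print(f"{name}: {count} instances, ${cost:.2f}/hour")
```

Under these made-up numbers the many-small-instances option is cheapest; with real measurements the same calculation shows whichever size gives the best performance for price.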

Lessons Learned

• Scale Out not Up
  – This type of problem is a network bound problem
  – More instances were better than larger instances

• Synchronous writes were possible for this scenario
  – The teams that had synchronous writes had to scale out more
  – Asynchronous writes can be quicker and scale better

Lessons Learned

• Capture metrics to judge performance
  – Metrics can show bottlenecks
  – Objective measure of performance

• Use existing tools whenever possible
  – Some teams used load testing service instead of custom tool
  – Allowed those teams to focus more on application

Lessons Learned

• Automate deployment as much as possible
  – Fast and reliable process

• No clear winner in AWS vs Azure
  – Team submissions were split between AWS and Azure
  – Each cloud had similar but unique feature sets
  – Either cloud could have won with the right architecture

QUESTIONS?