# server architecture, availability & scaling strategy
leonidas tsementzis aka @goldstein
CTO, mobile architect
# who’s talking
* software architect, engineer [all major web/mobile platforms]
* devOps [enthusiast, not a real sysadmin]
* entrepreneur [n00b]
# the high-level requirements
* 2007
* take sport.gr to the next level...
* ... make sure it works smoothly...
* ... and fast enough
# i can see clearly now :)
* videos [goals, match coverage]
* comments [the blogging age, remember?]
* live streaming [Ustream does not exist, yet]
* live coverage of events [CoverItLive does not exist, yet]
* user-centric design [personalization, ratings]
* even more videos [I can haz more LOLCats]
# the problem :(
* we are planning for 150% traffic growth [looking 6 months ahead], but...
* video costs [bandwidth cost: 1€/GB]
* comment costs [DB writes, CPU, disk I/O]
* live streaming costs [bandwidth cost: 1€/GB]
* limited iron resources, not happy with our current host [dedicated managed servers in a top GR datacenter]
# S3 to the rescue
* 87% cost reduction [0.13€/GB vs 1€/GB]
* made videos section possible...
* ...and advertisers loved it ($$$+)
* first GR site to focus on video, key competitive advantage
* 6TB video traffic in the first month
* hired a video editing team to support the demand
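The math checks out: 6TB is roughly 6,000GB, so that first month would have cost about 6,000€ at the old 1€/GB rate versus roughly 780€ on S3 at 0.13€/GB, which is where the 87% comes from. A minimal sketch of pushing a clip to S3, written against today's boto3 API (the stack above predates it); bucket and file names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Upload a rendered clip and make it publicly readable, so the
# player can fetch it straight from S3 (or via CloudFront later).
s3.upload_file(
    "renders/goal-0312.mp4",           # hypothetical local file
    "sportgr-videos",                  # hypothetical bucket
    "2008/03/goal-0312.mp4",           # hypothetical key
    ExtraArgs={"ACL": "public-read", "ContentType": "video/mp4"},
)
```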
# EC2 servers on demand
* 3x(n) application servers for the main website [Windows 2003, IIS 6]
* 2x(n) application servers for APIs [Windows 2003, IIS 6]
* 2x(n) servers for banner managers [CentOS, Apache, OpenX]
* 1x storage server
* 2x database servers [MS SQL Server 2008 with failover]
* 2x reverse proxy cache servers [Squid]
* 2x load balancers [HAProxy with failover]
* 1x monitoring server [munin with a lot of custom plugins]
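The custom munin plugins mentioned above are just small executables that munin polls every few minutes; a minimal sketch of one in Python, assuming a hypothetical hit counter the application keeps on disk:

```python
#!/usr/bin/env python
"""Minimal munin plugin sketch: graphs application hits/sec."""
import sys

COUNTER_FILE = "/var/run/app_hits"  # hypothetical counter the app maintains

if len(sys.argv) > 1 and sys.argv[1] == "config":
    # munin calls the plugin with "config" to learn how to draw the graph
    print("graph_title Application hits per second")
    print("graph_vlabel hits/sec")
    print("hits.label hits")
    print("hits.type DERIVE")  # munin turns the raw counter into a rate
    print("hits.min 0")
else:
    # on a normal poll, print the current counter value
    with open(COUNTER_FILE) as f:
        print("hits.value %s" % f.read().strip())
```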
# a nice headache
:( :( :’(
# a typical week
* peaks at 3k hits/sec, once or twice a week
* normal rate of 300 hits/sec
* size for the 1st and you can't afford it
* size for the 2nd and you can't deliver the peaks
# auto-scaling to the rescue
* if average CPU usage grows over 60% for 2 minutes, add another application server [sketch below]
* if average CPU usage falls below 30% for 5 minutes, gracefully kill an application server
* 20 instances on peaks
* 3 instances (minimum) on normal operations
* no more “Server is busy” errors
* pay only for what you (really) need
* you can now sleep at night
* 60% overall cost reduction
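A sketch of the scale-out rule, written against today's boto3 API (the original setup predates boto3 and would have used Amazon's early auto-scaling tools); the group name web-asg is hypothetical, and the scale-in rule is the mirror image (below 30% for 5 minutes, remove one instance):

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Scale out: add one application server when the alarm below fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",    # hypothetical group name
    PolicyName="scale-out",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=120,
)

# Alarm: average CPU over 60% for 2 consecutive 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="web-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=60.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```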
# wait, there’s more!
* CDN & media streaming with CloudFront
* use multiple CNAMEs with CloudFront to parallelize HTTP requests [as YSlow recommends; sketch below]
* CloudFront custom domains are sexy
* robust DNS with Route 53
* simple monitoring with CloudWatch [you still need an external monitoring tool]
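A minimal sketch of the CNAME-sharding trick, assuming four hypothetical CloudFront CNAMEs on the same distribution; hashing the path keeps every asset on one stable hostname, so browser caches stay warm while requests spread across hosts:

```python
import hashlib

# Hypothetical CloudFront CNAMEs, all aliases of the same distribution.
CDN_HOSTS = ["static1.sport.gr", "static2.sport.gr",
             "static3.sport.gr", "static4.sport.gr"]

def cdn_url(path):
    """Return a CDN URL for an asset, always on the same host per path."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    host = CDN_HOSTS[int(digest, 16) % len(CDN_HOSTS)]
    return "http://%s%s" % (host, path)

# cdn_url("/img/logo.png") always resolves to the same shard for that path.
```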
# SUM()
* S3: photos, videos, static banners
* EC2: main website, SQL databases, backoffice, APIs, banner managers, cache servers, load balancers
* CloudWatch: auto-scaling, simple monitoring
* CloudFront: video streaming, CDN
* ELB: load balancing
* RDS: MySQL databases
* Route 53: DNS resolution
# lessons learned
* test, iterate, test, iterate
* reserved instances save you $$
* EC2 is a hacker playground [prepare for DoS attacks]
* back up entire AMIs to S3 [instances *WILL* #FAIL]
* EBS disk I/O is slow, but Amazon is working on this [problems with DB writes]
* spawning new instances is slow [15 mins of provisioning can be a show-stopper when scaling]
* S3 uploads/downloads are slow
* sticky sessions are a must [we replaced AWS ELB with HAProxy just for this; config sketch below]
* SLAs can't guarantee high availability [AWS *WILL* #FAIL]
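A minimal sketch of the cookie-based stickiness HAProxy gives you (and ELB lacked at the time); the backend name and server addresses are hypothetical:

```
backend app_servers
    balance roundrobin
    # insert a SERVERID cookie so a visitor keeps hitting the same
    # application server for the whole session
    cookie SERVERID insert indirect nocache
    server app1 10.0.0.11:80 check cookie app1
    server app2 10.0.0.12:80 check cookie app2
```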
# more lessons learned
* devOps people are hard to find [interested? I’m hiring]
* automate everything [lets you sleep at night]
* monitor everything [munin is your friend]
* disaster prevention [*ALWAYS* plan around the worst-case scenario]
* Windows server administration is a mess [and AWS is not making it prettier]
* DB scaling is the hardest part [it means code changes]
* legacy software *IS* a problem
  ** on scaling
  ** on hiring
  ** on growing (have you tried to use XMPP via ASP?)
# AWS is not perfect
* Akamai is still faster than CloudFront [especially in Greece]
* not affordable for large architectures [if you’re running 300+ instances, you should consider building your own datacenter]
# questions? challenges?
@goldstein aka leonidas tsementzis | leotsem [at] gmail.com
# thank you
@goldstein aka leonidas tsementzis | leotsem [at] gmail.com