# server architecture, availability & scaling strategy
leonidas tsementzis aka @goldstein
CTO, mobile architect
# who’s talking
* software architect, engineer [all major web/mobile platforms]
* devOps [enthusiast, not a real sysadmin]
* entrepreneur [n00b]
# the high-level requirements
* 2007
* take sport.gr to the next level...
* ... make sure it works smoothly...
* ... and fast enough
# i can see clearly now :)
* videos [goals, match coverage]
* comments [the blogging age, remember?]
* live streaming [Ustream does not exist, yet]
* live coverage of events [CoverItLive does not exist, yet]
* user-centric design [personalization, ratings]
* even more videos [I can haz more LOLCats]
# the problem :(
* we are planning for 150% traffic growth [looking 6 months ahead], but...
* video costs [bandwidth cost: 1€/GB]
* comment costs [DB writes, CPU, disk I/O]
* live streaming costs [bandwidth cost: 1€/GB]
* limited iron resources, not happy with our current host [dedicated managed servers in a top GR datacenter]
# S3 to the rescue
* 87% cost reduction [0.13€/GB vs 1€/GB]
* made videos section possible...
* ...and advertisers loved it ($$$+)
* first GR site to focus on video, key competitive advantage
* 6TB video traffic in the first month
* hired a video editing team to support the demand
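The math checks out: 6TB is roughly 6,000GB, so that first month would have cost about 6,000€ at the old 1€/GB rate versus roughly 780€ on S3 at 0.13€/GB, which is where the 87% comes from. A minimal sketch of pushing a clip to S3, written against today's boto3 API (the stack above predates it); bucket and file names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Upload a rendered clip and make it publicly readable, so the
# player can fetch it straight from S3 (or via CloudFront later).
s3.upload_file(
    "renders/goal-0312.mp4",           # hypothetical local file
    "sportgr-videos",                  # hypothetical bucket
    "2008/03/goal-0312.mp4",           # hypothetical key
    ExtraArgs={"ACL": "public-read", "ContentType": "video/mp4"},
)
```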
# EC2 servers on demand
* 3x(n) application servers for the main website [Windows 2003, IIS 6]
* 2x(n) application servers for APIs [Windows 2003, IIS 6]
* 2x(n) servers for banner managers [CentOS, Apache, OpenX]
* 1x storage server
* 2x database servers [MS SQL Server 2008 with failover]
* 2x reverse proxy cache servers [Squid]
* 2x load balancers [HAProxy with failover]
* 1x monitoring server [munin with a lot of custom plugins]
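The custom munin plugins mentioned above are just small executables that munin polls every few minutes; a minimal sketch of one in Python, assuming a hypothetical hit counter the application keeps on disk:

```python
#!/usr/bin/env python
"""Minimal munin plugin sketch: graphs application hits/sec."""
import sys

COUNTER_FILE = "/var/run/app_hits"  # hypothetical counter the app maintains

if len(sys.argv) > 1 and sys.argv[1] == "config":
    # munin calls the plugin with "config" to learn how to draw the graph
    print("graph_title Application hits per second")
    print("graph_vlabel hits/sec")
    print("hits.label hits")
    print("hits.type DERIVE")  # munin turns the raw counter into a rate
    print("hits.min 0")
else:
    # on a normal poll, print the current counter value
    with open(COUNTER_FILE) as f:
        print("hits.value %s" % f.read().strip())
```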
# a nice headache
:( :( :’(
# a typical week
* peaks at 3k hits/sec, once or twice a week
* normal rate of 300 hits/sec
* size for the 1st and you can't afford it
* size for the 2nd and you can't deliver the peaks
# auto-scaling to the rescue
* if average CPU usage grows over 60% for 2 minutes, add another application server [sketch below]
* if average CPU usage falls below 30% for 5 minutes, gracefully kill an application server
* 20 instances on peaks
* 3 instances (minimum) on normal operations
* no more “Server is busy” errors
* pay only for what you (really) need
* you can now sleep at night
* 60% overall cost reduction
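A sketch of the scale-out rule, written against today's boto3 API (the original setup predates boto3 and would have used Amazon's early auto-scaling tools); the group name web-asg is hypothetical, and the scale-in rule is the mirror image (below 30% for 5 minutes, remove one instance):

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Scale out: add one application server when the alarm below fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",    # hypothetical group name
    PolicyName="scale-out",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=120,
)

# Alarm: average CPU over 60% for 2 consecutive 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="web-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=60.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```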
# wait, there’s more!
* CDN & media streaming with CloudFront
* use multiple CNAMEs with CloudFront to parallelize HTTP requests [as YSlow recommends; sketch below]
* CloudFront custom domains are sexy
* robust DNS with Route 53
* simple monitoring with CloudWatch [you still need an external monitoring tool]
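A minimal sketch of the CNAME-sharding trick, assuming four hypothetical CloudFront CNAMEs on the same distribution; hashing the path keeps every asset on one stable hostname, so browser caches stay warm while requests spread across hosts:

```python
import hashlib

# Hypothetical CloudFront CNAMEs, all aliases of the same distribution.
CDN_HOSTS = ["static1.sport.gr", "static2.sport.gr",
             "static3.sport.gr", "static4.sport.gr"]

def cdn_url(path):
    """Return a CDN URL for an asset, always on the same host per path."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    host = CDN_HOSTS[int(digest, 16) % len(CDN_HOSTS)]
    return "http://%s%s" % (host, path)

# cdn_url("/img/logo.png") always resolves to the same shard for that path.
```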
# SUM()
* S3: photos, videos, static banners
* EC2: main website, SQL databases, backoffice, APIs, banner managers, cache servers, load balancers
* CloudWatch: auto-scaling, simple monitoring
* CloudFront: video streaming, CDN
* ELB: load balancing
* RDS: MySQL databases
* Route 53: DNS resolution
# lessons learned
* test, iterate, test, iterate
* reserved instances save you $$
* EC2 is a hacker playground [prepare for DoS attacks]
* back up entire AMIs to S3 [instances *WILL* #FAIL]
* EBS disk I/O is slow, but Amazon is working on this [problems with DB writes]
* spawning new instances is slow [15 mins of provisioning can be a show-stopper when scaling]
* S3 uploads/downloads are slow
* sticky sessions are a must [we replaced AWS ELB with HAProxy just for this; config sketch below]
* SLAs can't guarantee high availability [AWS *WILL* #FAIL]
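A minimal sketch of the cookie-based stickiness HAProxy gives you (and ELB lacked at the time); the backend name and server addresses are hypothetical:

```
backend app_servers
    balance roundrobin
    # insert a SERVERID cookie so a visitor keeps hitting the same
    # application server for the whole session
    cookie SERVERID insert indirect nocache
    server app1 10.0.0.11:80 check cookie app1
    server app2 10.0.0.12:80 check cookie app2
```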
# more lessons learned
* devOps people are hard to find [interested? I’m hiring]
* automate everything [lets you sleep at night]
* monitor everything [munin is your friend]
* disaster prevention [*ALWAYS* plan around the worst-case scenario]
* Windows server administration is a mess [and AWS is not making it prettier]
* DB scaling is the hardest part [it means code changes]
* legacy software *IS* a problem
  ** on scaling
  ** on hiring
  ** on growing (have you tried to use XMPP via ASP?)
# AWS is not perfect
* Akamai is still faster than CloudFront [especially in Greece]
* not affordable for large architectures [if you’re running 300+ instances, you should consider building your own datacenter]
# questions? challenges?
@goldstein aka leonidas tsementzis | leotsem [at] gmail.com
# thank you
@goldstein aka leonidas tsementzis | leotsem [at] gmail.com