Post on 15-Jan-2015
Not everything that happens in Vegas stays in Vegas
DevOps or “getting devs to be on call for what they ship” :-)
Netflix development
Priorities
1. Speed of innovation
2. Availability
3. Running costs
a. “It’ll cost what it ends up costing”
In practise, they found that holding to the first two ended
up costing way less than otherwise expected.
Riot Games + League of Legends
Cloud == ideal for MMOs. Solve launch issues.
● chef gets used a lot here.
○ talked about their evolution with it, lessons learned
● What sucked?
○ 25 minute bootstrap runs
○ External dependencies (including S3)
○ Duplicating application deployment recipes
● golden masters and immutable servers simplify your
life drastically.
● “if you’re doing chef without Berkshelf you’re doing it
wrong”
● Make it easy to throw up new things
Testing in production
Netflix, Riot, Kickstarter - they all do this.
At scale.
Netflix
● 10s to 100s of code pushes per day
● 1000s to 100,000s of config changes per day
○ they tune their A/B testing constantly
Of course, they also have the instrumentation to react to
this.
How’re other people doing DevOps?
Good news - we’re at the “more sophisticated” end of the
spectrum.
Every “cloud native” was doing this.
Things other people did better:
● “Golden master” AMIs
● Immutable instances
● Absolute ownership of vertical slices
● Config-management (chef/puppet) featured
prominently
● Extensive monitoring+logs+visibility == “table stakes”
○ for developers!
● Easy to throw up new things
● Run many small, simple, collaborating things
Who? Riot Games, Netflix, change.org, Kickstarter
Logging aggregation is important
Lots of 3rd party companies are offering centralized
logging services; there's a huge appetite for logging
and monitoring.
● http://logentries.com/
● http://www.loggly.com/
● http://papertrailapp.com/
● https://www.splunkstorm.com/tour
● http://www.datadoghq.com/
● DIY - Lumberjacking slides
DEMO: Monitoring & Logging
https://app.datadoghq.com/infrastructure
● Tag Metrics, awesome Metric discoverability
● CloudWatch integration
○ I never knew I could see ELB metrics :-)
● Alarms are integrated
● You can template Dashboards
https://papertrailapp.com/
● Can Search, Save Searches, Alerts on searches
● No alert on patterns
● Archive to S3 / Push to Redshift
Logging aggregation is FOR DEVELOPERS!!!
Saves lots of time when you’re on call.
Loggly Session
Benefit of logging as a service.
● When your infrastructure is in trouble, you do not
want your log analytics system running on the
same infrastructure.
AWS Services that loggly could use:
● Kafka + Storm vs Kinesis
● Elasticsearch vs CloudSearch
Predictive Analytics using Storm, Hadoop, R and
AWS
http://www.youtube.com/watch?v=6Sl3eBmDheE
Loggly Session
● Provisioned IOPS solve all issues :)
● ELBs do not perform well under an extremely high
volume of requests.
● DNS round robin is a very good basic load
balancing solution
● Cassandra works very well for application data.
● Cassandra does not work well as a queue system,
hard to track order of events.
● Keep the architecture simple.
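To illustrate the DNS round robin point above, here is a minimal sketch of the idea: a name with several A records whose order rotates on each query, so clients that take the first answer get spread across the servers. The class name and the IP addresses are hypothetical, and this is a toy model, not a real DNS server. Plain round robin also does no health checking, so a dead server keeps receiving its share of traffic.

```python
from collections import deque


class RoundRobinRecordSet:
    """Toy model (hypothetical) of one DNS name with several A records."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one address is required")
        self._records = deque(addresses)

    def resolve(self):
        """Return the current record order, then rotate for the next query."""
        answer = list(self._records)
        self._records.rotate(-1)  # move the first record to the end
        return answer


# Hypothetical backend addresses, purely for illustration.
records = RoundRobinRecordSet(["10.0.0.1", "10.0.0.2", "10.0.0.3"])

# Most clients connect to the first address in the answer.
first_choices = [records.resolve()[0] for _ in range(6)]
print(first_choices)
```

Six consecutive lookups land on each of the three addresses twice, which is all the "load balancing" this technique provides.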
Large Scale Load Testing on AWS
Many types of load
● Load testing
○ (running a marathon), predict future load and
plan in advance
● Stress testing
○ Break things (figure out limits), mitigation
plans
● Resilience test
○ Figure out how many parts of the architecture
you can lose and still operate
● Performance test
○ How do latency and throughput change when
the load increases
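As a rough illustration of the resilience question above, the arithmetic can be sketched as follows. The function name and all numbers are hypothetical, and the sketch assumes identical instances with load spread evenly across them, which real systems rarely achieve. It gives a starting estimate to verify by actually removing instances, not a substitute for the test.

```python
import math


def tolerable_losses(instance_count, capacity_per_instance, peak_load):
    """Estimate how many instances can fail while the rest still carry peak_load.

    Assumes identical instances and perfectly even load distribution.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(peak_load / capacity_per_instance)
    return max(0, instance_count - needed)


# Hypothetical cluster: 10 instances, 500 requests/s each, 3200 requests/s at peak.
losses = tolerable_losses(10, 500, 3200)
print(losses)
```

With these assumed numbers, seven instances are needed to carry the peak, so the estimate says three can be lost.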
Phased roll-out and measure
● Load Testing is necessary but not sufficient.
○ Deploy to alpha cluster.
○ The release cycle is important: phased
deployment, start with one box, monitor and ramp up.
○ Monitor performance and behaviour, look at the
99th percentile of the traffic, not at the average.
● Netflix records 1.2 billion metrics per day
○ 5-minute SLA
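The advice to look at the tail rather than the average can be shown with a small sketch. The latency values are made up for illustration, and the nearest-rank percentile below is one common definition among several; monitoring tools may compute percentiles differently.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(mean_ms, p50_ms, p99_ms)
```

In this made-up sample the mean is about 60 ms, which looks healthy, while the 99th percentile is 2000 ms. The average hides exactly the requests that users complain about.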
Gameday
We took part in the AWS Gameday
http://www.awsgameday.com/whatisgameday.html
Inspired by the 2012 Obama For America DevOps
and Amazon.com ops teams
● Build an Autoscaling application
● Exchange administrative IAM credentials with
the other team
● Break your opponent's systems
● Restore your system
● Lessons learned
Who is interested if we wanted to run this?
It needs a full day, ~ 6 hours.
Weekday?
Weekend?
Twitter: @petemounce