
2012 - A Release Odyssey

Date post: 15-Jan-2015
Upload: ernest-mueller
Description:
Lightning talk for DevOpsDays Austin 2013 on taking releases from a 10 week to 1 week cadence. Sorry about the format, had to go from Keynote to PDF and since it was a lightning talk all the actual content's in the notes.
Transcript
Page 1: 2012 - A Release Odyssey

DevOpsDays Austin 2013

@ernestmueller | @bazaarvoice

2012: A Release Odyssey

Hi, I’m Ernest Mueller from Bazaarvoice here in Austin. We’re the biggest SaaS company you’ve never heard of; our primary application is for the collection and display of user generated content – for example, ratings and reviews – and a lot of the biggest Internet retailers use our solution on their sites for that purpose. We pushed out more than 1bn reviews last Cyber Monday. I’m going to tell you how we went from releasing our code once every ten weeks to once a week in a pretty short time.

Page 2: 2012 - A Release Odyssey

The Monolith! Bazaarvoice Conversations, aka PRR, has 15,000 files and 4.9M lines of code, the oldest from Feb 2006, and that’s not counting UI versions, customer config, or operations code repos (all of which get released along with it). Written by generations of coders, including outsourcer partners.

It runs across 1200 hosts in 4 datacenters: Rackspace, and AWS East, West, and Ireland.

So by any measure this was a large legacy system.

Page 3: 2012 - A Release Odyssey

BV had gone agile and said “Let’s release more quickly too! All the cool kids are doing it! We’re doing two week sprints, so let’s release biweekly - go!” They tried it two weeks after a big ten-week release, and PRR v5.1 launched on January 19th, 2012. Whoops, it’s not that easy - 44 client tickets logged, mass hysteria. “Let’s not do that again!”

Page 4: 2012 - A Release Odyssey

Enter yours truly on January 30th. “You’re hired! We want biweekly releases in a month. With zero user facing downtime. Failure is not an option! Go!” It wasn’t just an irrational need for speed; the product organization wanted to get faster A/B testing, more piloting, etc. and the engineering team wanted the benefits of a more continuous flow as well.

Page 5: 2012 - A Release Odyssey

Careful analysis of the situation was warranted. Luckily a SWAT team had been analyzing the problem already. They had found two major impediments, both of which are frequently encountered in legacy implementations:

• Lack of automation in testing - testing was a huge burden and couldn’t be done sufficiently in the time allotted

• Poor SCM code discipline - checkins continuing up to the release

Page 6: 2012 - A Release Odyssey

Path One - Testing! We hired up QA automation people and set them to work. We set the expectation, backed up strongly by the product team, that the development teams had to stop and do three testing sprints. We have a standard four-environment setup - dev, QA, staging, production.

Page 7: 2012 - A Release Odyssey

JUnit testing and CIT testing in TeamCity was ramped up. A Selenium-based “Testmaster” system was used to improve the level of regression automation to safe levels. More importantly perhaps, we adopted a new discipline of not running all the tests all the time: feature/story tests in dev, regression in QA, and smoke testing in staging and production.
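
The "don't run everything everywhere" discipline can be pictured as a simple lookup from environment to test suite. This is only a rough sketch: the suite names and the `suites_for` helper are hypothetical, not how TeamCity or Testmaster were actually configured.

```python
# Hypothetical mapping of environment -> test suites, following the split
# described above: feature/story in dev, regression in QA, smoke in
# staging and production.
SUITES_BY_ENVIRONMENT = {
    "dev": ["feature"],
    "qa": ["regression"],
    "staging": ["smoke"],
    "production": ["smoke"],
}


def suites_for(environment):
    """Return the list of test suites to run in the given environment."""
    key = environment.lower()
    if key not in SUITES_BY_ENVIRONMENT:
        raise ValueError(f"unknown environment: {environment!r}")
    return SUITES_BY_ENVIRONMENT[key]
```

The point of the table is that the expensive regression pass happens exactly once, in QA, instead of being repeated at every promotion step.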

Page 8: 2012 - A Release Odyssey

Branching - we changed over to a trunk/release branch model: a release branch splits off every 2 weeks, with no commits to the branch without going through a code freeze break process. Process enforcement via wiki! Trunk goes to dev twice daily; the branch goes to QA, and when labeled “verified” it goes to staging and then to production.
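
The promotion rule for the release branch can be sketched as a tiny gate function. This is an illustration under assumptions: `next_environment` and the label set are hypothetical names, and in reality the rule lived in process and wiki pages rather than in code.

```python
# Order in which a release-branch build is promoted (trunk goes to dev
# separately, twice daily, and is not modeled here).
PROMOTION_ORDER = ["qa", "staging", "production"]


def next_environment(current, labels):
    """Return the next environment for a build, or None if it must stay put.

    A build only moves forward once it carries the "verified" label, and
    production is the end of the line.
    """
    if current not in PROMOTION_ORDER:
        raise ValueError(f"unknown environment: {current!r}")
    if current == "production" or "verified" not in labels:
        return None
    return PROMOTION_ORDER[PROMOTION_ORDER.index(current) + 1]
```

For example, `next_environment("qa", {"verified"})` gives `"staging"`, while an unlabeled build in QA stays where it is.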

Page 9: 2012 - A Release Odyssey

We also had a team write a feature flagging system, like the cool kids use, so we could launch features dark and then enable them later. We made the rule that all new features must be launched dark.
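
For anyone who hasn't seen one, a feature flag check looks roughly like the sketch below. It is a minimal in-memory illustration, not the system our team wrote: the `FeatureFlags` class, the `new_review_widget` flag, and `render_review_page` are all made-up names.

```python
class FeatureFlags:
    """Minimal in-memory flag store; every flag is off unless enabled."""

    def __init__(self, enabled=None):
        self._enabled = set(enabled or [])

    def is_enabled(self, name):
        return name in self._enabled

    def enable(self, name):
        self._enabled.add(name)


def render_review_page(flags):
    """Ship the new code path dark: it only runs once the flag is on."""
    if flags.is_enabled("new_review_widget"):
        return "new widget"
    return "classic widget"


flags = FeatureFlags()             # release goes out with the feature dark
before = render_review_page(flags)
flags.enable("new_review_widget")  # turned on later, with no new deploy
after = render_review_page(flags)
```

Launching dark separates "the code is in production" from "customers can see it", which is what makes a fast release cadence tolerable for big features.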

Page 10: 2012 - A Release Odyssey

We couldn’t fix a couple things in time. Our Solr indexes are 20 GB, and reindexing and distributing them while doing a zero downtime deployment and keeping replication lag down needed more engineering. And our build and deploy system was pretty bad. It’s buzzword compliant - svn, TeamCity, maven, yum, puppet, rundeck, noah - but it’s actually a bit of a spaghetti mess in a big crufty bash framework; builds take more than an hour and deploys take 3+ hours.

Page 11: 2012 - A Release Odyssey

We got a delay of game due to our IPO and then were “no go” March 1. We were under a lot of management pressure to ship, but tests weren’t passing and at the new go/no-go meeting the dev managers sucked it up and declared “no go.”

Page 12: 2012 - A Release Odyssey

First biweekly release - PRR 5.2 went out on March 6, 5 days late. 5 issues were reported by customers. 5.3 went out March 22, 1 issue reported. 5.4 went out April 5, zero issues reported. I kept in-depth release metrics - number of checkins, number of process faults, number of support tickets - and they showed consistent improvement.

Page 13: 2012 - A Release Odyssey

It took a lot of collaboration and good old fashioned project management. Product, QA, DevOps, various engineering teams, Support, and other stakeholders all had to get on the same page. We didn’t really change tooling besides adding the feature flagging - still Confluence, JIRA, and all our other tools - just using them more effectively.

http://www.flickr.com/photos/senorwences/2366892425/

Page 14: 2012 - A Release Odyssey

And the release train kept spinning. We had one major disaster on May 17, when a major architectural change to our product feeds went out in a release and generated 28 client reported issues (up from a nice rolling average of 0.5). We enhanced our process to link each svn checkin to a ticket, put together a page requiring per-ticket signoff for the release, and started tracking more quality metrics. This got us consistently smooth releases through the summer of 2012.
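
Linking checkins to tickets is easy to check mechanically. The sketch below assumes JIRA-style keys such as `PRR-1234` appear in the commit message; that key format and both helper names are assumptions for illustration, not a description of the actual hook or report we used.

```python
import re

# Assumed JIRA-style ticket key: uppercase project code, hyphen, number.
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def referenced_tickets(message):
    """Return every ticket key mentioned in a commit message."""
    return TICKET_PATTERN.findall(message)


def unlinked_commits(messages):
    """Return the commit messages that mention no ticket at all."""
    return [m for m in messages if not referenced_tickets(m)]
```

Running `unlinked_commits` over the checkins on a release branch gives the list of changes nobody can sign off on, which is exactly the list you want to be empty before go/no-go.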

Page 15: 2012 - A Release Odyssey

But we weren’t done there. We wanted to totally pwn the old way, and the next step was weekly releases. There were still some parts of the process that were manual and painful, and we were still having some “misses” causing production issues. “If it’s painful, do it more often” is a message that some folks still balk at, but it is absolutely true.

Page 16: 2012 - A Release Odyssey

This was a lot easier - the QA team worked in the background to get the test coverage numbers up and then we said to the teams, “We’re going weekly in two weeks... Same process otherwise.” Version 6.7 launched on September 27, a week after 6.6. Client reported issues stemming from a code release have averaged around zero since that time. Solr index distribution was automated; the indexes get regenerated beforehand, shipped out to the data centers, brought up to date, and then swapped in during releases. Solr reindexing automation went live October 18, 2012. Then we trained the developers to take over the release process. We skipped some releases during Black Friday, but are shipping PRR 9.0 this week (mostly in our absence!).

Page 17: 2012 - A Release Odyssey

As I mentioned, our build and deployment are already automated (somewhat sketchily) with TeamCity, puppet, Rundeck, and noah. Our next step in killing off the old way is in progress: renovating our build system - moving to git with gerrit for code reviewing, and upgrading our TeamCity installation so it can be API controlled - and fixing the crappy CIT tests that have been languishing there. We currently have trouble with failing CIT because we don’t block people on it, and we don’t block people because the failures are intermittent. We’ll get build and CIT running fast (currently a 1 hour build and a 40 minute CIT run).

Page 18: 2012 - A Release Odyssey

After that we will get rid of the bash-spaghetti deployment system we have and make deploys faster and better (currently 3 hours). We’re removing the separate staging roll (staging = production because it’s client facing) and going to continuous deployment off trunk to our QA system. Some of this is technology-faster and some is process-faster - having to promote up four environments, when it takes 4 hours per, and when staging and production have to happen in maintenance windows, is slow.

Page 19: 2012 - A Release Odyssey

And eventually... Continuous deployment. The cloud kids get to start there, but it takes some heavy lifting to get a large, established system there. But that’s the sequel, 2013: A Release Odyssey.

Page 20: 2012 - A Release Odyssey

And that’s my story! Hit me up at theagileadmin.com. And thanks to 2001: A Space Odyssey for all the screen caps I used as part of this presentation.

