Scaling Up Lookout

Posted on 13-Jan-2015

Description

Scaling Up Lookout was originally presented at Lookout's Scaling for Mobile event on July 25, 2013. R. Tyler Croy is a Senior Software Engineer at Lookout, Inc. Lookout has grown immensely in the last year: we've doubled the size of the company, added more than 80 engineers to the team, support 45+ million users, have over 1000 machines in production, and see over 125,000 QPS and more than 2.6 billion requests/month. Our analysts use Hadoop, Hive, and MySQL to interactively manipulate multibillion-row tables. With that, there are bound to be some growing pains and lessons learned.

Transcript

Scaling Up Lookout

R. Tyler Croy (github.com/rtyler)

Hello everybody, welcome to Lookout! I'm excited to be up here talking about one of my favorite subjects, scaling.

Not just scaling in a technical sense, but scaling *everything*. Scaling people, scaling projects, scaling services, scaling hardware, everything needs to scale up as your company grows, and I'm going to talk about what we've been doing here.

First, I should talk about ->

this guy

Who I am.

- I've spoken a lot before about continuous deployment and automation, generally via Jenkins. As part of the Jenkins community, I help run the project infrastructure and pitch in as the marketing events coordinator, cheerleader, blogger, and anything else that Kohsuke (the founder) doesn't want to do.

Prior to Lookout, I worked almost entirely on consumer web applications: not in a controllers-and-views sense, but rather building out backend services and APIs to help handle growth.

At Lookout, I've worked a lot on the Platform and Infrastructure team, before being promoted, or demoted depending on how you look at it, to the Engineering Lead for ->

OMG Serious Business

The Lookout for Business team

I could easily talk for over 30 minutes about some of the challenges that building business products presents, but suffice it to say, it's chock-full of tough problems to be solved.

Not many companies grow to the point where they're building out multiple product lines and revenue streams, but at Lookout we've now got Consumer, Data Platform, and now Business projects underway.

It's pretty exciting, but not what I want to talk about.

Let's start by ->

Let's travel back in time

Talking about the past at Lookout. I've been here for a couple years now, so my timeline starts in ->

2011

2011

In the olden days, we did things pretty differently, in almost all aspects. I joined as the sixth member of the server engineering team, a group that has 20-30 engineers today.

Coming in with a background in continuous deployment, the first thing that caught my eye was ->

release process

Our release process was like running a gauntlet every couple of weeks, and maybe we'd ship at the end of those two weeks, maybe not. It was terribly error-prone and really wasn't that great.

James ran the numbers for me at one point, and during this time period he found that ->

36% of deployments failed

This means that 1/3 of the time when we would try to deploy code into production, something would go wrong and we would have to rollback the deploy and find out what went wrong.

Unfortunately, since it took us two or more weeks to get the release out, we had on average ->

68 commits per deployment

68 commits per deployment, so one or more commits out of 68 could have caused the failure.

After a rollback, we'd have to sift through all those commits and find the bug, fix it and then re-deploy.

Because of this ->

62% of deployments slipped

About 2/3rds of our deployments slipped their planned deployment dates. As an engineering organization, we couldn't really tell the product owner when changes were going to be live for customers with *any* confidence!

There were a myriad of reasons for these problems, including:

- lack of test automation (tests existed, but they weren't running reliably; we were using Bitten with practically zero developer feedback)
- a painful deployment process

To make things more difficult, all our back-end application code was in a ->

monorails

A monolithic Rails application. It served its purpose while the company was bootstrapping itself, but it was starting to show its age and prove challenging with more and more developers interacting with the repository.

The team was at an interesting junction during this time: problems with the way things were done were readily acknowledged, but the bandwidth and buy-in to fix them were difficult to come by.

I think every startup that grows from 20 to 100 people goes through this phase where it is in denial of its own growing pains.

As more people joined the team, we pushed past the denial though and started working on ->

Scaling the Workflow

Scaling the workflow. Our two-ish week release cycle was first on the chopping block; we started with what became known as ->

The Burgess Challenge

The Burgess Challenge. While having beers one night with James and Dave, the server team lead, James asked if we could fix our release process and get from two-ish week deployments to *daily* deployments, in ->

60 days

60 days. This was right at the end of the year. With the Thanksgiving and Christmas breaks coming up, we had some slack in the product pipeline, so we decided to take the project on and enter 2012 as a different engineering org from the one that left 2011.

We started the process by bringing in some ->

New Tools

New tools, starting with ->

JIRA

JIRA. While I could rant about how much I hate JIRA, I think it's a better tool than Pivotal Tracker was for us. Pivotal Tracker worked well when the team and the backlog were much smaller and less inter-dependent than they were in late 2011.

Another tool we introduced was ->

Jenkins

Jenkins.

- Talk about the amount of work just to get tests passing *consistently* in Jenkins
- Big change in developer feedback on test runs compared to previously

We also moved our code from Subversion into ->

Git + Gerrit

Git and Gerrit, Gerrit being a fantastic Git-based code-review tool. At the time the security team was already using GitHub:Firewall for their work. We discussed at great length whether the vanilla GitHub branch, pull request, merge process would be sufficient for our needs and whether or not a "second tool" like Gerrit would provide any value.

I could, and have in the past, given entire presentations on the benefits of the Gerrit-based workflow, so I'll try to condense as much as possible into this slide of our new code workflow ->

describe the new workflow, comparing it to the previous SVN based one (giant commits, loose reviews, etc)

With Jenkins in the mix, our fancy Gerrit workflow had the added value of ensuring all our commits passed tests before even entering the main tree.

We were doing a much better job of consistently getting higher quality code into the repository, but we still couldn't get it to production easily.

Next on the fix-it-list was ->

The Release Process

The release process itself.

At the time our release process was a mix of manual steps and Capistrano tasks:

- Automation through Jenkins (a rough sketch of a Jenkins-driven Capistrano task follows below)
- Consistency with stages (no more update_faithful)
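To make "automation through Jenkins" a bit more concrete, here is a rough, hypothetical sketch of the kind of Capistrano (2.x-style) task a Jenkins job can drive instead of a human walking a runbook. This is not Lookout's actual deploy code; the application name, repository, roles, hosts, and restart command are all made up.

```ruby
# config/deploy.rb -- a minimal sketch, not the real deploy scripts.
# Application name, repository, roles, hosts, and the restart command
# are hypothetical.
set :application, "example-app"
set :repository,  "ssh://git.example.com/example-app.git"
set :deploy_to,   "/srv/example-app"

role :app, "app01.example.com", "app02.example.com"

namespace :deploy do
  desc "Restart the application servers once the new release is symlinked"
  task :restart, :roles => :app do
    run "sudo service example-app restart"
  end
end

# A Jenkins job then runs `cap deploy`, so every release walks the same
# update -> symlink -> restart stages instead of ad-hoc manual steps.
```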

We've managed to change the entire engineering organization such that ->

2% of deployments failed

14 commits per deployment

3% of deployments slipped

neat

Automating Internal Tooling

- Introducing OpenStack to provide developer-accessible internal VM management
- Managing Jenkins build slaves via Puppet
- Introduction of MI Stages

OpenStack

If you're going to use a pre-tested commit workflow with an active engineering organization such as ours, make sure you plan ahead and have plenty of hardware, or virtualized hardware, for Jenkins.

We've started to invest in OpenStack infrastructure and the jclouds plugin for provisioning hosts to run all our jobs on.

With over 100 build slaves now, we also had to make sure we had ->

Automated Build Slaves

Automated the management of those build slaves; nobody has time to hand-craft hundreds of machines and ensure that they're consistent. Additionally, we didn't want to waste developer time playing the "it's probably the machine's fault" game every time a test failed.

Per-Developer Test Instances

Scaling the People

Not much to say here; every company is going to be different, but you can't just ignore that there are social and cultural challenges in taking a small engineering team and growing it to 100+ people.

- Transition from talking about the workflow to the tech stack

Scaling the Tech Stack

With regard to scaling the technical stack, I'm not going to spend too much time on this since the other people here tonight will speak to it in more detail than I probably should get into, but there are some major highlights from a server engineering standpoint.

Starting with the databases ->

Shard the Love

- Global Derpbase woes
- Moving more and more data out of non-sharded tables (shard routing sketched below)
- Experimenting with various connection pooling mechanisms (worth mentioning?)
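A hedged, minimal Ruby sketch of hash-based shard routing; this isn't Lookout's actual sharding layer, and the shard names and the choice of CRC32 over the user id are illustrative assumptions:

```ruby
# A minimal sketch of hash-based shard routing; not the real sharding
# layer. Shard names and the hashing choice are illustrative only.
require "zlib"

class ShardRouter
  SHARDS = %w[users_shard_0 users_shard_1 users_shard_2 users_shard_3].freeze

  # Map a stable key (e.g. a user id) onto one of the shards so the same
  # user always lands on the same database.
  def self.shard_for(user_id)
    SHARDS[Zlib.crc32(user_id.to_s) % SHARDS.size]
  end
end

ShardRouter.shard_for(42)  # always returns the same shard for user 42
```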

Undoing Rails Katamari

- Diagnosing a big ball of mud
- Migrating code onto the first service (Pushcart); one common extraction pattern is sketched below
- Slowly extracting more and more code from monorails, a project which is ongoing
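One common way to peel functionality out of a monolith, sketched here with hypothetical names, paths, and URLs (this is not the actual Pushcart code), is to hide the extracted service behind a thin client class so callers inside the Rails app keep the same interface while the implementation moves across the network.

```ruby
# A minimal sketch of fronting an extracted service with a thin HTTP
# client; class name, path, payload, and service URL are hypothetical.
require "net/http"
require "json"
require "uri"

class PushServiceClient
  def initialize(base_url = "http://push.internal.example.com")
    @base_uri = URI(base_url)
  end

  # Callers in the monolith use this just like the old in-process code;
  # only the implementation has crossed the network.
  def enqueue(device_id, message)
    http = Net::HTTP.new(@base_uri.host, @base_uri.port)
    request = Net::HTTP::Post.new("/notifications",
                                  "Content-Type" => "application/json")
    request.body = { device_id: device_id, message: message }.to_json
    JSON.parse(http.request(request).body)
  end
end
```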

Modern JavaScript

I never thought this would have a big impact on scaling the technical stack, but modernizing our front-end applications has helped tremendously

The JavaScript community has changed tremendously since the company was founded; the ecosystem is much more mature and the web in general has changed.

By rebuilding front-end code as single-page JavaScript applications (read: Backbone, etc.), we are able to reduce complexity tremendously on the backend by turning everything into more or less JSON API services.
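As a hedged illustration of what "more or less JSON API services" can look like on the backend, here is a tiny Sinatra endpoint with a made-up route and payload; it is not one of Lookout's actual services.

```ruby
# A minimal sketch of a JSON endpoint a single-page Backbone app could
# consume; the route and the payload fields are hypothetical.
require "sinatra"
require "json"

get "/api/devices/:id" do
  content_type :json
  # In a real service this would be looked up in a database or another service.
  { id: params[:id], platform: "android", status: "protected" }.to_json
end
```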

Infinity and Beyond

The future at Lookout is going to be very interesting, both technically and otherwise.

On the technical side of the things we're seeing more of a ->

Diversifying the technical portfolio

Diversified technical portfolio. Before the year is out, we'll have services running in Java, Ruby, and even Node.

To support more varied services, we're getting much more friendly ->

Hello JVM

with the JVM, either via JRuby or other JVM-based languages. More things are being developed for and deployed on top of the JVM, which offers some interesting opportunities to change our workflow further with things like:

- Remote debugging
- Live profiling
- Better parallelism (a small JRuby sketch follows below)
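As a small, hedged illustration of the parallelism point: under JRuby, Ruby threads are real JVM threads and Java classes are callable directly from Ruby. The workload below is a stand-in, not Lookout code.

```ruby
# A minimal JRuby sketch: Ruby threads map onto native JVM threads, so
# CPU-bound work can actually run in parallel. The workload is made up.
require "java"

threads = (1..4).map do |n|
  Thread.new do
    # Stand-in for real work; each block runs on its own JVM thread.
    (1..1_000_000).reduce(:+) * n
  end
end

puts threads.map(&:value).inspect

# Java interop is available directly from Ruby code on the JVM:
puts java.lang.Runtime.getRuntime.availableProcessors
```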

With an increasingly diverse technical stack and a stratified services architecture, we're going to be faced with the technical and organizational challenges of operating ->

100 services

100 services at once.

When a team which owns a service is across the office, or across the country, what does that mean for clearly expressing service dependencies, contracts, and interactions on an ongoing basis?

With all these services floating around, how do we maintain our ->

Institutional Knowledge

Institutional knowledge amongst the engineering team

Growth means the size of our infrastructure exceeds the mental capacity of any single engineer to understand each component in detail.

We're not alone in this adventure, we have much to learn from companies like Amazon, or Netflix, who have traveled this path before.

I wish I could say that the hard work is over, and that it's just smooth sailing and printing money from here on out, but that's not true.

There's still a lot of hard work to be done, and difficult problems to talk about as we move into a much more service-oriented, and multi-product architecture.

I'd like to ->

Thank you

Thank you for your time, if you have any questions for me, I'll be sticking around afterwards.

Thank you