Building an experimentation framework

Post on 06-May-2015

description

OSCON talk on building a simple but powerful framework for feature ramp ups, A/B and multivariate testing, and other types of experiments in web apps.

transcript

Building an experimentation framework for web apps

Zhi-Da Zhong

zz@etsy.com

Tuesday, July 26, 2011

About the talk

Why

What

Framework

Break / hack

Tech Details

Test design

Analysis

Why?

Questions

“What will happen if I do X”?

“Is X better than Y?”

The future & alternate universes

(We’re bad at those.)

Then what?

Experiments

Try it out.

Data beats speculation.

Try different alternatives on different people.

Which is better?

vs.

Not a great experiment

Web apps

Front end experiments

• Layout, colors, images, copy, ...

• No functional changes

• Impact can be surprisingly high

A little more complex...

• Multipage flows

• Functionality changes

Backend experiments

• Why not?

• Algorithms, architectures, batch processes, ...

The Etsy search backend

• New algorithm

• New RPC protocol

• New result data structure

• New Solr trunk snapshot

[Diagram: the web app's search() calls searchA() on search cluster A or searchB() on search cluster B]

DB re-architecture

• Postgres => Sharded MySQL

• Multiple experiments

Whole new features

New pages + New DB tables + New batch jobs + ...

Not just 2 variants

• A/B/C... tests

• Multivariate tests

Caveats

• Content not under your control

• Price tests?

• Hard-to-measure/quantify things

• Long term impact?

Other tests

• Internal users testing

• Whitelisted user testing

Opt-in experiments

Complementary techniques

• Observed/recorded testing

- show different people the same thing

• Side-by-side testing

- show each person 2 alternatives

Side by side testing

How

A common approach

• JS-based

• Non-techie UI

• “No IT!”

• “Designed For Marketers, By Marketers”

Our approach

• The developer is the user

• Code as configuration

• An integral part of the dev process

Developer as the user

• The builder of the feature writes the test

• Not just a marketing tool

Code as config

• Simplicity

• Expressivity

• Quality

• Version => complete system state

• Revision history

Part of the dev process

Every change is an experiment!

What does it look like?

Default => Experiment => (new) Default

To add a new feature...

+ $config['new_search'] = array(
+     'enabled' => 'off'
+ );

  function search() {
+     if ($cfg->isEnabled('new_search')) {
+         return do_new_search();
+     }
      // existing stuff
  }

Deploy that

Now we go crazy...

function do_new_search() {
    // exciting new stuff
    // that might or might not work
    // but we can deploy it anyway
    // since it’s flagged off
}

Internal user testing

  $config['new_search'] = array(
+     'enabled' => 'rampup',
+     'rampup' => array(
+         'admin' => true
  ));

Whitelists

  $config['new_search'] = array(
      'enabled' => 'rampup',
      'rampup' => array(
+         'whitelist' => array('zhida'),
          'admin' => true
  ));

Opt-in experiments

  $config['new_search'] = array(
      'enabled' => 'rampup',
      'rampup' => array(
+         'group' => 12345,
          'admin' => true
  ));

A/B

  $config['new_search'] = array(
      'enabled' => 'rampup',
      'rampup' => array(
+         'percent' => 1.5,
          'admin' => true
  ));

If it works...

  $config['new_search'] = array(
+     'enabled' => 'on'
  );

Order matters

Whitelist / Blacklist > Internal > Opt-in > Random
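
A minimal PHP sketch of how a config check might honor that ordering. The function name is_enabled, the 'blacklist' key, and the shape of the $user array are assumptions made for illustration. Only the 'enabled', 'rampup', 'whitelist', 'admin', 'group', and 'percent' keys appear on the slides, and the md5-based bucket is a stand-in for whatever randomization the real framework uses.

```php
<?php
// Hypothetical helper: evaluates a rampup config in priority order.
function is_enabled(array $config, string $feature, array $user): bool
{
    $entry = $config[$feature] ?? array();
    $mode = $entry['enabled'] ?? 'off';
    if ($mode === 'on') {
        return true;
    }
    if ($mode !== 'rampup') {
        return false;
    }
    $rampup = $entry['rampup'] ?? array();

    // 1. Whitelist / blacklist: explicit names win over everything else.
    //    ('blacklist' is an assumed key; the slides only show 'whitelist'.)
    if (in_array($user['name'], $rampup['blacklist'] ?? array(), true)) {
        return false;
    }
    if (in_array($user['name'], $rampup['whitelist'] ?? array(), true)) {
        return true;
    }
    // 2. Internal users.
    if (!empty($rampup['admin']) && !empty($user['is_admin'])) {
        return true;
    }
    // 3. Opt-in: members of the configured group.
    if (isset($rampup['group'])
        && in_array($rampup['group'], $user['groups'] ?? array(), true)) {
        return true;
    }
    // 4. Random: a stable pseudo-random bucket in [0, 100) per feature and user.
    if (isset($rampup['percent'])) {
        $hash = hexdec(substr(md5($feature . ':' . $user['name']), 0, 8));
        $bucket = $hash / 0x100000000 * 100;
        return $bucket < $rampup['percent'];
    }
    return false;
}
```

Because the explicit lists are checked first, a blacklisted staff member stays out of the test even when 'admin' is true.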

The framework

As easy as...

1. Pick a variant

2. Do what it says

3. Log the event

What's in a test?

Variants

• Key-value pairs

- interpreted by the app

• Name

- mostly for logging

SubjectIdProvider

• Why?

- hashing and other selectors

- logging

• Types of subjects

- Users...but not always

- Different groups of users - sellers vs buyers, etc.

- Different ways to identify them - signed in vs signed out

function getID()
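
One way the provider idea could look in PHP, as a sketch only. SubjectIdProvider and getID() are named on the slide. The two implementations, the 'user_id' and 'browser_id' keys, and the first_subject_id() helper are hypothetical.

```php
<?php
interface SubjectIdProvider
{
    /** @return string|null a stable subject ID, or null if this provider has none */
    public function getID(): ?string;
}

// Hypothetical provider for signed-in users.
final class SignedInUserIdProvider implements SubjectIdProvider
{
    private $session;

    public function __construct(array $session)
    {
        $this->session = $session;
    }

    public function getID(): ?string
    {
        return isset($this->session['user_id'])
            ? 'user:' . $this->session['user_id']
            : null;
    }
}

// Hypothetical provider for signed-out visitors, keyed on a browser cookie.
final class BrowserIdProvider implements SubjectIdProvider
{
    private $cookies;

    public function __construct(array $cookies)
    {
        $this->cookies = $cookies;
    }

    public function getID(): ?string
    {
        return isset($this->cookies['browser_id'])
            ? 'browser:' . $this->cookies['browser_id']
            : null;
    }
}

// Returns the first ID any provider can supply, in the order given.
function first_subject_id(array $providers): ?string
{
    foreach ($providers as $provider) {
        $id = $provider->getID();
        if ($id !== null) {
            return $id;
        }
    }
    return null;
}
```

Prefixing the ID with its type keeps a signed-in user and a browser cookie from colliding when both feed the same selector or logger.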

Selectors

function select($subjectID) => Variant Name

Combining multiple selectors

• OR

- breaks blacklists

• AND

- breaks whitelists

• Sequence

- works!

Selector sequence

• Defines an ordering

• Returns A/B/C/... or <don't care>
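
A sketch of a selector sequence under stated assumptions: the select() signature follows the slide, null stands in for "don't care", and the class names ListSelector, FixedSelector, and SelectorSequence are invented for this example.

```php
<?php
interface Selector
{
    /** @return string|null a variant name, or null for "don't care" */
    public function select(string $subjectID): ?string;
}

// Hypothetical: assigns a fixed variant to listed subjects (white- or blacklist).
final class ListSelector implements Selector
{
    private $subjectIDs;
    private $variantName;

    public function __construct(array $subjectIDs, string $variantName)
    {
        $this->subjectIDs = $subjectIDs;
        $this->variantName = $variantName;
    }

    public function select(string $subjectID): ?string
    {
        return in_array($subjectID, $this->subjectIDs, true) ? $this->variantName : null;
    }
}

// Hypothetical: always answers, so it works as the last step of a sequence.
final class FixedSelector implements Selector
{
    private $variantName;

    public function __construct(string $variantName)
    {
        $this->variantName = $variantName;
    }

    public function select(string $subjectID): ?string
    {
        return $this->variantName;
    }
}

// Asks each selector in order; the first one with an opinion wins.
final class SelectorSequence implements Selector
{
    private $selectors;

    public function __construct(array $selectors)
    {
        $this->selectors = $selectors;
    }

    public function select(string $subjectID): ?string
    {
        foreach ($this->selectors as $selector) {
            $variant = $selector->select($subjectID);
            if ($variant !== null) {
                return $variant;
            }
        }
        return null;
    }
}

$sequence = new SelectorSequence(array(
    new ListSelector(array('mallory'), 'A'),        // blacklist: keep on the default
    new ListSelector(array('zhida', 'mallory'), 'B'), // whitelist: force the new variant
    new FixedSelector('A'),                         // everyone else
));
```

Here 'mallory' appears on both lists and still gets A, because the blacklist comes first in the sequence.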

Loggers

function log($testKey, $variantKey, $subjectKey)
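
The three steps (pick a variant, do what it says, log the event) could be tied together roughly as follows. The log() signature is the one on the slide. The Logger interface name, ArrayLogger, and run_test() are hypothetical, and the selector is passed as a plain callable to keep the sketch short.

```php
<?php
interface Logger
{
    public function log(string $testKey, string $variantKey, string $subjectKey): void;
}

// Hypothetical in-memory logger, useful for unit tests.
final class ArrayLogger implements Logger
{
    public $events = array();

    public function log(string $testKey, string $variantKey, string $subjectKey): void
    {
        $this->events[] = array(
            'test' => $testKey,
            'variant' => $variantKey,
            'subject' => $subjectKey,
        );
    }
}

// $variants maps variant name => key-value pairs that the app interprets.
function run_test(string $testKey, array $variants, callable $select,
                  string $subjectKey, Logger $logger): array
{
    // 1. Pick a variant.
    $variantKey = $select($subjectKey);
    // 2. Do what it says: hand the variant's properties back to the app.
    $properties = $variants[$variantKey];
    // 3. Log the event.
    $logger->log($testKey, $variantKey, $subjectKey);
    return $properties;
}

$variants = array(
    'A' => array('use_new_search' => false),
    'B' => array('use_new_search' => true),
);
$logger = new ArrayLogger();
$alwaysB = function (string $subjectKey): string {
    return 'B';
};
$properties = run_test('search_test', $variants, $alwaysB, 'user:42', $logger);
```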

More => better

• More data

• More ways to track

- access logs

- 3P analytics

- custom

Access log augmentation

• Apache note

• Lots of log analysis tools

- grep

- $$
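
A possible shape for the Apache note approach, offered as a sketch. PHP's apache_note() is only available when PHP runs as an Apache module, so the call is guarded. The note name 'ab', the "test.variant" value format, and the note_variant() helper are assumptions. On the Apache side, a LogFormat directive can include the note with %{ab}n.

```php
<?php
// Hypothetical helper: records the chosen variant as an Apache request note
// so that it can be written to the access log.
function note_variant(string $testKey, string $variantKey): string
{
    $value = $testKey . '.' . $variantKey;
    // apache_note() exists only under the Apache SAPI; skip it elsewhere (e.g. CLI).
    if (function_exists('apache_note')) {
        apache_note('ab', $value);
    }
    return $value;
}

$noted = note_variant('search_test', 'variantC');
```

A single note holds one value, so a request that takes part in several tests would need one note per test or a combined value.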

3P Analytics

• Quick to start

• May be cheap

• Volume?

• Lag time?

• Flexibility / customization?

3P Analytics - how

• Custom variables

- take note of number & size limits

• Custom segments

• Canned metrics

3P Analytics - example

<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-1234567-8");
pageTracker._initData();
pageTracker._setCustomVar(2, "AB", "search_test.variantC", 3);
pageTracker._trackPageview();
</script>

Our own event tracking

• HTML beacons

• Hadoop

• Cloud

[Diagram components: web app (HTML, JS), event beacon, event log, Hadoop, results]

Break / hack

https://github.com/etsy/ab

Building on top of the core API

Test builders

• Capture common patterns

- feature ramp ups

- opt-in experiments

• Help with test design

- weight equalization

- multivariate testing

Automatic Dispatchers

• Separate dispatching and work

• Work with components that have well-defined invocation APIs

• Define a particular level of granularity

• Feel like magic

Dispatcher example - MVC

• View dispatch

• Controller dispatch

• Spring framework, etc.

Selector Registry

• Reuse

• Clarity

• Documentation

$selectorReg = array(
    'staff' => 'InternalUserSelector',
    'whitelist' => 'WhitelistSelector',
    'percent' => 'WeightedSelector'
);

Randomized Selector

What does it mean?

• Independent of subject attributes

• Independent of other tests

• Independent of (coarse-grained) time

Persistence

• Better experience

• Better data

• Multi-part tests

• ...but not forever

Ramping up/down

• Vary group sizes

• Reduce risk

• Distribute load

Persistence + Ramping

• Minimize inconsistency

• Ramping up

- Should just add people to the treatment group

• Ramping down

- Should just remove part of the treatment group

rand()

• Explicit persistence

• Cookie

• DB

• Scaling

• Maintenance

Hashing

variant = H(id)

Persistence

Test independence?

Attribute independence

Hashing

variant = H(test id, id)

Persistence

Attribute independence

Test independence

What else? Weights!

Hashing

h = H(test id, id)

variant = P(h, weights)

Persistence Attribute independence Test independence

Partitioning

[Diagram: the hash maps each subject to a point between 0 and 1; a partition at .5 splits that range into variants A and B]

Ramping up

[Diagram: the partition moves from .5 to .7, so A grows and B shrinks]
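
The hash-then-partition idea, h = H(test id, id) followed by variant = P(h, weights), might be sketched in PHP as below. The function names hash_point() and partition() are hypothetical, and md5 is only one candidate hash; the talk does not say which function the framework uses.

```php
<?php
// h = H(test id, id): a stable point in [0, 1) for this test and subject.
function hash_point(string $testId, string $subjectId): float
{
    // Use the first 32 bits of the digest as an integer.
    $first32Bits = hexdec(substr(md5($testId . ':' . $subjectId), 0, 8));
    return $first32Bits / 0x100000000;
}

// variant = P(h, weights): walk the cumulative weights until h falls inside.
function partition(float $h, array $weights): string
{
    $total = array_sum($weights);
    $upperBound = 0.0;
    $variant = '';
    foreach ($weights as $variant => $weight) {
        $upperBound += $weight / $total;
        if ($h < $upperBound) {
            return $variant;
        }
    }
    return $variant; // rounding guard: fall back to the last variant
}

$h = hash_point('new_search', 'user:42');
$before = partition($h, array('A' => 50, 'B' => 50)); // partition at .5
$after = partition($h, array('A' => 70, 'B' => 30));  // partition at .7
```

Since h never changes for a given test and subject, moving the partition from .5 to .7 can only move subjects from B to A. Nobody who was already in A is reassigned, which is the persistence property the ramping slides ask for.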

Which hash function?

• MD5/SHA-256/...

• Test it!

• But be careful...
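
One way to "test it" is to measure how a candidate hash spreads realistic subject IDs, both within a test and across two tests. The sketch below is an assumption-laden illustration: the helper names, the "user0", "user1", ... ID scheme, and the md5 candidate are all invented for the example.

```php
<?php
// Share of subjects whose hash point falls in [0, 0.5) for one test.
function share_in_first_half(callable $hash, string $testId, int $subjects): float
{
    $hits = 0;
    for ($i = 0; $i < $subjects; $i++) {
        if ($hash($testId, "user$i") < 0.5) {
            $hits++;
        }
    }
    return $hits / $subjects;
}

// Share of subjects that land in [0, 0.5) for BOTH tests.
// If the tests are independent this should be close to 0.25.
function share_in_first_half_of_both(callable $hash, string $testA,
                                     string $testB, int $subjects): float
{
    $hits = 0;
    for ($i = 0; $i < $subjects; $i++) {
        if ($hash($testA, "user$i") < 0.5 && $hash($testB, "user$i") < 0.5) {
            $hits++;
        }
    }
    return $hits / $subjects;
}

// Candidate under test: first 32 bits of md5, scaled to [0, 1).
$md5Point = function (string $testId, string $subjectId): float {
    return hexdec(substr(md5($testId . ':' . $subjectId), 0, 8)) / 0x100000000;
};

$single = share_in_first_half($md5Point, 'search_test', 10000);
$joint = share_in_first_half_of_both($md5Point, 'search_test', 'checkout_test', 10000);
printf("single test: %.3f, both tests: %.3f\n", $single, $joint);
```

A share far from 0.5 for a single test, or far from 0.25 for the pair, would suggest the hash is skewed or that assignments in one test leak into another.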

A/B + opt-in

• Need to separate the groups for analysis

• Solution: use more than 2 variants!

• Act according to variant properties

• Track by variant name
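
A sketch of the "more than 2 variants" idea, with every name here ('control', 'treatment', 'opt_in', 'use_new_search', pick_variant()) invented for illustration. The opt-in and treatment variants carry the same properties, so the app behaves identically for both, while the distinct names keep the two groups apart in the logs. crc32 is used only as a short stand-in for the random split.

```php
<?php
// Variant name => properties. The app acts on properties, analysis uses names.
$variants = array(
    'control' => array('use_new_search' => false),
    'treatment' => array('use_new_search' => true),
    'opt_in' => array('use_new_search' => true),
);

function pick_variant(string $subjectId, array $optInIds): string
{
    // Opt-in users chose the feature themselves, so they get their own variant.
    if (in_array($subjectId, $optInIds, true)) {
        return 'opt_in';
    }
    // Everyone else is split between control and treatment.
    return (crc32('new_search:' . $subjectId) % 2 === 0) ? 'control' : 'treatment';
}

$optInIds = array('user:7');
$variantName = pick_variant('user:7', $optInIds);
$useNewSearch = $variants[$variantName]['use_new_search'];
```

Analysis can then compare 'control' against 'treatment' without the self-selected 'opt_in' users skewing either group.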

Analysis

... Confidence interval ... something something ... Binomial ... blah blah ...

Confidence Intervals

• How sure are we?

• What if it were random?

Binomial experiments

H T H T T T H T H H

T H T H T T H H T H
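
To put a number on "how sure are we?", one common choice is the normal-approximation (Wald) interval for a binomial proportion. The talk does not say which method it uses, so treat this as a swapped-in illustration; proportion_interval() is a hypothetical helper and 1.96 is the usual z-value for roughly 95% confidence.

```php
<?php
// Normal-approximation confidence interval for a proportion.
// Rough for small samples; shown here only to illustrate the idea.
function proportion_interval(int $successes, int $trials, float $z = 1.96): array
{
    $p = $successes / $trials;
    $margin = $z * sqrt($p * (1 - $p) / $trials);
    return array(max(0.0, $p - $margin), min(1.0, $p + $margin));
}

// Each row of coin flips on the slide has 5 heads out of 10.
list($low, $high) = proportion_interval(5, 10);

// The same observed rate with 100x the data gives a much narrower interval.
list($lowBig, $highBig) = proportion_interval(500, 1000);
```

With 10 flips the interval runs from about 0.19 to 0.81, so it says very little about the coin. With 1000 flips at the same rate it narrows to about 0.47 to 0.53.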

Results

Dashboards

A few test design tips

Whatʼs the question?

What metrics?

How much better?

Who?

• Different roles

• Old vs new

• Novelty

• Habit

• Expectation

When?

• User types vary

• Activity patterns vary

• Site content might vary

• Performance might vary

• Full weeks are often a good starting point

Summary

Better living through experimentation

• More risk taking => better product

• MTTR

• Lower stress

You can too.
