Building an experimentation framework for web apps

Zhi-Da Zhong

zz@etsy.com

Tuesday, July 26, 2011

About the talk

Framework

Break / hack

Tech Details

Test design

Analysis

Questions

“What will happen if I do X”?

“Is X better than Y?”

The future & alternate universes

(We’re bad at those.)

Then what?

Experiments

Try it out.

Data beats speculation.

Experiments

Try different alternatives

on different people.

Experiments

Which is better?

Not a great experiment

Web apps

Front end experiments

• Layout, colors, images, copy, ...

• No functional changes

• Impact can be surprisingly high

A little more complex...

• Multipage flows

• Functionality changes

Backend experiments

• Why not?

• Algorithms, architectures, batch processes, ...

The Etsy search backend

• New algorithm

• New RPC protocol

• New result data structure

• New Solr trunk snapshot

[Diagram: the web app calls search(), which dispatches to searchA() on search cluster A or searchB() on search cluster B]

DB re-architecture

• Postgres => Sharded MySQL

• Multiple experiments

Whole new features

New pages + new DB tables + new batch jobs + ...

Not just 2 variants

• A/B/C... tests

• Multi-variate tests

Caveats

• Content not under your control

• Price tests?

• Hard-to-measure/quantify things

• Long term impact?

Other tests

• Internal users testing

• Whitelisted user testing

• Opt-in experiments

Complementary techniques

• Observed/recorded testing

- show different people the same thing

• Side-by-side testing

- show each person 2 alternatives

Side by side testing

A common approach

• JS-based

• Non-techie UI

• “No IT!”

• “Designed For Marketers, By Marketers”

Our approach

• The developer is the user

• Code as configuration

• An integral part of the dev process

Developer as the user

• The builder of the feature writes the test

• Not just a marketing tool

Code as config

• Simplicity

• Expressivity

• Quality

• Version => complete system state

• Revision history

Part of the dev process

Every change is an experiment!

What does it look like?

Default => Experiment => (new) Default

To add a new feature...

+ $config['new_search'] = array(
+     'enabled' => 'off'
+ );

  function search() {
+     if ($cfg->isEnabled('new_search')) {
+         return do_new_search();
+     }
      // existing stuff
  }

Deploy that

Now we go crazy...

function do_new_search() {
    // exciting new stuff
    // that might or might not work
    // but we can deploy it anyway
    // since it's flagged off
}

Internal user testing

  $config['new_search'] = array(
+     'enabled' => 'rampup',
+     'rampup' => array(
+         'admin' => true
+     )
  );

Whitelists

  $config['new_search'] = array(
      'enabled' => 'rampup',
      'rampup' => array(
+         'whitelist' => array('zhida'),
          'admin' => true
      )
  );

Opt-in experiments

+         'group' => 12345,
          'admin' => true
      )
  );

+         'percent' => 1.5,
          'admin' => true
      )
  );

If it works...

  $config['new_search'] = array(
+     'enabled' => 'on'
  );

Order matters

Whitelist / Blacklist > Internal > Opt-in > Random
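The ordering above can be sketched as a single check. This is a minimal sketch, not the talk's implementation: the helper name is_enabled(), the 'blacklist' config key, and the user array keys ('name', 'is_admin', 'groups') are all assumptions made for illustration. Only the 'enabled', 'rampup', 'whitelist', 'group', 'percent', and 'admin' keys appear on the slides.

```php
<?php
// Hypothetical sketch: evaluate a flag in the order
// whitelist / blacklist > internal > opt-in > random.
function is_enabled(string $name, array $feature, array $user): bool {
    $state = $feature['enabled'] ?? 'off';
    if ($state === 'on') { return true; }
    if ($state !== 'rampup') { return false; }

    $r = $feature['rampup'] ?? [];
    // 1. Explicit lists win over everything else ('blacklist' is an assumed key).
    if (in_array($user['name'], $r['blacklist'] ?? [], true)) { return false; }
    if (in_array($user['name'], $r['whitelist'] ?? [], true)) { return true; }
    // 2. Internal (admin) users.
    if (!empty($r['admin']) && !empty($user['is_admin'])) { return true; }
    // 3. Opt-in: members of the designated group.
    if (isset($r['group']) && in_array($r['group'], $user['groups'] ?? [], true)) { return true; }
    // 4. Random: a stable percentage of everyone else, derived from a hash
    //    so the same user gets the same answer on every request.
    if (isset($r['percent'])) {
        $bucket = hexdec(substr(md5($name . ':' . $user['name']), 0, 7)) % 10000; // 0..9999
        return $bucket < $r['percent'] * 100;
    }
    return false;
}
```

Because the checks run in sequence, a blacklisted admin stays out and a whitelisted user gets in regardless of the percentage.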

The framework

As easy as...

1. Pick a variant

2. Do what it says

3. Log the event
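The three steps can be sketched in a few lines. This is an illustration under assumptions: run_test(), the stand-in selector, and the 'algo' property are hypothetical names, not part of the talk or of any published API.

```php
<?php
// Hypothetical sketch of the three steps.
function run_test(string $testKey, string $subjectId, callable $select,
                  array $variants, callable $log): array {
    $variantName = $select($subjectId);        // 1. Pick a variant
    $props = $variants[$variantName];          // 2. The caller does what these properties say
    $log($testKey, $variantName, $subjectId);  // 3. Log the event
    return $props;
}

$events = [];
$props = run_test(
    'search_test',
    'user42',
    function (string $id): string { return 'B'; },   // stand-in selector
    ['A' => ['algo' => 'old'], 'B' => ['algo' => 'new']],
    function ($t, $v, $s) use (&$events) { $events[] = "$t.$v $s"; }
);
```

Here $props tells the application which code path to run, and $events holds the one logged line.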

What's in a test?

Variants

• Key-value pairs

• interpreted by the app

• Name

• mostly for logging

SubjectIdProvider

• Why?

• hashing and other selectors

• logging

• Types of subjects

• Users...but not always

• Different groups of users - sellers vs buyers, etc.

• Different ways to identify them - signed in vs signed out

function getID()
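A subject ID provider might look like the sketch below. Only getID() comes from the slide; the class name VisitorIdProvider, the 'u:' and 'b:' prefixes, and the idea of falling back to a browser token for signed-out visitors are assumptions for illustration.

```php
<?php
interface SubjectIdProvider {
    public function getID(): string;
}

// Hypothetical provider: signed-in users are identified by account id,
// signed-out visitors by a browser token (for example a cookie value).
final class VisitorIdProvider implements SubjectIdProvider {
    private $userId;
    private $browserToken;

    public function __construct(?int $userId, string $browserToken) {
        $this->userId = $userId;
        $this->browserToken = $browserToken;
    }

    public function getID(): string {
        // Prefix the id so the two identifier spaces can never collide.
        return $this->userId !== null
            ? 'u:' . $this->userId
            : 'b:' . $this->browserToken;
    }
}
```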

Selectors

function select($subjectID) => Variant Name

Combining multiple selectors

• OR

• breaks blacklists

• AND

• breaks whitelists

• Sequence

• works!

Selector sequence

• Defines an ordering

• Returns A/B/C/... or <don't care>
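One way to read the selector sequence: each selector either names a variant or returns null for "don't care", and the first selector with an opinion wins. The function names below are hypothetical and the sketch only illustrates the idea.

```php
<?php
// Hypothetical list-based selector: names a variant for listed ids,
// returns null ("don't care") for everyone else.
function list_selector(array $ids, string $variant): callable {
    return function (string $id) use ($ids, $variant): ?string {
        return in_array($id, $ids, true) ? $variant : null;
    };
}

// Ask each selector in order; the first non-null answer is final.
function select_in_sequence(array $selectors, string $subjectId): ?string {
    foreach ($selectors as $select) {
        $variant = $select($subjectId);
        if ($variant !== null) {
            return $variant;
        }
    }
    return null; // nobody cared
}

$selectors = [
    list_selector(['mallory'], 'A'), // acts as a blacklist: pinned to control
    list_selector(['zhida'], 'B'),   // acts as a whitelist: pinned to treatment
];
```

Unlike OR or AND, the sequence lets an earlier selector settle the answer without being overridden by a later one.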

Loggers

function log($testKey, $variantKey, $subjectKey)

More => better

• More data

• More ways to track

• access logs

• 3P analytics

• custom

Access log augmentation

• Apache note

• Lots of log analysis tools

• grep

• $$
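Logging to several places at once can be sketched as a fan-out over loggers that share the log() signature shown earlier. The class names and the note name 'ab' are assumptions. apache_note() is a real PHP function, but it exists only when PHP runs under the Apache SAPI, so the sketch guards the call.

```php
<?php
interface TestLogger {
    public function log(string $testKey, string $variantKey, string $subjectKey): void;
}

// Keeps lines in memory; stands in for a custom event logger.
final class MemoryLogger implements TestLogger {
    public $lines = [];
    public function log(string $testKey, string $variantKey, string $subjectKey): void {
        $this->lines[] = "$testKey.$variantKey $subjectKey";
    }
}

// Attaches the variant to the current request as an Apache note so it can be
// written to the access log (for example with %{ab}n in a LogFormat).
final class AccessLogNoteLogger implements TestLogger {
    public function log(string $testKey, string $variantKey, string $subjectKey): void {
        if (function_exists('apache_note')) { // absent outside the Apache SAPI
            apache_note('ab', "$testKey.$variantKey");
        }
    }
}

// Forwards every event to all configured loggers.
final class CompositeLogger implements TestLogger {
    private $loggers;
    public function __construct(TestLogger ...$loggers) {
        $this->loggers = $loggers;
    }
    public function log(string $testKey, string $variantKey, string $subjectKey): void {
        foreach ($this->loggers as $logger) {
            $logger->log($testKey, $variantKey, $subjectKey);
        }
    }
}
```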

3P Analytics

• Quick to start

• May be cheap

• Volume?

• Lag time?

• Flexibility / customization?

3P Analytics - how

• Custom variables

• take note of number & size limits

• Custom segments

• Canned metrics

3P Analytics - example

<script>
pageTracker._initData();
pageTracker._setCustomVar(2, "AB", "search_test.variantC", 3);
pageTracker._trackPageview();
</script>

Our own event tracking

• HTML beacons

• Hadoop

• Cloud

[Diagram: web app (HTML, JS), event beacon, event log, Hadoop]

Results

Break / hack

https://github.com/etsy/ab

Building on top of the core API

Test builders

• Capture common patterns

• feature ramp ups

• opt-in experiments

• Help with test design

• weight equalization

• multivariate testing

Automatic Dispatchers

• Separate dispatching and work

• Work with components that have well-defined invocation APIs

• Define a particular level of granularity

• Feel like magic

Dispatcher example - MVC

• View dispatch

• Controller dispatch

• Spring framework, etc.

Selector Registry

• Reuse

• Clarity

• Documentation

$selectorReg = array(
    'staff'     => 'InternalUserSelector',
    'whitelist' => 'WhitelistSelector',
    'percent'   => 'WeightedSelector'
);

Randomized Selector

What does it mean?

• Independent of subject attributes

What does it mean?

• Independent of other tests

• Independent of (coarse-grained) time

Persistence

• Better experience

Persistence

• Better data

• Multi-part tests

• ...but not forever

Ramping up/down

• Vary group sizes

• Reduce risk

• Distribute load

Persistence + Ramping

• Minimize inconsistency

• Ramping up

• Should just add people to the treatment group

• Ramping down

• Should just remove part of the treatment group

rand()

• Explicit persistence

• Cookie

• DB

• Scaling

• Maintenance

Hashing

variant = H(id)

Persistence

Attribute independence

Test independence?

Hashing

variant = H(test id, id)

Persistence

Attribute independence

Test independence

What else?

Weights!

Hashing

h = H(test id, id)

variant = P(h, weights)

Persistence Attribute independence Test independence
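The two formulas can be sketched as follows. This is one possible reading, not the talk's code: the function names, the use of MD5, the ':' separator, and the choice of 28 bits are all assumptions.

```php
<?php
// h = H(test id, id): map the pair to a number in [0, 1).
// Including the test id gives each test its own independent assignment.
function hash_to_unit(string $testId, string $subjectId): float {
    $hex = substr(md5($testId . ':' . $subjectId), 0, 7); // 28 bits fit in any PHP int
    return hexdec($hex) / 0x10000000;
}

// variant = P(h, weights): walk the cumulative weights until h falls inside.
function pick_variant(float $h, array $weights): string {
    $total = array_sum($weights);
    $upper = 0.0;
    foreach ($weights as $name => $weight) {
        $upper += $weight / $total;
        if ($h < $upper) {
            return (string) $name;
        }
    }
    // Guard against floating-point rounding at the top edge.
    return (string) array_key_last($weights);
}

$variant = pick_variant(hash_to_unit('search_test', 'u:7'), ['A' => 50, 'B' => 50]);
```

Nothing is stored: the same test id and subject id always hash to the same h, which is what gives persistence without a cookie or a database row.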

Partitioning

Partition

Ramping up

Partition
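A sketch of why partitioning the hash range makes ramping up consistent, assuming the treatment group owns the slice [0, fraction) of the range. The function names and the 5% and 20% figures are illustrative only. Growing the fraction moves the boundary outward, so it can only add subjects to the treatment group.

```php
<?php
// Hypothetical hash into [0, 1), derived from the test id and subject id.
function ramp_hash(string $testId, string $subjectId): float {
    return hexdec(substr(md5($testId . ':' . $subjectId), 0, 7)) / 0x10000000;
}

// Treatment owns the slice [0, $fraction) of the hash range.
function in_treatment(string $testId, string $subjectId, float $fraction): bool {
    return ramp_hash($testId, $subjectId) < $fraction;
}

// Count subjects who were in treatment at 5% but would fall out at 20%.
$lost = 0;
for ($i = 0; $i < 1000; $i++) {
    $id = "u:$i";
    if (in_treatment('new_search', $id, 0.05) && !in_treatment('new_search', $id, 0.20)) {
        $lost++;
    }
}
```

$lost stays at zero for any hash function, because a value below 0.05 is also below 0.20. Ramping down works the same way in reverse: shrinking the fraction only removes part of the treatment group.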

Which hash function?

• MD5/SHA-256/...

• Test it!

• But be careful...
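One way to "test it" is to hash many ids into buckets and check how even the counts are. The sketch below is an assumption about what such a check could look like, with hypothetical function names; it uses PHP's built-in hash() and a chi-square statistic.

```php
<?php
// Hash $n sequential subject ids for one test and count hits per bucket.
function bucket_counts(string $algo, string $testId, int $n, int $buckets): array {
    $counts = array_fill(0, $buckets, 0);
    for ($i = 0; $i < $n; $i++) {
        $h = hexdec(substr(hash($algo, "$testId:$i"), 0, 7)); // first 28 bits
        $counts[$h % $buckets]++;
    }
    return $counts;
}

// Chi-square statistic against a uniform expectation.
function chi_square(array $counts): float {
    $expected = array_sum($counts) / count($counts);
    $sum = 0.0;
    foreach ($counts as $c) {
        $sum += ($c - $expected) ** 2 / $expected;
    }
    return $sum;
}

$counts = bucket_counts('sha256', 'search_test', 10000, 10);
$stat = chi_square($counts);
```

With 10 buckets there are 9 degrees of freedom, so a statistic well above roughly 16.9 (the 5% critical value) would be a reason to look closer. A single run can exceed that by chance, which is one reason to be careful when reading the result.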

A/B + opt-in

• Need to separate the groups for analysis

• Solution: use more than 2 variants!

• Act according to variant properties

• Track by variant name
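A sketch of the "more than 2 variants" idea: opted-in users get their own variant name but the same properties as the randomly assigned treatment group. The variant names, the 'new_search' property, the 'opted_in' key, and the 50/50 split are assumptions for illustration.

```php
<?php
// Two variant names share the same behaviour but stay separate in the logs.
$variants = [
    'control'   => ['new_search' => false],
    'treatment' => ['new_search' => true],  // randomly assigned
    'optin'     => ['new_search' => true],  // chose to join; analysed separately
];

// $h is the subject's hash value in [0, 1).
function choose_variant(array $user, float $h): string {
    if (!empty($user['opted_in'])) {
        return 'optin';
    }
    return $h < 0.5 ? 'treatment' : 'control';
}
```

The application branches on the variant's properties, while the logger records the variant name, so self-selected users never contaminate the randomized comparison.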

Analysis

... Confidence interval ... something something ... Binomial ... blah blah ...

• How sure are we?

• What if it were random?

Confidence Intervals

Binomial experiments

H T H T T T H T H H

T H T H T T H H T H
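As a worked example, the first row of flips above can be turned into a confidence interval for the proportion of heads. The sketch uses the normal-approximation (Wald) interval, which is an assumption on my part: the talk does not say which interval it uses, and this one is rough for small samples.

```php
<?php
// Normal-approximation (Wald) interval for a proportion; z = 1.96 gives ~95%.
function proportion_interval(int $successes, int $trials, float $z = 1.96): array {
    $p = $successes / $trials;
    $half = $z * sqrt($p * (1 - $p) / $trials);
    return [max(0.0, $p - $half), min(1.0, $p + $half)];
}

$flips = str_split('HTHTTTHTHH');          // first row of flips from the slide
$heads = count(array_keys($flips, 'H'));   // 5 heads out of 10
[$low, $high] = proportion_interval($heads, count($flips));
// $low is about 0.19 and $high about 0.81
```

With only ten flips the interval is very wide, which is the point: small samples cannot distinguish a fair coin from a fairly biased one.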

Results

Dashboards

A few test design tips

What’s the question?

What metrics?

How much better?

• Different roles

• Old vs new

• Novelty

• Habit

• Expectation

• User types vary

• Activity patterns vary

• Site content might vary

• Performance might vary

• Full weeks are often a good starting point

Summary

Better living through experimentation

• More risk taking => better product

• MTTR

• Lower stress

You can too.
