Implementing and analyzing online experiments

Page 1: Implementing and analyzing online experiments

IMPLEMENTING AND ANALYZING ONLINE EXPERIMENTS
SEAN J. TAYLOR | 28 JUL 2015 | MULTITHREADED DATA

Page 2: Implementing and analyzing online experiments

WHO AM I?

• Core Data Science Team at Facebook

• PhD from NYU in Information Systems

• Four academic papers employing online field experiments

• Teach and consult on experimental design at Facebook

http://seanjtaylor.com
http://github.com/seanjtaylor
http://facebook.com/seanjtaylor
@seanjtaylor

Page 3: Implementing and analyzing online experiments

I ASSUME YOU KNOW

• Why causality matters

• A little bit of Python and R

• Basic statistics + linear regression

Page 4: Implementing and analyzing online experiments

SIMPLEST POSSIBLE EXPERIMENT

user_id  version  spent
123      B        $10
596      A        $0
456      A        $4
991      B        $9

def get_version(user_id):
    if user_id % 2:
        return 'A'
    else:
        return 'B'

> t.test(c(0, 4), c(10, 9))

	Welch Two Sample t-test

data:  c(0, 4) and c(10, 9)
t = -3.638, df = 1.1245, p-value = 0.1487
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -27.74338  12.74338
sample estimates:
mean of x mean of y 
      2.0       9.5 

Page 5: Implementing and analyzing online experiments

FIN

Page 6: Implementing and analyzing online experiments

COMMON PROBLEMS

• Type I errors from measuring too many effects

• Type II and M errors from lack of power

• Repeated use of the same population (“pollution”)

• Type I errors from violation of the i.i.d. assumption

• Composing many changes into one experiment

Page 7: Implementing and analyzing online experiments

POWER

OR

THE ONLY WAY TO TRULY FAIL AT AN EXPERIMENT

OR

THE SIZE OF YOUR CONFIDENCE INTERVALS

Page 8: Implementing and analyzing online experiments

ERRORS

• Type I: Thinking your metric changed when it didn’t. We usually bound this at 1% or 5%.

• Type II: Thinking your metric didn’t change when it did. You can control this through better planning.

Page 9: Implementing and analyzing online experiments

HOW TO MAKE TYPE I ERRORS

• Measure a ton of metrics: $ Spent, Time Spent, Survey Satisfaction.

• Find a subgroup it works on: Male <25, Female <25, Male >=25, Female >=25.
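A minimal simulation (not from the slides) of why this inflates Type I errors: with no true effect, testing 4 metrics across 4 subgroups (16 tests) produces at least one "significant" result in over half of experiments.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_tests, n_users = 1000, 16, 2000
false_positives = 0
for _ in range(n_experiments):
    any_significant = False
    for _ in range(n_tests):
        # No true effect: both groups drawn from the same distribution.
        a = rng.normal(0, 1, n_users // 2)
        b = rng.normal(0, 1, n_users // 2)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            any_significant = True
    false_positives += any_significant
print(false_positives / n_experiments)  # roughly 1 - 0.95**16, about 0.56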

Page 10: Implementing and analyzing online experiments

AVOID TYPE II ERRORS WITH POWER

1. Use enough subjects in your experiment.

2. Test a reasonably strong treatment. Remember: you care about the difference.

Page 11: Implementing and analyzing online experiments

POWER ANALYSIS

The first step in designing an experiment is to determine how much data you’ll need to answer your question.

Process:

• Set the smallest effect size you’d like to detect.

• Simulate your experiment 200 times at various sample sizes.

• Count the number of simulated experiments where you correctly reject the null of effect = 0 (sketched in code below).
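A sketch of that process in Python (the effect size, outcome standard deviation, and alpha below are assumptions for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(n_per_group, effect=0.1, sd=1.0, n_sims=200, alpha=0.05):
    # Fraction of simulated experiments that correctly reject effect = 0.
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0, sd, n_per_group)
        treatment = rng.normal(effect, sd, n_per_group)
        if stats.ttest_ind(control, treatment).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

for n in [500, 1000, 2000, 4000]:
    print(n, power(n))  # pick the smallest n with power >= 0.8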

Page 12: Implementing and analyzing online experiments

TYPE M ERRORS

• Magnitude error: reporting an effect size that is too large.

• Happens when your experiment is underpowered AND you only report the significant results (simulated below).
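A quick simulation (assumed numbers, not from the slides) of the mechanism: with a true effect of 0.1 and only 100 users per arm, the estimates that clear p < 0.05 average roughly three times the true effect.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect, n = 0.1, 100  # deliberately underpowered
significant_estimates = []
for _ in range(5000):
    control = rng.normal(0, 1, n)
    treatment = rng.normal(true_effect, 1, n)
    # Keep the estimate only when it is "significant".
    if stats.ttest_ind(control, treatment).pvalue < 0.05:
        significant_estimates.append(treatment.mean() - control.mean())
print(np.mean(significant_estimates))  # far above the true 0.1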

Page 13: Implementing and analyzing online experiments

IMPLEMENTATION

Page 14: Implementing and analyzing online experiments

PLANOUT: KEY IDEAS

• An experiment is just a pseudo-random, serializable mapping from (user, context) → parameters.

• Persistent randomization is implemented through hash functions; salts make experiments orthogonal (see the sketch after this list).

• Always log exposures (parameter assignments) to improve precision and provide a randomization check.

• Namespaces make it possible to run sequential experiments on new blocks of users.

https://facebook.github.io/planout/
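A minimal sketch of the salted-hashing idea (illustrative only, not PlanOut's actual internals):

import hashlib

def assign(salt, user_id, choices):
    # Deterministic: the same (salt, user) always maps to the same arm.
    digest = hashlib.sha1("{}.{}".format(salt, user_id).encode()).hexdigest()
    return choices[int(digest, 16) % len(choices)]

# Persistent across requests:
assert assign("button_text", 212, ["A", "B"]) == assign("button_text", 212, ["A", "B"])
# A different salt re-randomizes independently, making experiments orthogonal:
print(assign("button_text", 212, ["A", "B"]), assign("button_color", 212, ["A", "B"]))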

Page 15: Implementing and analyzing online experiments

A/B TESTING IN PLANOUT

from planout.ops.random import *
from planout.experiment import SimpleExperiment

class ButtonCopyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # `params` is always the first argument.
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )

# Later in your production code:
from myexperiments import ButtonCopyExperiment

e = ButtonCopyExperiment(user_id=212)
print(e.get('button_text'))

# Even later:
e = ButtonCopyExperiment(user_id=212)
e.log_event('purchase', {'amount': 9.43})

Page 16: Implementing and analyzing online experiments

PLANOUT LOGS → DATA

{"inputs":  {"user_id":  212},  "name":  "ButtonCopyExperiment",  "checksum":  "646e69a5",  "params":  {"button_text":  "Buy  later!"},  "time":  1437952369,  "salt":  "ButtonCopyExperiment",  "event":  “exposure"}  

{"inputs":  {"user_id":  212},  "name":  "ButtonCopyExperiment",  "checksum":  "646e69a5",  "params":  {"button_text":  "Buy  later!"},  "time":  1437952369,  "extra_data":  {"amount":  9.43},  "salt":  "ButtonCopyExperiment",  "event":  "purchase"}  

Exposures:

user_id  button_text
123      Buy later!
596      Buy later!
456      Buy now!
991      Buy later!

Purchases:

user_id  amount
123      $12
596      $9
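A sketch of turning these logs into the tables above (the log file name and the left join are assumptions for illustration):

import json
import pandas as pd

records = [json.loads(line) for line in open("planout.log")]

exposures = pd.DataFrame(
    [{"user_id": r["inputs"]["user_id"], **r["params"]}
     for r in records if r["event"] == "exposure"])
purchases = pd.DataFrame(
    [{"user_id": r["inputs"]["user_id"], "amount": r["extra_data"]["amount"]}
     for r in records if r["event"] == "purchase"])

# Left join keeps exposed users who never purchased (amount = 0).
df = exposures.merge(purchases, on="user_id", how="left").fillna({"amount": 0})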

Page 17: Implementing and analyzing online experiments

ADVANCED DESIGN 1: FACTORIAL DESIGN

• Can use conditional logic as well as other random assignment operators: RandomInteger, RandomFloat, WeightedChoice, Sample.

class FactorialExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )
        params.button_color = UniformChoice(
            choices=["blue", "orange"],
            unit=user_id
        )

Page 18: Implementing and analyzing online experiments

ADVANCED DESIGN 2: INCREMENTAL CHANGES

## We're going to try two different button redesigns.
class FirstExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        pass  # ... set some params

class SecondExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        pass  # ... set some params differently

class ButtonNamespace(SimpleNamespace):
    def setup(self):
        self.name = 'button_experiment_sequence'
        self.primary_unit = 'user_id'
        self.num_segments = 1000

    def setup_experiments(self):
        # Allocate and deallocate experiments here.
        # Each experiment gets 100 out of 1000 segments.
        self.add_experiment('first', FirstExperiment, 100)
        self.add_experiment('second', SecondExperiment, 100)
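Usage then looks like a single experiment; a user whose segment is not allocated to any experiment falls back to a default (a sketch, assuming PlanOut's namespace get semantics):

ns = ButtonNamespace(user_id=212)
print(ns.get('button_text', 'Buy now!'))  # default for unallocated segments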

Page 19: Implementing and analyzing online experiments

ADVANCED DESIGN 3: WITHIN-SUBJECTS

Previous experiments persistently assigned the same treatment to each user, but the unit of analysis can be more complex:

class DiscountExperiment(SimpleExperiment):
    def assign(self, params, user_id, item_id):
        params.discount = BernoulliTrial(p=0.1, unit=[user_id, item_id])
        if params.discount:
            params.discount_amount = RandomInteger(
                min=5, max=15, unit=user_id
            )
        else:
            params.discount_amount = 0

e = DiscountExperiment(user_id=212, item_id=2)
print(e.get('discount_amount'))

Page 20: Implementing and analyzing online experiments

ANALYSIS

Page 21: Implementing and analyzing online experiments

THE IDEAL DATA SET

Subject/User  Gender  Age  Button Size  Button Text  Spent  Bounce
Erin          F       22   Large        Buy Now!     $20    0
Ashley        F       29   Large        Buy Later!   $4     0
Gary          M       34   Small        Buy Now!     $0     1
Leo           M       18   Large        Buy Now!     $0     1
Ed            M       46   Small        Buy Later!   $9     0
Sam           M       25   Small        Buy Now!     $5     0

Each row is an independent observation. Gender and Age are pre-experiment covariates; Button Size and Button Text are randomly assigned; Spent and Bounce are metrics.

Page 22: Implementing and analyzing online experiments

SIMPLEST CASE: OLS

> summary(lm(spent ~ button.size, data = df))

Call:
lm(formula = spent ~ button.size, data = df)

Residuals:
    1     2     3     4     5     6 
 10.0  -0.5  -4.5 -10.0   4.5   0.5 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            10.000      5.489   1.822    0.143
factor(button.size)s   -5.500      6.722  -0.818    0.459

Residual standard error: 7.762 on 4 degrees of freedom
Multiple R-squared:  0.1434,	Adjusted R-squared:  -0.07079 
F-statistic: 0.6694 on 1 and 4 DF,  p-value: 0.4592

Page 23: Implementing and analyzing online experiments

DATA REDUCTION

Raw data:

Subject  Xi  Di  Yi
Evan     M   0   1
Ashley   F   0   1
Greg     M   1   0
Leena    F   1   0
Ema      F   0   0
Seamus   M   1   1

Reduced data:

X  D  Y  Cases
M  0  1  1
M  1  1  1
F  0  1  1
F  1  1  0
M  0  0  0
M  1  0  1
F  0  0  1
F  1  0  1

The reduced table has at most (# treatments) × (# groups) × (# outcomes) rows, no matter how large N gets.

Page 24: Implementing and analyzing online experiments

DATA REDUCTION + WEIGHTED OLS

source('css_stats.R')

reduced <- df %>%
  mutate(rounded.spent = round(spent, 0)) %>%
  group_by(button.size, rounded.spent) %>%
  summarise(n = n())

> lm(rounded.spent ~ button.size, data = reduced, weights = n) %>%
+   coeftest(vcov = sandwich.lm)

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)   7.43137    0.45162 16.4548 7.522e-14 ***
button.sizes -2.45178    0.59032 -4.1533 0.0004149 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Page 25: Implementing and analyzing online experiments

FACTORIAL DESIGNS

• Factorial designs identify two types of effects: marginal effects and interactions. You need to fix one group as the baseline.

> coeftest(lm(spent ~ button.size * button.text, data = df))

t test of coefficients:

                          Estimate Std. Error t value  Pr(>|t|)    
(Intercept)                6.79643    0.62998 10.7884 < 2.2e-16 ***
button.sizes              -2.43253    0.86673 -2.8066  0.006064 ** 
button.textn               2.11611    0.86673  2.4415  0.016458 *  
button.sizes:button.textn -2.57660    1.27584 -2.0195  0.046219 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Page 26: Implementing and analyzing online experiments

USING COVARIATES TO GAIN PRECISION

• With simple random assignment, using covariates is not necessary.

• However, you can improve precision of ATE estimates if covariates explain a lot of variation in the potential outcomes.

• Covariates can be added to a linear model, and the SEs should get smaller if the covariates are helpful (see the sketch below).
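A sketch (simulated data; statsmodels) of the precision gain: the treatment-effect estimate is unbiased either way, but adjusting for a predictive pre-experiment covariate shrinks its standard error.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 2000
past_spend = rng.normal(50, 10, n)        # pre-experiment covariate
treated = rng.integers(0, 2, n)           # simple random assignment
spent = 5 + 2 * treated + 0.8 * past_spend + rng.normal(0, 5, n)
df = pd.DataFrame({"spent": spent, "treated": treated,
                   "past_spend": past_spend})

raw = smf.ols("spent ~ treated", data=df).fit()
adj = smf.ols("spent ~ treated + past_spend", data=df).fit()
print(raw.bse["treated"], adj.bse["treated"])  # adjusted SE is much smaller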

Page 27: Implementing and analyzing online experiments

NON-IID DATA

• Repeated observations of the same user are not independent.

• Ditto if you’re experimenting on certain items only.

• If you ignore dependent data, the true confidence intervals are larger than you think.

Subject/User  Item   Button Size  Spent
Erin          Shirt  Large        $20
Erin          Socks  Large        $4
Erin          Pants  Large        $0
Leo           Shirt  Large        $0
Ed            Shirt  Small        $9
Ed            Socks  Small        $5

Page 28: Implementing and analyzing online experiments

THE BOOTSTRAP

1. Generate random sub-samples R1, R2, …, R500 from all your data.

2. Compute statistics or estimate model parameters s1, s2, …, s500 on each sub-sample.

3. You get a distribution over the statistic of interest (e.g. the treatment effect):
   - take the mean as the point estimate
   - CIs == 95% quantiles
   - SEs == standard deviation

[Figure: histogram of the bootstrap replicates of the statistic.]

Page 29: Implementing and analyzing online experiments

USER AND USER-ITEM BOOTSTRAPS

source('css_stats.R')
library(broom)  ## for extracting model coefficients

fitter <- function(.data) {
  lm(spent ~ button.size, data = .data, weights = .weights) %>%
    tidy
}

iid.replicates    <- iid.bootstrap(df, fitter, .combine = bind_rows)
oneway.replicates <- clustered.bootstrap(df, c('user_id'), fitter, .combine = bind_rows)
twoway.replicates <- clustered.bootstrap(df, c('user_id', 'item_id'), fitter, .combine = bind_rows)

> head(iid.replicates)
          term   estimate  std.error statistic      p.value
1  (Intercept)  0.4700000 0.04795154  9.801561 6.296919e-18
2 button.sizes -0.2200000 0.08003333 -2.748855 6.695621e-03
3  (Intercept)  0.4250000 0.05307832  8.007036 5.768641e-13
4 button.sizes -0.1750000 0.08456729 -2.069358 4.049329e-02
5  (Intercept)  0.4137931 0.05141050  8.048805 4.118301e-13
6 button.sizes -0.1429598 0.08621804 -1.658119 9.965016e-02
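The iid.bootstrap and clustered.bootstrap helpers come from the tutorial's css_stats.R. A plain-Python sketch of the clustered version (assumed column names; not the css_stats.R implementation):

import numpy as np
import pandas as pd

def clustered_bootstrap(df, cluster_col, statistic, n_replicates=500, seed=0):
    # Resample whole clusters (e.g. users) so a user's repeated
    # observations stay together within each replicate.
    rng = np.random.default_rng(seed)
    clusters = df[cluster_col].unique()
    replicates = []
    for _ in range(n_replicates):
        sampled = rng.choice(clusters, size=len(clusters), replace=True)
        replicate = pd.concat([df[df[cluster_col] == c] for c in sampled])
        replicates.append(statistic(replicate))
    return np.array(replicates)

# Example: difference in mean spend between the two button sizes.
# reps = clustered_bootstrap(df, "user_id",
#     lambda d: d.groupby("button_size")["spent"].mean().diff().iloc[-1])
# np.percentile(reps, [2.5, 97.5])  # bootstrap 95% CI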

Page 30: Implementing and analyzing online experiments

DEPENDENCE CHANGES CONFIDENCE INTERVALS

Page 31: Implementing and analyzing online experiments

DATA REDUCTION WITH DEPENDENT DATA

Subject  Di  Yij
Evan     1   1
Evan     1   0
Ashley   0   1
Ashley   0   1
Ashley   0   1
Greg     1   0
Leena    1   0
Leena    1   1
Ema      0   0
Seamus   1   1

1. Create bootstrap replicates R1, R2, R3 by resampling whole subjects.

2. Reduce the replicates as if they're i.i.d., yielding r1, r2, r3.

3. Compute statistics s1, s2, s3 on the reduced data.
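A sketch of that workflow (plain pandas; the column names are assumptions): each replicate resamples whole subjects, is reduced to (treatment, outcome, count) rows, and yields one replicate of the treatment effect.

import numpy as np
import pandas as pd

def replicate_effect(df, rng):
    # 1. Resample whole subjects (clustered bootstrap).
    users = df["user_id"].unique()
    sampled = rng.choice(users, size=len(users), replace=True)
    rep = pd.concat([df[df["user_id"] == u] for u in sampled])
    # 2. Reduce: one row per (treated, outcome) cell with a count.
    reduced = rep.groupby(["treated", "outcome"]).size().reset_index(name="n")
    # 3. Weighted mean outcome per arm, then the difference.
    means = reduced.groupby("treated").apply(
        lambda g: np.average(g["outcome"], weights=g["n"]))
    return means[1] - means[0]

# rng = np.random.default_rng(0)
# effects = [replicate_effect(df, rng) for _ in range(500)]
# np.percentile(effects, [2.5, 97.5])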

Page 32: Implementing and analyzing online experiments

THANKS! HERE ARE SOME RESOURCES:

• Me: http://seanjtaylor.com

• These slides: http://www.slideshare.net/seanjtaylor/implementing-and-analyzing-online-experiments

• Full Featured Tutorial: http://eytan.github.io/www-15-tutorial/

• “Field Experiments” by Gerber and Green

