Data Science Lunch Seminar: A/B Testing Theory and Practice
Nicholas Arcolano, 19 September 2016
What is an A/B test?

Consider a random experiment with a binary outcome: a coin flip, disease recovery, purchasing a product ("conversion"). Assume there is some true "baseline" probability of a positive outcome. We change something that (we think) will alter this baseline. How do we know if it actually did?

Experiment! The original version is the control, a.k.a. "variant A". The new version is the test, a.k.a. "variant B".

If A and B are "different enough", we decide our intervention had an effect; otherwise, we decide that it didn't.
A"simple"exampleConsidertwocoins,withunknownprobabilitesofheads and ,andassumeoneofthefollowingtwohypothesesistrue:
(nullhypothesis):(alternatehypothesis):
Howdowedecidewhichistrue?Experiment!
FlipthembothandseehowdifferenttheiroutcomesareGiven flipsofeachcoin,wewillobservesomenumber headsforcoin#1and headsforcoin#2
p1 p2
H0 =p1 p2H1 <p1 p2
n m1m2
Ifweknewbothdistributions,wecouldjustdotheoptimalthingprescribedbyclassicalbinaryhypothesistesting—butthiswouldrequireknowing andInstead,weneedsomeotherstatisticaltestthatwilltake , ,and andgiveusanumberwecanthresholdtomakeadecision
p1 p2n m1 m2
A review of statistical tests, errors, and power

Basic approach to statistical testing:

Determine a test statistic: a random variable that depends on n, m1, and m2. We want a statistic whose distribution given the null hypothesis is computable (exactly or approximately). If the data we observe puts us in the tails of the distribution, we say that H0 is too unlikely and "reject the null hypothesis" (choose H1).

p-value: tail probability of the sampling distribution given that the null hypothesis is true (p-value too small, reject the null).
Often we summarize the data as a 2x2 contingency table:

                Heads      Tails          Row totals
Coin #1         m1         n - m1         n
Coin #2         m2         n - m2         n
Column totals   m1 + m2    2n - m1 - m2   2n

A statistical test takes this table and produces a p-value, which we then threshold (e.g. p < 0.05).
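As a quick sketch, the table above can be assembled from raw counts and its margins checked (the counts here are illustrative):

```python
# Build the 2x2 contingency table for two coins (illustrative counts)
n = 100            # flips per coin
m1, m2 = 40, 60    # heads observed for coin #1 and coin #2

table = [[m1, n - m1],   # coin #1: heads, tails
         [m2, n - m2]]   # coin #2: heads, tails

row_totals = [sum(row) for row in table]        # each should equal n
col_totals = [m1 + m2, (n - m1) + (n - m2)]     # total heads, total tails
grand_total = sum(row_totals)                   # should equal 2n

print(row_totals, col_totals, grand_total)  # [100, 100] [100, 100] 200
```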
Types of errors

Four potential outcomes of the test:

H1 is true, choose H1: true positive (correct detection)
H0 is true, choose H0: true negative
H0 is true, choose H1: false positive (Type I error)
H1 is true, choose H0: false negative (Type II error)
Power and false positive rate

Denote the probabilities of false positives and false negatives as α and β. Since the p-value represents the tail probability under the null, rejecting when p < α corresponds to a false positive rate of α (for a one-sided test). Refer to the probability of correct detection

Pr(choose H1 | H1 true) = 1 − β

as the power of the test.
Relationship to precision and recall

Assume we do this test a large number of times, so that observed rates of success/failure represent true probabilities. Count each possible outcome: TP, TN, FP, FN.

False alarm rate: α = FP / (FP + TN)
Recall (correct detection rate): R = 1 − β = TP / (TP + FN)
Precision: P = TP / (TP + FP)

We also have a prior probability for H1:

π = (TP + FN) / (TP + FN + TN + FP)

Traditional hypothesis testing doesn't really take this into account. The relationship between α, β, precision, and the prior is given by

α = (1 − β) · [(1 − P) / P] · [π / (1 − π)]

So, for a test with fixed power and false positive rate, precision will scale with the prior probability of H1.
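A small sketch makes the scaling concrete: writing the expected counts per test as TP = (1 − β)·π and FP = α·(1 − π), we can compute precision directly for a few (hypothetical) priors and confirm the relation above:

```python
# Precision as a function of power (1 - beta), false positive rate (alpha),
# and prior probability pi of H1. Expected counts per test:
#   TP = (1 - beta) * pi,   FP = alpha * (1 - pi)
def precision(alpha, beta, pi):
    tp = (1 - beta) * pi
    fp = alpha * (1 - pi)
    return tp / (tp + fp)

alpha, beta = 0.05, 0.05
for pi in [0.5, 0.1, 0.01]:
    print(pi, precision(alpha, beta, pi))   # precision drops as the prior drops

# Sanity check: alpha = (1 - beta) * (1 - P)/P * pi/(1 - pi)
# recovers alpha from the computed precision
P = precision(alpha, beta, 0.1)
print((1 - beta) * ((1 - P) / P) * (0.1 / 0.9))  # ~0.05
```

With a 1% prior, even a 95%-power test at α = 0.05 has precision well under 20%: most "detections" are false positives.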
Examples of tests
Fisher's exact test

Observe that under the null, conditional on the row and column totals, the cell counts follow a hypergeometric distribution. Reject the null if the observed table produces a p-value less than the given threshold.

"Exact test": its validity doesn't hold only when n is large. Typically used when sample sizes are "small". Since the distribution can only take on discrete values, the test can be conservative.
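As a sketch with scipy, using the one-sided alternative that matches H1: p1 < p2:

```python
from scipy import stats

n = 100
m1, m2 = 40, 60
table = [[m1, n - m1], [m2, n - m2]]

# One-sided Fisher's exact test of H1: coin #1's heads probability
# is less than coin #2's
odds_ratio, pval = stats.fisher_exact(table, alternative='less')
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))
```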
Pearson's chi-squared test

Compare the observed frequencies of success m1/n and m2/n. If H0 is true, then the variance of m1/n − m2/n is

σ² = 2 π̂ (1 − π̂) / n, where π̂ = (m1 + m2) / (2n)

The test statistic

z² = (m1/n − m2/n)² / σ²

under the null converges to a χ² distribution. Compute the chi-square tail probability of the test statistic and reject the null if it falls below the threshold.
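We can check this formula against scipy's implementation; with the Yates continuity correction disabled for the 2x2 table, the two statistics agree exactly:

```python
from scipy import stats

n = 100
m1, m2 = 40, 60

# Pooled estimate of the heads probability under the null
pi_hat = (m1 + m2) / (2.0 * n)

# Variance of m1/n - m2/n under H0
sigma2 = 2 * pi_hat * (1 - pi_hat) / n

# Chi-squared test statistic (1 degree of freedom)
z2 = (m1 / float(n) - m2 / float(n)) ** 2 / sigma2

# scipy's version, with the Yates continuity correction turned off
# so that it matches the formula above
table = [[m1, n - m1], [m2, n - m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table, correction=False)

print(z2, chi2)                        # identical statistics
print(stats.chi2.sf(z2, df=1), pval)   # identical p-values
```

(The earlier `chi2_contingency` calls in this talk use the default `correction=True`, which is why their p-values differ slightly from the uncorrected statistic.)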
Back to our example

Recall:

H0 (null hypothesis): p1 = p2
H1 (alternate hypothesis): p1 < p2

Assume we get to flip each coin n = 100 times, and let's look at some examples for each hypothesis.
Case #1: Alternate hypothesis is true

In [3]:
n = 100
p1 = 0.40
p2 = 0.60

In [4]:
# Compute distributions
x = np.arange(0, n+1)
pmf1 = stats.binom.pmf(x, n, p1)
pmf2 = stats.binom.pmf(x, n, p2)
plot(x, pmf1, pmf2)

In [5]:
# Example outcomes
m1, m2 = 40, 60
table = [[m1, n-m1], [m2, n-m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table)
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))
0.00720957076474 (reject H0)
m1, m2 = 43, 57
table = [[m1, n-m1], [m2, n-m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table)
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))
0.0659920550593 (accept H0)
Case #2: Null hypothesis is true

In [6]:
n = 100
p1 = 0.50
p2 = 0.50

In [7]:
# Compute distributions
x = np.arange(0, n+1)
pmf1 = stats.binom.pmf(x, n, p1)
pmf2 = stats.binom.pmf(x, n, p2)
plot(x, pmf1, pmf2)

In [8]:
# Example outcomes
m1, m2 = 49, 51
table = [[m1, n-m1], [m2, n-m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table)
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))

0.887537083982 (accept H0)

# Example outcomes
m1, m2 = 42, 58
table = [[m1, n-m1], [m2, n-m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table)
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))
0.0338948535247 (reject H0)
Sample size calculation

Often what we really want to know is: how many flips do we need to reach a certain level of confidence that we are really observing a difference?

Factors affecting required sample size:

Baseline probability p1: how often does anything interesting happen?
Minimum observable difference that we want to be able to detect between p2 and p1
Desired power of the test: if there is a real difference, how likely do we want to be to observe it?
Desired false positive rate of the test

So in practice, if we have a good guess at p1 and the minimum p2 that we can accept detecting, we can estimate a minimum n.
Casagrande et al. (1978)

An approximate formula gives the desired sample size n as a function of p1, p2, α, and β:

n = A · [ (1 + √(1 + 4(p1 − p2)/A)) / (2(p1 − p2)) ]²

where A is a χ² "correction factor" given by

A = [ z_{1−α} √(2 p̄ (1 − p̄)) + z_{1−β} √(p1(1 − p1) + p2(1 − p2)) ]²

with p̄ = (p1 + p2)/2, and where z_p denotes the standard normal quantile function, i.e. z_p = Φ⁻¹(p) is the location of the p-th quantile for N(0, 1).
Example

In [9]:
p1, p2 = 0.40, 0.60
alpha = 0.05
beta = 0.05

# Evaluate quantile functions
p_bar = (p1 + p2)/2.0
za = stats.norm.ppf(1 - alpha/2)  # Two-sided test
zb = stats.norm.ppf(1 - beta)

# Compute correction factor
A = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2

# Estimate samples required
n = A*(((1 + np.sqrt(1 + 4*(p1-p2)/A))) / (2*(p1-p2)))**2

print(n)
149.285261921
A more practical (and scarier) example

Assume we have 5.00% conversion on something we care about (e.g. click-through on a purchase page). We introduce a feature that we think will change conversions by 3% (i.e. from 5.00% to 5.15%). We want 95% power and a 5% false positive rate.
In [10]:
p1, p2 = 0.0500, 0.0515
alpha = 0.05
beta = 0.05

# Evaluate quantile functions
p_bar = (p1 + p2)/2.0
za = stats.norm.ppf(1 - alpha/2)  # Two-sided test
zb = stats.norm.ppf(1 - beta)

# Compute correction factor
A = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2

# Estimate samples required
n = A*(((1 + np.sqrt(1 + 4*(p1-p2)/A))) / (2*(p1-p2)))**2

print(n)

555118.763831

So, for test and control combined, we'll need 2n ≈ 1.1 million users.
Also, let's verify that this calculation even works...
In [11]:
n = 555119
n_trials = 10000

# Simulate experimental results when null is true
control0 = stats.binom.rvs(n, p1, size=n_trials)
test0 = stats.binom.rvs(n, p1, size=n_trials)  # Test and control are the same
tables0 = [[[a, n-a], [b, n-b]] for a, b in zip(control0, test0)]
results0 = [stats.chi2_contingency(T) for T in tables0]
decisions0 = [x[1] <= alpha for x in results0]

# Simulate experimental results when alternate is true
control1 = stats.binom.rvs(n, p1, size=n_trials)
test1 = stats.binom.rvs(n, p2, size=n_trials)  # Test and control are different
tables1 = [[[a, n-a], [b, n-b]] for a, b in zip(control1, test1)]
results1 = [stats.chi2_contingency(T) for T in tables1]
decisions1 = [x[1] <= alpha for x in results1]

# Compute false alarm and correct detection rates
alpha_est = sum(decisions0)/float(n_trials)
power_est = sum(decisions1)/float(n_trials)

print('Theoretical false alarm rate = {:0.4f}, '.format(alpha) +
      'empirical false alarm rate = {:0.4f}'.format(alpha_est))
print('Theoretical power = {:0.4f}, '.format(1 - beta) +
      'empirical power = {:0.4f}'.format(power_est))
Theoretical false alarm rate = 0.0500, empirical false alarm rate = 0.0482
Theoretical power = 0.9500, empirical power = 0.9466
What if n is too big?

The main things influencing n are:

How extreme p1 is: very rare successes make it hard to reach significance
The difference between p1 and p2: small differences are much harder to measure

What can we do if n is too big to handle?

Typically we won't mess with α and β too much. So, our only options are to adjust what we expect to get for p1 and p2 (i.e. change our minimum measurable effect), or to try to increase p1 by measuring something that is more common (e.g. clicks instead of purchases).
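To make the last point concrete, here is the sample size formula from above wrapped in a function and evaluated at two baselines with the same 3% relative lift (the 40% click-through baseline is a made-up illustration):

```python
import numpy as np
from scipy import stats

def sample_size(p1, p2, alpha=0.05, beta=0.05):
    # Same approximate formula as in the cells above
    p_bar = (p1 + p2) / 2.0
    za = stats.norm.ppf(1 - alpha/2)  # Two-sided test
    zb = stats.norm.ppf(1 - beta)
    A = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2
    return A*((1 + np.sqrt(1 + 4*(p1-p2)/A)) / (2*(p1-p2)))**2

# Same 3% relative lift, measured on a rare event vs. a common one
n_rare = sample_size(0.05, 0.05 * 1.03)     # e.g. purchases
n_common = sample_size(0.40, 0.40 * 1.03)   # e.g. clicks
print(n_rare, n_common)  # the rarer event needs roughly 10x more samples per arm
```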
Practical issues with A/B testing

Sometimes it's hard to target the right group (e.g. email tests).

It's easy to screw them up:
Unexpected variations between control and test
Contamination between tests (test crossover)
Randomization issues (e.g. individuals vs. groups)

People (especially those outside of data science) are tempted to abuse them:
Multiple testing
Searching for false positives

Issue of prior probabilities:
Can we know if a test is a "sure thing" or not? If we did, then should we even be testing it?

Overall, you can spend a lot of time and effort, especially if you want to measure small changes in rare phenomena.
Some alternatives to traditional A/B testing

Multi-armed bandit theory

Approaches for simultaneous exploration and exploitation. Given a set of random experiments I could perform, how do I choose among them (in order and quantity)? Appropriate when you want to "earn while you learn". Good for quickly exploiting short windows of opportunity.
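A minimal sketch of the idea, using an epsilon-greedy bandit over two Bernoulli "variants" (the conversion rates and epsilon here are hypothetical, not from the talk):

```python
import random

random.seed(0)
true_rates = [0.05, 0.06]   # hypothetical conversion rates for variants A and B
counts = [0, 0]             # visitors sent to each variant
successes = [0, 0]
epsilon = 0.1               # fraction of traffic reserved for exploration

for _ in range(100000):
    if random.random() < epsilon:
        arm = random.randrange(2)   # explore: pick a variant at random
    else:                           # exploit: pick the best current estimate
        est = [successes[i] / counts[i] if counts[i] else 0.0 for i in range(2)]
        arm = max(range(2), key=lambda i: est[i])
    counts[arm] += 1
    if random.random() < true_rates[arm]:
        successes[arm] += 1

# Over time, most traffic typically shifts to the better-performing variant,
# while each arm keeps receiving some exploration traffic
print(counts)
```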
Sequential testing

In traditional ("fixed horizon") testing, we can't keep looking at the data as it comes in and then quit when we're successful, because we will inflate our false positive rate.

Benjamini and Hochberg (1995): an approach to controlling the false discovery rate for sequential measurements. A likelihood ratio test that converges to the "true" false discovery rate over time. This is what the Optimizely stats engine is built on.
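The peeking problem is easy to demonstrate: run a null experiment, check a two-proportion z-test at repeated checkpoints, and stop at the first "significant" result. The realized false positive rate climbs well above the nominal 5%. (A quick illustrative simulation, not the sequential procedure itself; the checkpoint schedule is arbitrary.)

```python
import numpy as np

rng = np.random.RandomState(0)
n_trials = 2000
n_max = 5000
checkpoints = range(500, n_max + 1, 500)   # peek every 500 samples per arm
p = 0.5                                    # null is true: A and B are identical

false_positives = 0
for _ in range(n_trials):
    a = rng.rand(n_max) < p
    b = rng.rand(n_max) < p
    for k in checkpoints:
        # Two-proportion z-test on the first k samples of each arm
        p_hat = (a[:k].sum() + b[:k].sum()) / (2.0 * k)
        se = np.sqrt(2 * p_hat * (1 - p_hat) / k)
        if abs(a[:k].mean() - b[:k].mean()) / se > 1.96:   # nominal alpha = 0.05
            false_positives += 1
            break   # stop the "experiment" at the first significant peek

rate = false_positives / float(n_trials)
print(rate)   # substantially higher than the nominal 0.05
```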
Not actually testing

We don't always need to A/B test. Testing requires engineering and data science resources. The potential upside (e.g. in terms of saved future effort or mitigation of risk) has to outweigh the cost of developing, performing, and analyzing the test.