Data Science Lunch Seminar: A/B Testing Theory and Practice
Nicholas Arcolano, 19 September 2016
What is an A/B test?

Consider a random experiment with a binary outcome: a coin flip, disease recovery, purchasing a product ("conversion"). Assume there is some true "baseline" probability of a positive outcome. We change something that (we think) will alter this baseline. How do we know if it actually did?

Experiment! The original version is the control, a.k.a. "variant A". The new version is the test, a.k.a. "variant B".

If A and B are "different enough", we decide our intervention had an effect; otherwise, we decide that it didn't.
A"simple"exampleConsidertwocoins,withunknownprobabilitesofheads and ,andassumeoneofthefollowingtwohypothesesistrue:
(nullhypothesis):(alternatehypothesis):
Howdowedecidewhichistrue?Experiment!
FlipthembothandseehowdifferenttheiroutcomesareGiven flipsofeachcoin,wewillobservesomenumber headsforcoin#1and headsforcoin#2
p1 p2
H0 =p1 p2H1 <p1 p2
n m1m2
Ifweknewbothdistributions,wecouldjustdotheoptimalthingprescribedbyclassicalbinaryhypothesistesting—butthiswouldrequireknowing andInstead,weneedsomeotherstatisticaltestthatwilltake , ,and andgiveusanumberwecanthresholdtomakeadecision
p1 p2n m1 m2
A review of statistical tests, errors, and power

Basic approach to statistical testing:

Determine a test statistic: a random variable that depends on n, m1, and m2. We want a statistic whose distribution given the null hypothesis is computable (exactly or approximately). If the data we observe puts us in the tails of the distribution, we say that H0 is too unlikely and "reject the null hypothesis" (choose H1).

p-value: tail probability of the sampling distribution given that the null hypothesis is true (p-value too small, reject the null).
Often we summarize the data as a 2x2 contingency table:

                Heads      Tails          Row totals
Coin #1         m1         n - m1         n
Coin #2         m2         n - m2         n
Column totals   m1 + m2    2n - m1 - m2   2n

A statistical test takes this table and produces a p-value, which we then threshold (e.g. p < 0.05).
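As a quick sketch, the table above can be assembled from raw counts and its margins checked (the counts here are illustrative):

```python
# Build the 2x2 contingency table for two coins (illustrative counts)
n = 100            # flips per coin
m1, m2 = 40, 60    # heads observed for coin #1 and coin #2

table = [[m1, n - m1],   # coin #1: heads, tails
         [m2, n - m2]]   # coin #2: heads, tails

row_totals = [sum(row) for row in table]        # each should equal n
col_totals = [m1 + m2, (n - m1) + (n - m2)]     # total heads, total tails
grand_total = sum(row_totals)                   # should equal 2n

print(row_totals, col_totals, grand_total)  # [100, 100] [100, 100] 200
```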
Types of errors

Four potential outcomes of the test:

H1 is true, choose H1: true positive (correct detection)
H0 is true, choose H0: true negative
H0 is true, choose H1: false positive (Type I error)
H1 is true, choose H0: false negative (Type II error)
Power and false positive rate

Denote the probabilities of false positives and false negatives as α and β. Since the p-value represents the tail probability under the null, rejecting when p < α corresponds to a false positive rate of α (for a one-sided test). Refer to the probability of correct detection

Pr(choose H1 | H1 true) = 1 − β

as the power of the test.
Relationship to precision and recall

Assume we do this test a large number of times, so that observed rates of success/failure represent true probabilities. Count each possible outcome: TP, TN, FP, FN.

False alarm rate: α = FP / (FP + TN)
Recall (correct detection rate): R = 1 − β = TP / (TP + FN)
Precision: P = TP / (TP + FP)

We also have a prior probability for H1:

π = (TP + FN) / (TP + FN + TN + FP)

Traditional hypothesis testing doesn't really take this into account. The relationship between α, β, precision, and the prior is given by

α = (1 − β) · [(1 − P) / P] · [π / (1 − π)]

So, for a test with fixed power and false positive rate, precision will scale with the prior probability of H1.
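A small sketch makes the scaling concrete: writing the expected counts per test as TP = (1 − β)·π and FP = α·(1 − π), we can compute precision directly for a few (hypothetical) priors and confirm the relation above:

```python
# Precision as a function of power (1 - beta), false positive rate (alpha),
# and prior probability pi of H1. Expected counts per test:
#   TP = (1 - beta) * pi,   FP = alpha * (1 - pi)
def precision(alpha, beta, pi):
    tp = (1 - beta) * pi
    fp = alpha * (1 - pi)
    return tp / (tp + fp)

alpha, beta = 0.05, 0.05
for pi in [0.5, 0.1, 0.01]:
    print(pi, precision(alpha, beta, pi))   # precision drops as the prior drops

# Sanity check: alpha = (1 - beta) * (1 - P)/P * pi/(1 - pi)
# recovers alpha from the computed precision
P = precision(alpha, beta, 0.1)
print((1 - beta) * ((1 - P) / P) * (0.1 / 0.9))  # ~0.05
```

With a 1% prior, even a 95%-power test at α = 0.05 has precision well under 20%: most "detections" are false positives.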
Examples of tests
Fisher's exact test

Observe that under the null, conditional on the row and column totals, the cell counts follow a hypergeometric distribution. Reject the null if the observed table produces a p-value less than the given threshold.

"Exact test": its validity doesn't hold only when n is large. Typically used when sample sizes are "small". Since the distribution can only take on discrete values, the test can be conservative.
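As a sketch with scipy, using the one-sided alternative that matches H1: p1 < p2:

```python
from scipy import stats

n = 100
m1, m2 = 40, 60
table = [[m1, n - m1], [m2, n - m2]]

# One-sided Fisher's exact test of H1: coin #1's heads probability
# is less than coin #2's
odds_ratio, pval = stats.fisher_exact(table, alternative='less')
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))
```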
Pearson's chi-squared test

Compare the observed frequencies of success m1/n and m2/n. If H0 is true, then the variance of m1/n − m2/n is

σ² = 2 π̂ (1 − π̂) / n, where π̂ = (m1 + m2) / (2n)

The test statistic

z² = (m1/n − m2/n)² / σ²

under the null converges to a χ² distribution. Compute the chi-square tail probability of the test statistic and reject the null if it falls below the threshold.
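We can check this formula against scipy's implementation; with the Yates continuity correction disabled for the 2x2 table, the two statistics agree exactly:

```python
from scipy import stats

n = 100
m1, m2 = 40, 60

# Pooled estimate of the heads probability under the null
pi_hat = (m1 + m2) / (2.0 * n)

# Variance of m1/n - m2/n under H0
sigma2 = 2 * pi_hat * (1 - pi_hat) / n

# Chi-squared test statistic (1 degree of freedom)
z2 = (m1 / float(n) - m2 / float(n)) ** 2 / sigma2

# scipy's version, with the Yates continuity correction turned off
# so that it matches the formula above
table = [[m1, n - m1], [m2, n - m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table, correction=False)

print(z2, chi2)                        # identical statistics
print(stats.chi2.sf(z2, df=1), pval)   # identical p-values
```

(The earlier `chi2_contingency` calls in this talk use the default `correction=True`, which is why their p-values differ slightly from the uncorrected statistic.)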
Back to our example

Recall:

H0 (null hypothesis): p1 = p2
H1 (alternate hypothesis): p1 < p2

Assume we get to flip each coin n = 100 times, and let's look at some examples for each hypothesis.
Case #1: Alternate hypothesis is true

In [3]:
n = 100
p1 = 0.40
p2 = 0.60

In [4]:
# Compute distributions
x = np.arange(0, n+1)
pmf1 = stats.binom.pmf(x, n, p1)
pmf2 = stats.binom.pmf(x, n, p2)
plot(x, pmf1, pmf2)

In [5]:
# Example outcomes
m1, m2 = 40, 60
table = [[m1, n-m1], [m2, n-m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table)
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))
0.00720957076474 (reject H0)
m1, m2 = 43, 57
table = [[m1, n-m1], [m2, n-m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table)
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))
0.0659920550593 (accept H0)
Case #2: Null hypothesis is true

In [6]:
n = 100
p1 = 0.50
p2 = 0.50

In [7]:
# Compute distributions
x = np.arange(0, n+1)
pmf1 = stats.binom.pmf(x, n, p1)
pmf2 = stats.binom.pmf(x, n, p2)
plot(x, pmf1, pmf2)

In [8]:
# Example outcomes
m1, m2 = 49, 51
table = [[m1, n-m1], [m2, n-m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table)
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))

0.887537083982 (accept H0)

# Example outcomes
m1, m2 = 42, 58
table = [[m1, n-m1], [m2, n-m2]]
chi2, pval, dof, expected = stats.chi2_contingency(table)
decision = 'reject H0' if pval < 0.05 else 'accept H0'
print('{} ({})'.format(pval, decision))
0.0338948535247 (reject H0)
Sample size calculation

Often what we really want to know is: how many flips do we need to reach a certain level of confidence that we are really observing a difference?

Factors affecting required sample size:

Baseline probability p1: how often does anything interesting happen?
Minimum observable difference that we want to be able to detect between p2 and p1
Desired power of the test: if there is a real difference, how likely do we want to be to observe it?
Desired false positive rate of the test

So in practice, if we have a good guess at p1 and the minimum p2 that we can accept detecting, we can estimate a minimum n.
Casagrande et al. (1978)

An approximate formula gives the desired sample size n as a function of p1, p2, α, and β:

n = A · [ (1 + √(1 + 4(p1 − p2)/A)) / (2(p1 − p2)) ]²

where A is a χ² "correction factor" given by

A = [ z_{1−α} √(2 p̄ (1 − p̄)) + z_{1−β} √(p1(1 − p1) + p2(1 − p2)) ]²

with p̄ = (p1 + p2)/2, and where z_p denotes the standard normal quantile function, i.e. z_p = Φ⁻¹(p) is the location of the p-th quantile for N(0, 1).
Example

In [9]:
p1, p2 = 0.40, 0.60
alpha = 0.05
beta = 0.05

# Evaluate quantile functions
p_bar = (p1 + p2)/2.0
za = stats.norm.ppf(1 - alpha/2)  # Two-sided test
zb = stats.norm.ppf(1 - beta)

# Compute correction factor
A = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2

# Estimate samples required
n = A*(((1 + np.sqrt(1 + 4*(p1-p2)/A))) / (2*(p1-p2)))**2

print(n)
149.285261921
A more practical (and scarier) example

Assume we have 5.00% conversion on something we care about (e.g. click-through on a purchase page). We introduce a feature that we think will change conversions by 3% (i.e. from 5.00% to 5.15%). We want 95% power and a 5% false positive rate.
In [10]:
p1, p2 = 0.0500, 0.0515
alpha = 0.05
beta = 0.05

# Evaluate quantile functions
p_bar = (p1 + p2)/2.0
za = stats.norm.ppf(1 - alpha/2)  # Two-sided test
zb = stats.norm.ppf(1 - beta)

# Compute correction factor
A = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2

# Estimate samples required
n = A*(((1 + np.sqrt(1 + 4*(p1-p2)/A))) / (2*(p1-p2)))**2

print(n)

555118.763831

So, for test and control combined, we'll need 2n ≈ 1.1 million users.
Also, let's verify that this calculation even works...
In [11]:
n = 555119
n_trials = 10000

# Simulate experimental results when null is true
control0 = stats.binom.rvs(n, p1, size=n_trials)
test0 = stats.binom.rvs(n, p1, size=n_trials)  # Test and control are the same
tables0 = [[[a, n-a], [b, n-b]] for a, b in zip(control0, test0)]
results0 = [stats.chi2_contingency(T) for T in tables0]
decisions0 = [x[1] <= alpha for x in results0]

# Simulate experimental results when alternate is true
control1 = stats.binom.rvs(n, p1, size=n_trials)
test1 = stats.binom.rvs(n, p2, size=n_trials)  # Test and control are different
tables1 = [[[a, n-a], [b, n-b]] for a, b in zip(control1, test1)]
results1 = [stats.chi2_contingency(T) for T in tables1]
decisions1 = [x[1] <= alpha for x in results1]

# Compute false alarm and correct detection rates
alpha_est = sum(decisions0)/float(n_trials)
power_est = sum(decisions1)/float(n_trials)

print('Theoretical false alarm rate = {:0.4f}, '.format(alpha) +
      'empirical false alarm rate = {:0.4f}'.format(alpha_est))
print('Theoretical power = {:0.4f}, '.format(1 - beta) +
      'empirical power = {:0.4f}'.format(power_est))
Theoretical false alarm rate = 0.0500, empirical false alarm rate = 0.0482
Theoretical power = 0.9500, empirical power = 0.9466
What if n is too big?

The main things influencing n are:

How extreme p1 is: very rare successes make it hard to reach significance
The difference between p1 and p2: small differences are much harder to measure

What can we do if n is too big to handle?

Typically we won't mess with α and β too much. So, our only options are to adjust what we expect to get for p1 and p2 (i.e. change our minimum measurable effect), or to try to increase p1 by measuring something that is more common (e.g. clicks instead of purchases).
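To make the last point concrete, here is the sample size formula from above wrapped in a function and evaluated at two baselines with the same 3% relative lift (the 40% click-through baseline is a made-up illustration):

```python
import numpy as np
from scipy import stats

def sample_size(p1, p2, alpha=0.05, beta=0.05):
    # Same approximate formula as in the cells above
    p_bar = (p1 + p2) / 2.0
    za = stats.norm.ppf(1 - alpha/2)  # Two-sided test
    zb = stats.norm.ppf(1 - beta)
    A = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2
    return A*((1 + np.sqrt(1 + 4*(p1-p2)/A)) / (2*(p1-p2)))**2

# Same 3% relative lift, measured on a rare event vs. a common one
n_rare = sample_size(0.05, 0.05 * 1.03)     # e.g. purchases
n_common = sample_size(0.40, 0.40 * 1.03)   # e.g. clicks
print(n_rare, n_common)  # the rarer event needs roughly 10x more samples per arm
```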
Practical issues with A/B testing

Sometimes it's hard to target the right group (e.g. email tests).

It's easy to screw them up:
Unexpected variations between control and test
Contamination between tests (test crossover)
Randomization issues (e.g. individuals vs. groups)

People (especially those outside of data science) are tempted to abuse them:
Multiple testing
Searching for false positives

Issue of prior probabilities:
Can we know if a test is a "sure thing" or not? If we did, then should we even be testing it?

Overall, you can spend a lot of time and effort, especially if you want to measure small changes in rare phenomena.
Some alternatives to traditional A/B testing

Multi-armed bandit theory

Approaches for simultaneous exploration and exploitation. Given a set of random experiments I could perform, how do I choose among them (in order and quantity)? Appropriate when you want to "earn while you learn". Good for quickly exploiting short windows of opportunity.
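A minimal sketch of the idea, using an epsilon-greedy bandit over two Bernoulli "variants" (the conversion rates and epsilon here are hypothetical, not from the talk):

```python
import random

random.seed(0)
true_rates = [0.05, 0.06]   # hypothetical conversion rates for variants A and B
counts = [0, 0]             # visitors sent to each variant
successes = [0, 0]
epsilon = 0.1               # fraction of traffic reserved for exploration

for _ in range(100000):
    if random.random() < epsilon:
        arm = random.randrange(2)   # explore: pick a variant at random
    else:                           # exploit: pick the best current estimate
        est = [successes[i] / counts[i] if counts[i] else 0.0 for i in range(2)]
        arm = max(range(2), key=lambda i: est[i])
    counts[arm] += 1
    if random.random() < true_rates[arm]:
        successes[arm] += 1

# Over time, most traffic typically shifts to the better-performing variant,
# while each arm keeps receiving some exploration traffic
print(counts)
```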
Sequential testing

In traditional ("fixed horizon") testing, we can't keep looking at the data as it comes in and then quit when we're successful, because we will inflate our false positive rate.

Benjamini and Hochberg (1995): an approach to controlling the false discovery rate for sequential measurements. A likelihood ratio test that converges to the "true" false discovery rate over time. This is what the Optimizely stats engine is built on.
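The peeking problem is easy to demonstrate: run a null experiment, check a two-proportion z-test at repeated checkpoints, and stop at the first "significant" result. The realized false positive rate climbs well above the nominal 5%. (A quick illustrative simulation, not the sequential procedure itself; the checkpoint schedule is arbitrary.)

```python
import numpy as np

rng = np.random.RandomState(0)
n_trials = 2000
n_max = 5000
checkpoints = range(500, n_max + 1, 500)   # peek every 500 samples per arm
p = 0.5                                    # null is true: A and B are identical

false_positives = 0
for _ in range(n_trials):
    a = rng.rand(n_max) < p
    b = rng.rand(n_max) < p
    for k in checkpoints:
        # Two-proportion z-test on the first k samples of each arm
        p_hat = (a[:k].sum() + b[:k].sum()) / (2.0 * k)
        se = np.sqrt(2 * p_hat * (1 - p_hat) / k)
        if abs(a[:k].mean() - b[:k].mean()) / se > 1.96:   # nominal alpha = 0.05
            false_positives += 1
            break   # stop the "experiment" at the first significant peek

rate = false_positives / float(n_trials)
print(rate)   # substantially higher than the nominal 0.05
```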
Not actually testing

We don't always need to A/B test. Testing requires engineering and data science resources. The potential upside (e.g. in terms of saved future effort or mitigation of risk) has to outweigh the cost of developing, performing, and analyzing the test.