Estimating Causal Effect of Ads in a Real-Time-Bidding Platform
Prasad Chalasani (SVP Data Science, MediaMath)
Sep 24, 2016
Project Placebo
Or, How to Measure Causal Effect of Ads in an RTB Platform
Placebo Team (alphabetical):
I Ari Buchalter (President, Technology; co-founder)
I Prasad Chalasani
I Himanish Kushary
I Jason Lei
I Jonathan Marshall
I Michael Neiss
I Tristan Piron
I Sara Skrmetti
I Jawad Stouli
I Jaynth Thiagarajan
I Ezra Winston
I listen to ~ 100 Bln ad opportunities daily
I respond with optimal bids within milliseconds
I petabytes of data (ad impressions, visits, clicks, conversions)
Key Conceptual Take-aways
I Definition of causal effect
I Context: relationship to Machine Learning
I Causal effect in a Real-Time Bidding Platform
I Simplest approach is wasteful
I Less wasteful approach: bias (non-compliance)
I MediaMath’s solution
I Bayesian Methods for Ad Lift Confidence Bounds
I Gibbs Sampling (MCMC – Markov Chain Monte Carlo)
I Complications unique to our setting:
I Long-running experiments
I Multiple cookies per user
Ad impact measurement
I Advertisers want to know the impact of showing ads to people.
Measuring Ad Impact: Two Approaches
I Observational studies:
I Compare people who happen to be exposed vs not exposed
I Bias a big issue
I Randomized tests:
I Randomly assign people to test (exposed), control (un-exposed)
Causal Effect: the questions to ask
When a set of people U is exposed to ads,
I what is the avg response-rate R1 of the people in U?
I what would have been the response rate R0 of U, if they had not seen the ad?
I relative causal effect, or causal lift = R1/R0 − 1
Causal Effect: Notation
I “units” i = 1, 2, . . . , n (“users”, “user_context”, . . . )
I Yi = 1 if unit i responds (buys, subscribes, . . . ), else 0.
I Each unit i has 2 potential responses:
I Yi(0) = response when not exposed to an ad
I Yi(1) = response when exposed to an ad
I Wi = 1 if unit i exposed to ad, else 0.
I Observed response: Yi_obs = Yi(Wi)
I if Wi = 1, only Yi(1) is observed, Yi(0) is a counterfactual
I if Wi = 0, only Yi(0) is observed, Yi(1) is a counterfactual
I Xi = k-dimensional vector of features
I e.g. (dayOfWeek, age, location, web-domain)
I Unit level causal effect is impossible to measure:
τi = Yi(1) − Yi(0)
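The potential-outcomes bookkeeping above can be sketched in a few lines (a toy simulation; the response rates and the 50/50 exposure split are illustrative assumptions, not figures from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Hypothetical potential outcomes for n units (never both observed in reality).
y0 = rng.binomial(1, 0.02, n)   # Yi(0): response if NOT exposed
y1 = rng.binomial(1, 0.03, n)   # Yi(1): response if exposed
w = rng.binomial(1, 0.5, n)     # Wi: exposure indicator

# Observed response Yi_obs = Yi(Wi); the other potential outcome
# is the unobservable counterfactual.
y_obs = np.where(w == 1, y1, y0)
```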
Average Causal/Treatment Effects
Average Treatment Effect (ATE)
ATE = E[Yi(1)− Yi(0)]
Average Treatment Effect on the Treated (ATET)
ATET = E[Yi(1)− Yi(0) |Wi = 1]
Causal Lift (L) (this talk)
L = E[Yi(1) | Wi = 1] / E[Yi(0) | Wi = 1] − 1
Conditional Average Treatment Effect: (Athey/Imbens et al)
τ(x) = E[Yi(1)− Yi(0) | Xi = x ]
Conditional Response Rate (usual Machine Learning problem)
R(x) = E[Yi(1) | Xi = x ]
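In a simulation where both potential outcomes are known (impossible with real data), these estimands can be computed directly; the 1.0%/1.2% response rates below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Simulated potential outcomes: true lift is 0.012/0.010 - 1 = 20%.
y0 = rng.binomial(1, 0.010, n)
y1 = rng.binomial(1, 0.012, n)
w = rng.binomial(1, 0.5, n)     # random exposure, independent of outcomes

ate = (y1 - y0).mean()                              # E[Yi(1) - Yi(0)]
atet = (y1 - y0)[w == 1].mean()                     # E[Yi(1) - Yi(0) | Wi = 1]
lift = y1[w == 1].mean() / y0[w == 1].mean() - 1    # causal lift L
```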
Causal Effect Illustration
Causal Effect Illustration: Counterfactuals
Causal Effect with Counterfactuals
Counterfactuals are unobservable!
Instead of comparing:
I Resp-rate of exposed users U vs
I Counterfactual un-exposed response-rate of same users U,
We compare:
I Resp-rate of exposed users U vs
I Resp-rate of un-exposed users statistically equivalent to U.
=⇒ using randomization
Ideal Randomized Test:
Randomize after winning bid
Ideal Randomized Test
Ideal Randomized Test: Ad lift
But is this practical?
Ideal Randomized Test: Wasted spend
MediaMath’s approach:
Randomize before bidding
A Less Wasteful Randomized Test
Compare RC vs RT?
Compare RC vs RTW?
Compare RC vs RTW? Win-bias
Ad Lift: Proper Definition
Estimating the Counterfactual RCW
Ad Lift Estimation
Main steps:
I observe response rates RC, RTW, RTL
I observe test win-rate w
I estimate the control counterfactual winner response-rate
RCW = (RC − (1 − w) RTL) / w
I compute lift L = RTW/RCW − 1
I similar to Treatment Effect Under Non-compliance in clinical trials.
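The steps above can be sketched as a small function (a hypothetical helper, not code from the talk; the algebra just inverts the mixture RC = w·RCW + (1 − w)·RTL):

```python
def estimate_lift(r_c, r_tw, r_tl, w):
    """Estimate ad lift from observed rates.

    r_c  : control response rate (would-be winners and losers pooled)
    r_tw : test-winner response rate
    r_tl : test-loser response rate
    w    : test win-rate
    """
    # Control pools would-be winners and losers:
    #   r_c = w * r_cw + (1 - w) * r_tl   =>   solve for r_cw
    r_cw = (r_c - (1 - w) * r_tl) / w
    return r_tw / r_cw - 1
```

For example, with a 50% win-rate, RC = 1.0% and RTL = 0.9% imply a counterfactual winner rate RCW = 1.1%, so RTW = 1.32% corresponds to 20% lift.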
Ad Lift Estimation
How to compute the 90% confidence interval for L?
Ad Lift: Confidence Intervals with Gibbs sampler
Bayesian approach
I Assume a random parameter vector θ consisting of:
I (RTW, RL, RCW, w, ...)
I Set up prior distribution θ ∼ p(θ)
I Sample M values of unknown θ from posterior (Gibbs Sampler):
P(θ | Data) ∝ P(Data | θ) · p(θ)
I For each sampled θ compute lift L = RTW/RCW − 1
I Compute (0.05, 0.95) quantiles of sampled L values
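A simplified sketch of the quantile step (assuming, for illustration, that winner conversion counts were observed directly so independent Beta posteriors suffice; the full Gibbs sampler also handles the latent control-winner split, and all counts below are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
M = 20_000  # posterior draws

# Hypothetical observed counts: (conversions, trials)
tw_k, tw_n = 260, 20_000    # test winners
cw_k, cw_n = 220, 20_000    # control would-be winners (latent in practice)

# Beta(1,1) priors => Beta(k+1, n-k+1) posteriors
r_tw = rng.beta(tw_k + 1, tw_n - tw_k + 1, M)
r_cw = rng.beta(cw_k + 1, cw_n - cw_k + 1, M)
lift = r_tw / r_cw - 1

lo, hi = np.quantile(lift, [0.05, 0.95])  # 90% credible interval for L
```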
Ad Lift: Gibbs Sampling
Ad Lift Gibbs Sampling: Random variables
Probabilities: w, RTW, RCW, RL
Counts: CW0, CW1, CL0, CL1
Beta(1, 1) priors on probabilities, e.g.:
w ∼ Beta(1, 1) = Uniform(0, 1), . . .
Ad Lift Gibbs Sampling: Posterior Probabilities
Likelihood of observed
I k = CL1 + TL1 conversions out of
I n = CL1 + TL1 + CL0 + TL0 trials,
I given loser response-rate RL:
Binom(k, n; RL) ∝ RL^k (1 − RL)^(n−k),
so posterior of RL:
P(RL | k, n) ∝ P(k, n | RL) · p(RL)
∝ RL^k (1 − RL)^(n−k) · 1
∝ Beta(k + 1, n − k + 1)
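The conjugate update can be checked numerically (counts below are illustrative; with a Beta(1,1) prior the posterior mean is (k + 1)/(n + 2)):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical pooled loser counts
k = 90          # conversions among losers (CL1 + TL1)
n = 10_000      # loser trials (CL1 + TL1 + CL0 + TL0)

# Beta(1,1) prior x Binomial(k, n) likelihood => Beta(k+1, n-k+1) posterior
post = rng.beta(k + 1, n - k + 1, size=50_000)

post_mean = post.mean()     # should be close to (k + 1) / (n + 2)
```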
Ad Lift Gibbs Sampling: Posterior Counts
We observe C1 = CL1 + CW 1 (total control conversions).
Need to sample CL1,CW 1
CW 1 is a Binomial draw from n = C1, with probability:
P(ctl winner | ctl conversion) = w · RCW / (w · RCW + (1 − w) · RL)
CL1 = C1− CW 1
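One conditional draw for these latent counts might look like this within the sampler (the parameter values stand in for the current sampled θ and are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)

# Current sampled parameter values (hypothetical placeholders)
w, r_cw, r_l = 0.4, 0.012, 0.008
c1 = 500                    # observed total control conversions (CL1 + CW1)

# P(would-be winner | control conversion), from Bayes' rule on the mixture
p_win = w * r_cw / (w * r_cw + (1 - w) * r_l)

cw1 = rng.binomial(c1, p_win)   # latent control-winner conversions
cl1 = c1 - cw1                  # latent control-loser conversions
```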
Complication 1: We only observe cookies, not users;
A user’s cookies may be in both test and control (Contamination)
Control Contamination due to Multiple Cookies
Cookie-Contamination Questions
I How does cookie contamination affect measured lift?
I Does the cookie-distribution matter?
I everyone has k cookies vs an average of k cookies
I What is the influence of the control percentage?
I Simulations are the best way to understand this
I Monte Carlo simulations using Spark
Simulations for cookie-contamination
I A scenario is a combination of parameters:
I M = # trials for this scenario, usually 10K-1M
I n = # users, typically 10K - 10M
I p = control percentage (usually 10-50%)
I k = cookie-distribution, expressed as 1 : 100, or 1 : 70, 3 : 30
I r = (un-contaminated) control user response rate
I a = true lift, i.e. exposed user response rate = r ∗ (1 + a)
I A scenario file specifies a scenario in each row
I could be thousands of scenarios
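A minimal sketch of one such simulation trial (not the production Spark code; it makes the strong simplifying assumptions, noted in the comments, that every user has exactly k cookies and that a user's response is attributed to every one of their cookies, and all parameter values are illustrative):

```python
import numpy as np

def measured_lift(n_users, p_ctl, k, r, a, rng):
    """One trial of cookie-level measured lift under contamination.

    Simplifications (assumptions for this sketch): every user has exactly
    k cookies, each randomized to control independently with prob p_ctl,
    and a responding user's conversion is credited to all their cookies.
    """
    ctl = rng.random((n_users, k)) < p_ctl   # cookie-level assignment
    exposed = ~ctl.all(axis=1)               # any test cookie => user sees ads
    y = rng.random(n_users) < np.where(exposed, r * (1 + a), r)

    yk = np.broadcast_to(y[:, None], ctl.shape)  # user response on each cookie
    test_rate = yk[~ctl].mean()
    ctl_rate = yk[ctl].mean()                # inflated by contaminated cookies
    return test_rate / ctl_rate - 1

rng = np.random.default_rng(0)
est = np.mean([measured_lift(500_000, 0.3, 2, 0.01, 0.5, rng)
               for _ in range(5)])
# Contamination attenuates the measured lift well below the true a = 0.5.
```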
Complication 2:
Long-running experiments
Long-Running Experiments
Ideal randomized test is instantaneous.
When a test is run for weeks/months,
I A test user may sometimes be a winner, sometimes a loser.
I How to define who is a “winner” and “loser”?
I Crucial because lift L = RTW/RCW − 1.
Our approach (details omitted):
I Ad influence period is limited
I “refresh” a user after a suitable time-period elapses.
I Count “user time-spans” rather than “users”
I Identify “experiments” within user’s time-line
MediaMath’s Placebo App
I Currently in production for ∼ 10 advertisers
I Advertisers can specify which campaigns to measure
I Lift estimation, Gibbs Sampling runs on AWS using Spark
I Multiple runs of Gibbs Sampler in parallel (with different priors)