Estimating Causal Effect of Ads in a Real-Time Bidding Platform

Posted on 16-Mar-2018



Prasad Chalasani (SVP Data Science, MediaMath)

Sep 24, 2016

Project Placebo

Or, How to Measure Causal Effect of Ads in an RTB Platform

Placebo Team (alphabetical):

- Ari Buchalter (President, Technology; co-founder)
- Prasad Chalasani
- Himanish Kushary
- Jason Lei
- Jonathan Marshall
- Michael Neiss
- Tristan Piron
- Sara Skrmetti
- Jawad Stouli
- Jaynth Thiagarajan
- Ezra Winston

The MediaMath RTB platform:

- listens to ~100 Bln ad opportunities daily
- responds with optimal bids within milliseconds
- handles petabytes of data (ad impressions, visits, clicks, conversions)


Key Conceptual Take-aways

- Definition of causal effect
- Context: relationship to Machine Learning
- Causal effect in a Real-Time Bidding Platform
  - Simplest approach is wasteful
  - Less wasteful approach: bias (non-compliance)
  - MediaMath's solution
- Bayesian Methods for Ad Lift Confidence Bounds
  - Gibbs Sampling (MCMC, Markov Chain Monte Carlo)
- Complications unique to our setting:
  - Long-running experiments
  - Multiple cookies per user


Ad impact measurement

- Advertisers want to know the impact of showing ads to people.

Measuring Ad Impact: Two Approaches

- Observational studies:
  - Compare people who happen to be exposed vs not exposed
  - Bias is a big issue
- Randomized tests:
  - Randomly assign people to test (exposed) and control (un-exposed)


Causal Effect: the questions to ask

When a set of people U is exposed to ads:

- what is the average response rate R1 of the people in U?
- what would the response rate R0 of U have been, if they had not seen the ad?
- relative causal effect, or causal lift = R1/R0 − 1


Causal Effect: Notation

- "units" i = 1, 2, ..., n ("users", "user_context", ...)
- Yi = 1 if unit i responds (buys, subscribes, ...), else 0
- Each unit i has 2 potential responses:
  - Yi(0) = response when not exposed to an ad
  - Yi(1) = response when exposed to an ad
- Wi = 1 if unit i exposed to ad, else 0
- Observed response: Yi_obs = Yi(Wi)
  - if Wi = 1, only Yi(1) is observed; Yi(0) is a counterfactual
  - if Wi = 0, only Yi(0) is observed; Yi(1) is a counterfactual
- Xi = k-dimensional vector of features
  - e.g. (dayOfWeek, age, location, web-domain)
- Unit-level causal effect is impossible to measure:

  τi = Yi(1) − Yi(0)
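The potential-outcomes notation can be made concrete with a few lines of simulation; the exposure probability and response rates below are arbitrary illustrative numbers, not figures from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

y0 = rng.binomial(1, 0.05, size=n)   # Y_i(0): response if NOT exposed
y1 = rng.binomial(1, 0.08, size=n)   # Y_i(1): response if exposed
w = rng.binomial(1, 0.5, size=n)     # W_i: exposure indicator

# Only Y_i(W_i) is ever observed; the other potential outcome
# is a counterfactual.
y_obs = np.where(w == 1, y1, y0)

# The unit-level effect tau_i = Y_i(1) - Y_i(0) is well defined
# but unobservable, since one of its two terms is always missing.
tau = y1 - y0
```

In a simulation both potential outcomes are visible, which is exactly what real data never gives us.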


Average Causal/Treatment Effects

Average Treatment Effect (ATE)

  ATE = E[Yi(1) − Yi(0)]

Average Treatment Effect on the Treated (ATET)

  ATET = E[Yi(1) − Yi(0) | Wi = 1]

Causal Lift (L) (this talk)

  L = E[Yi(1) | Wi = 1] / E[Yi(0) | Wi = 1] − 1

Conditional Average Treatment Effect (Athey/Imbens et al.)

  τ(x) = E[Yi(1) − Yi(0) | Xi = x]

Conditional Response Rate (the usual Machine Learning problem)

  R(x) = E[Yi(1) | Xi = x]
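These estimands are easy to state in code on simulated units, where, unlike in production, both potential outcomes are visible. A sketch with assumed response rates (1.0% unexposed, 1.2% exposed; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Simulated potential outcomes (assumed rates, for illustration only).
y0 = rng.binomial(1, 0.010, size=n)   # Y_i(0)
y1 = rng.binomial(1, 0.012, size=n)   # Y_i(1)
w = rng.binomial(1, 0.5, size=n)      # W_i

ate = (y1 - y0).mean()                            # E[Y(1) - Y(0)]
atet = (y1 - y0)[w == 1].mean()                   # E[Y(1) - Y(0) | W = 1]
lift = y1[w == 1].mean() / y0[w == 1].mean() - 1  # causal lift L
```

With real data only Yi_obs exists, so these quantities must be estimated via randomization, which the rest of the talk develops.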


Causal Effect Illustration

(figure slides)

Causal Effect Illustration: Counterfactuals

(figure slides)

Causal Effect with Counterfactuals

Counterfactuals are unobservable!

Instead of comparing:

- response rate of exposed users U, vs
- the counterfactual un-exposed response rate of the same users U,

we compare:

- response rate of exposed users U, vs
- response rate of un-exposed users statistically equivalent to U

⟹ via randomization


Ideal Randomized Test: Randomize after winning bid

(figure slides)

Ideal Randomized Test: Ad lift

(figure slides)

But is this practical?

Ideal Randomized Test: Wasted spend

(figure slides)

MediaMath's approach: Randomize before bidding

A Less Wasteful Randomized Test

(figure slides)

Compare RC vs RT?

(figure slides)

Compare RC vs RTW? Win-bias

(figure slides)

Ad Lift: Proper Definition

(figure slides)

Estimating the Counterfactual RCW

(figure slides)

Ad Lift Estimation

Main steps:

- observe response rates RC, RTW, RTL
- observe test win-rate w
- estimate the control counterfactual winner response rate:

  RCW = (RC − (1 − w) · RTL) / w

- compute lift L = RTW/RCW − 1
- similar to Treatment Effect Under Non-compliance in clinical trials
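The steps above reduce to a small function of the observed quantities; a minimal sketch:

```python
def estimate_lift(r_c, r_tw, r_tl, w):
    """Estimate ad lift from observed rates.

    r_c  : control-group response rate R_C
    r_tw : response rate of test winners R_TW
    r_tl : response rate of test losers R_TL
    w    : test win-rate (fraction of test bids that won)
    """
    # R_C mixes would-be winners (weight w) and losers (weight 1 - w),
    # so the counterfactual winner rate is R_CW = (R_C - (1-w)*R_TL) / w.
    r_cw = (r_c - (1 - w) * r_tl) / w
    return r_tw / r_cw - 1
```

RCW is a derived counterfactual quantity: the estimator rests on randomization making control "would-be winners" statistically equivalent to test winners.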


Ad Lift Estimation

How to compute the 90% confidence interval for L?

Ad Lift: Confidence Intervals with Gibbs sampler

Bayesian approach:

- Assume a random parameter vector θ consisting of (RTW, RL, RCW, w, ...)
- Set up a prior distribution on θ: θ ∼ p(θ)
- Sample M values of the unknown θ from the posterior (via a Gibbs Sampler):

  P(θ | Data) ∝ P(Data | θ) · p(θ)

- For each sampled θ, compute lift L = RTW/RCW − 1
- Compute the (0.05, 0.95) quantiles of the sampled L values


Ad Lift: Gibbs Sampling

(figure slides)

Ad Lift Gibbs Sampling: Random variables

Probabilities: w, RTW, RCW, RL

Counts: CW0, CW1, CL0, CL1

Beta(1, 1) priors on probabilities, e.g.:

  w ∼ Beta(1, 1) ≡ Uniform(0, 1), ...


Ad Lift Gibbs Sampling: Posterior Probabilities

Likelihood of the observed data:

- k = CL1 + TL1 conversions out of
- n = CL1 + TL1 + CL0 + TL0 trials,
- given loser response-rate RL:

  Binom(k, n; RL) ∝ RL^k (1 − RL)^(n−k),

so the posterior of RL is

  P(RL | k, n) ∝ P(k, n | RL) · p(RL)
              ∝ RL^k (1 − RL)^(n−k) · 1      [the Beta(1, 1) density is constant]
              ∝ Beta(k + 1, n − k + 1)
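Because the Beta prior is conjugate to the Binomial likelihood, this posterior can be sampled directly rather than via a generic MCMC step. A sketch with made-up counts (k and n are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical observed loser outcomes: k conversions in n trials.
k, n = 37, 12_000

# With a Beta(1, 1) (uniform) prior, the posterior of the loser
# response-rate R_L is Beta(k + 1, n - k + 1); sample it directly.
rl_samples = rng.beta(k + 1, n - k + 1, size=10_000)
```

The same conjugate update is used for each of the rate parameters, conditional on the current values of the latent counts.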


Ad Lift Gibbs Sampling: Posterior Counts

We observe C1 = CL1 + CW1 (total control conversions), but need to sample the latent split into CL1 and CW1.

CW1 is a Binomial draw of size n = C1, with success probability

  P(ctl winner | ctl conversion) = w · RCW / (w · RCW + (1 − w) · RL),

and then CL1 = C1 − CW1.
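A full sampler alternates between this latent-count draw and the conjugate Beta draws for the rates. The sketch below uses made-up counts (TW, TL, C and the conversion totals are illustrative, not from the talk) and simplifies the latent control winner/loser trial split to a single Binomial draw:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical observed data (illustrative counts, not from the talk):
TW, TW1 = 40_000, 520      # test winners: trials, conversions
TL, TL1 = 60_000, 180      # test losers: trials, conversions
C, C1 = 100_000, 400       # control: trials, total conversions (split latent)

M, burn = 4_000, 500
lifts = np.empty(M)

# Reasonable starting values for the unknowns.
w, r_cw, r_l = TW / (TW + TL), 0.01, 0.003
for m in range(M):
    # 1. Split control conversions into would-be winners vs losers.
    p_win = w * r_cw / (w * r_cw + (1 - w) * r_l)
    CW1 = rng.binomial(C1, p_win)
    CL1 = C1 - CW1
    # Latent control trial split (simplification: one Binomial draw).
    CW = rng.binomial(C, w)
    CL = C - CW
    # 2. Conjugate Beta posteriors under Beta(1, 1) priors.
    r_tw = rng.beta(TW1 + 1, TW - TW1 + 1)
    r_l = rng.beta(TL1 + CL1 + 1, (TL - TL1) + (CL - CL1) + 1)
    r_cw = rng.beta(CW1 + 1, CW - CW1 + 1)
    w = rng.beta(TW + 1, TL + 1)
    lifts[m] = r_tw / r_cw - 1

# 90% credible interval for the lift, after discarding burn-in.
lo, hi = np.quantile(lifts[burn:], [0.05, 0.95])
```

The (lo, hi) pair is the kind of 90% interval for L that the talk describes; the exact conditional structure of MediaMath's production sampler may differ.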


Complication 1: We only observe cookies, not users; a user's cookies may be in both test and control (Contamination)

Control Contamination due to Multiple Cookies

(figure slides)

Cookie-Contamination Questions

- How does cookie contamination affect measured lift?
- Does the cookie distribution matter?
  - everyone has exactly k cookies vs an average of k cookies
- What is the influence of the control percentage?
- Simulations are the best way to understand this:
  - Monte Carlo simulations using Spark


Simulations for cookie-contamination

- A scenario is a combination of parameters:
  - M = # trials for this scenario, usually 10K-1M
  - n = # users, typically 10K-10M
  - p = control percentage (usually 10-50%)
  - k = cookie distribution, expressed as 1:100, or 1:70, 3:30
  - r = (un-contaminated) control user response rate
  - a = true lift, i.e. exposed user response rate = r · (1 + a)
- A scenario file specifies a scenario in each row
  - could be thousands of scenarios
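A single trial of such a simulation can be sketched as follows. This is a simplified stand-in for the Spark pipeline: every user gets exactly k cookies, each cookie is independently assigned to control with probability p, true exposure is user-wise (any test cookie means the user saw ads), while measurement is cookie-wise:

```python
import numpy as np

rng = np.random.default_rng(4)

def measured_lift(n_users, k, p, base_rate, true_lift):
    """One Monte Carlo trial of cookie-level randomization."""
    # Each cookie independently lands in control with probability p.
    in_control = rng.random((n_users, k)) < p
    # A user is truly exposed if ANY cookie is in the test group.
    exposed = ~in_control.all(axis=1)

    # True user-level response, driven by actual exposure.
    rate = np.where(exposed, base_rate * (1 + true_lift), base_rate)
    resp = rng.random(n_users) < rate

    # Measured test rate: response rate among exposed users.
    r_test = resp[exposed].mean()
    # Measured control rate is cookie-wise: control cookies belonging to
    # exposed users drag it upward (contamination).
    r_ctl = (resp[:, None] & in_control).sum() / in_control.sum()
    return r_test / r_ctl - 1

est = measured_lift(200_000, 2, 0.5, 0.01, 0.5)
```

With 2 cookies per user and a 50% control split, a true lift of 0.5 is measured well below 0.5 (close to 0.2 in this setup): contamination biases the estimate toward zero, which is why the cookie distribution and control percentage matter.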

Complication 2:

Long-running experiments

Long-Running Experiments

The ideal randomized test is instantaneous. When a test runs for weeks or months:

- A test user may sometimes be a winner, sometimes a loser.
- How do we define who is a "winner" and who is a "loser"?
- Crucial, because lift L = RTW/RCW − 1.

Our approach (details omitted):

- An ad's influence period is limited
- "refresh" a user after a suitable time period elapses
- Count "user time-spans" rather than "users"
- Identify "experiments" within a user's time-line


MediaMath’s Placebo App

- Currently in production for ~10 advertisers
- Advertisers can specify which campaigns to measure
- Lift estimation and Gibbs Sampling run on AWS using Spark
- Multiple runs of the Gibbs Sampler in parallel (with different priors)

Thank you!

pchalasani@mediamath.com