Estimating Causal Effect of Ads in a Real-Time Bidding Platform

Posted on 16-Mar-2018



Prasad Chalasani (SVP Data Science, MediaMath)

Sep 24, 2016

Project Placebo

Or, How to Measure Causal Effect of Ads in an RTB Platform

Placebo Team (alphabetical):

- Ari Buchalter (President, Technology; co-founder)
- Prasad Chalasani
- Himanish Kushary
- Jason Lei
- Jonathan Marshall
- Michael Neiss
- Tristan Piron
- Sara Skrmetti
- Jawad Stouli
- Jaynth Thiagarajan
- Ezra Winston

The MediaMath RTB platform:

- listens to ~100 Bln ad opportunities daily
- responds with optimal bids within milliseconds
- handles petabytes of data (ad impressions, visits, clicks, conversions)


Key Conceptual Take-aways

- Definition of causal effect
- Context: relationship to Machine Learning
- Causal effect in a Real-Time Bidding Platform
  - Simplest approach is wasteful
  - Less wasteful approach: bias (non-compliance)
  - MediaMath's solution
- Bayesian Methods for Ad Lift Confidence Bounds
  - Gibbs Sampling (MCMC, Markov Chain Monte Carlo)
- Complications unique to our setting:
  - Long-running experiments
  - Multiple cookies per user


Ad impact measurement

- Advertisers want to know the impact of showing ads to people.

Measuring Ad Impact: Two Approaches

- Observational studies:
  - Compare people who happen to be exposed vs not exposed
  - Bias is a big issue
- Randomized tests:
  - Randomly assign people to test (exposed) and control (un-exposed)


Causal Effect: the questions to ask

When a set of people U is exposed to ads:

- what is the average response rate R1 of the people in U?
- what would the response rate R0 of U have been, if they had not seen the ad?
- relative causal effect, or causal lift = R1/R0 − 1


Causal Effect: Notation

- "units" i = 1, 2, ..., n ("users", "user_context", ...)
- Yi = 1 if unit i responds (buys, subscribes, ...), else 0
- Each unit i has 2 potential responses:
  - Yi(0) = response when not exposed to an ad
  - Yi(1) = response when exposed to an ad
- Wi = 1 if unit i exposed to ad, else 0
- Observed response: Yi_obs = Yi(Wi)
  - if Wi = 1, only Yi(1) is observed; Yi(0) is a counterfactual
  - if Wi = 0, only Yi(0) is observed; Yi(1) is a counterfactual
- Xi = k-dimensional vector of features
  - e.g. (dayOfWeek, age, location, web-domain)
- Unit-level causal effect is impossible to measure:

  τi = Yi(1) − Yi(0)
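The potential-outcomes notation can be made concrete with a few lines of simulation; the exposure probability and response rates below are arbitrary illustrative numbers, not figures from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

y0 = rng.binomial(1, 0.05, size=n)   # Y_i(0): response if NOT exposed
y1 = rng.binomial(1, 0.08, size=n)   # Y_i(1): response if exposed
w = rng.binomial(1, 0.5, size=n)     # W_i: exposure indicator

# Only Y_i(W_i) is ever observed; the other potential outcome
# is a counterfactual.
y_obs = np.where(w == 1, y1, y0)

# The unit-level effect tau_i = Y_i(1) - Y_i(0) is well defined
# but unobservable, since one of its two terms is always missing.
tau = y1 - y0
```

In a simulation both potential outcomes are visible, which is exactly what real data never gives us.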


Average Causal/Treatment Effects

Average Treatment Effect (ATE)

  ATE = E[Yi(1) − Yi(0)]

Average Treatment Effect on the Treated (ATET)

  ATET = E[Yi(1) − Yi(0) | Wi = 1]

Causal Lift (L) (this talk)

  L = E[Yi(1) | Wi = 1] / E[Yi(0) | Wi = 1] − 1

Conditional Average Treatment Effect (Athey/Imbens et al.)

  τ(x) = E[Yi(1) − Yi(0) | Xi = x]

Conditional Response Rate (the usual Machine Learning problem)

  R(x) = E[Yi(1) | Xi = x]
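These estimands are easy to state in code on simulated units, where, unlike in production, both potential outcomes are visible. A sketch with assumed response rates (1.0% unexposed, 1.2% exposed; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Simulated potential outcomes (assumed rates, for illustration only).
y0 = rng.binomial(1, 0.010, size=n)   # Y_i(0)
y1 = rng.binomial(1, 0.012, size=n)   # Y_i(1)
w = rng.binomial(1, 0.5, size=n)      # W_i

ate = (y1 - y0).mean()                            # E[Y(1) - Y(0)]
atet = (y1 - y0)[w == 1].mean()                   # E[Y(1) - Y(0) | W = 1]
lift = y1[w == 1].mean() / y0[w == 1].mean() - 1  # causal lift L
```

With real data only Yi_obs exists, so these quantities must be estimated via randomization, which the rest of the talk develops.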


Causal Effect Illustration

(figure slides)

Causal Effect Illustration: Counterfactuals

(figure slides)

Causal Effect with Counterfactuals

Counterfactuals are unobservable!

Instead of comparing:

- response rate of exposed users U, vs
- the counterfactual un-exposed response rate of the same users U,

we compare:

- response rate of exposed users U, vs
- response rate of un-exposed users statistically equivalent to U

⟹ via randomization


Ideal Randomized Test: Randomize after winning bid

(figure slides)

Ideal Randomized Test: Ad lift

(figure slides)

But is this practical?

Ideal Randomized Test: Wasted spend

(figure slides)

MediaMath's approach: Randomize before bidding

A Less Wasteful Randomized Test

(figure slides)

Compare RC vs RT?

(figure slides)

Compare RC vs RTW? Win-bias

(figure slides)

Ad Lift: Proper Definition

(figure slides)

Estimating the Counterfactual RCW

(figure slides)

Ad Lift Estimation

Main steps:

- observe response rates RC, RTW, RTL
- observe test win-rate w
- estimate the control counterfactual winner response rate:

  RCW = (RC − (1 − w) · RTL) / w

- compute lift L = RTW/RCW − 1
- similar to Treatment Effect Under Non-compliance in clinical trials
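The steps above reduce to a small function of the observed quantities; a minimal sketch:

```python
def estimate_lift(r_c, r_tw, r_tl, w):
    """Estimate ad lift from observed rates.

    r_c  : control-group response rate R_C
    r_tw : response rate of test winners R_TW
    r_tl : response rate of test losers R_TL
    w    : test win-rate (fraction of test bids that won)
    """
    # R_C mixes would-be winners (weight w) and losers (weight 1 - w),
    # so the counterfactual winner rate is R_CW = (R_C - (1-w)*R_TL) / w.
    r_cw = (r_c - (1 - w) * r_tl) / w
    return r_tw / r_cw - 1
```

RCW is a derived counterfactual quantity: the estimator rests on randomization making control "would-be winners" statistically equivalent to test winners.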


Ad Lift Estimation

How to compute the 90% confidence interval for L?

Ad Lift: Confidence Intervals with Gibbs sampler

Bayesian approach:

- Assume a random parameter vector θ consisting of (RTW, RL, RCW, w, ...)
- Set up a prior distribution on θ: θ ∼ p(θ)
- Sample M values of the unknown θ from the posterior (via a Gibbs Sampler):

  P(θ | Data) ∝ P(Data | θ) · p(θ)

- For each sampled θ, compute lift L = RTW/RCW − 1
- Compute the (0.05, 0.95) quantiles of the sampled L values


Ad Lift: Gibbs Sampling

(figure slides)

Ad Lift Gibbs Sampling: Random variables

Probabilities: w, RTW, RCW, RL

Counts: CW0, CW1, CL0, CL1

Beta(1, 1) priors on probabilities, e.g.:

  w ∼ Beta(1, 1) ≡ Uniform(0, 1), ...


Ad Lift Gibbs Sampling: Posterior Probabilities

Likelihood of the observed data:

- k = CL1 + TL1 conversions out of
- n = CL1 + TL1 + CL0 + TL0 trials,
- given loser response-rate RL:

  Binom(k, n; RL) ∝ RL^k (1 − RL)^(n−k),

so the posterior of RL is

  P(RL | k, n) ∝ P(k, n | RL) · p(RL)
              ∝ RL^k (1 − RL)^(n−k) · 1      [the Beta(1, 1) density is constant]
              ∝ Beta(k + 1, n − k + 1)
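Because the Beta prior is conjugate to the Binomial likelihood, this posterior can be sampled directly rather than via a generic MCMC step. A sketch with made-up counts (k and n are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical observed loser outcomes: k conversions in n trials.
k, n = 37, 12_000

# With a Beta(1, 1) (uniform) prior, the posterior of the loser
# response-rate R_L is Beta(k + 1, n - k + 1); sample it directly.
rl_samples = rng.beta(k + 1, n - k + 1, size=10_000)
```

The same conjugate update is used for each of the rate parameters, conditional on the current values of the latent counts.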


Ad Lift Gibbs Sampling: Posterior Counts

We observe C1 = CL1 + CW1 (total control conversions), but need to sample the latent split into CL1 and CW1.

CW1 is a Binomial draw of size n = C1, with success probability

  P(ctl winner | ctl conversion) = w · RCW / (w · RCW + (1 − w) · RL),

and then CL1 = C1 − CW1.
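A full sampler alternates between this latent-count draw and the conjugate Beta draws for the rates. The sketch below uses made-up counts (TW, TL, C and the conversion totals are illustrative, not from the talk) and simplifies the latent control winner/loser trial split to a single Binomial draw:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical observed data (illustrative counts, not from the talk):
TW, TW1 = 40_000, 520      # test winners: trials, conversions
TL, TL1 = 60_000, 180      # test losers: trials, conversions
C, C1 = 100_000, 400       # control: trials, total conversions (split latent)

M, burn = 4_000, 500
lifts = np.empty(M)

# Reasonable starting values for the unknowns.
w, r_cw, r_l = TW / (TW + TL), 0.01, 0.003
for m in range(M):
    # 1. Split control conversions into would-be winners vs losers.
    p_win = w * r_cw / (w * r_cw + (1 - w) * r_l)
    CW1 = rng.binomial(C1, p_win)
    CL1 = C1 - CW1
    # Latent control trial split (simplification: one Binomial draw).
    CW = rng.binomial(C, w)
    CL = C - CW
    # 2. Conjugate Beta posteriors under Beta(1, 1) priors.
    r_tw = rng.beta(TW1 + 1, TW - TW1 + 1)
    r_l = rng.beta(TL1 + CL1 + 1, (TL - TL1) + (CL - CL1) + 1)
    r_cw = rng.beta(CW1 + 1, CW - CW1 + 1)
    w = rng.beta(TW + 1, TL + 1)
    lifts[m] = r_tw / r_cw - 1

# 90% credible interval for the lift, after discarding burn-in.
lo, hi = np.quantile(lifts[burn:], [0.05, 0.95])
```

The (lo, hi) pair is the kind of 90% interval for L that the talk describes; the exact conditional structure of MediaMath's production sampler may differ.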


Complication 1: We only observe cookies, not users; a user's cookies may be in both test and control (Contamination)

Control Contamination due to Multiple Cookies

(figure slides)

Cookie-Contamination Questions

- How does cookie contamination affect measured lift?
- Does the cookie distribution matter?
  - everyone has exactly k cookies vs an average of k cookies
- What is the influence of the control percentage?
- Simulations are the best way to understand this:
  - Monte Carlo simulations using Spark


Simulations for cookie-contamination

- A scenario is a combination of parameters:
  - M = # trials for this scenario, usually 10K-1M
  - n = # users, typically 10K-10M
  - p = control percentage (usually 10-50%)
  - k = cookie distribution, expressed as 1:100, or 1:70, 3:30
  - r = (un-contaminated) control user response rate
  - a = true lift, i.e. exposed user response rate = r · (1 + a)
- A scenario file specifies a scenario in each row
  - could be thousands of scenarios
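A single trial of such a simulation can be sketched as follows. This is a simplified stand-in for the Spark pipeline: every user gets exactly k cookies, each cookie is independently assigned to control with probability p, true exposure is user-wise (any test cookie means the user saw ads), while measurement is cookie-wise:

```python
import numpy as np

rng = np.random.default_rng(4)

def measured_lift(n_users, k, p, base_rate, true_lift):
    """One Monte Carlo trial of cookie-level randomization."""
    # Each cookie independently lands in control with probability p.
    in_control = rng.random((n_users, k)) < p
    # A user is truly exposed if ANY cookie is in the test group.
    exposed = ~in_control.all(axis=1)

    # True user-level response, driven by actual exposure.
    rate = np.where(exposed, base_rate * (1 + true_lift), base_rate)
    resp = rng.random(n_users) < rate

    # Measured test rate: response rate among exposed users.
    r_test = resp[exposed].mean()
    # Measured control rate is cookie-wise: control cookies belonging to
    # exposed users drag it upward (contamination).
    r_ctl = (resp[:, None] & in_control).sum() / in_control.sum()
    return r_test / r_ctl - 1

est = measured_lift(200_000, 2, 0.5, 0.01, 0.5)
```

With 2 cookies per user and a 50% control split, a true lift of 0.5 is measured well below 0.5 (close to 0.2 in this setup): contamination biases the estimate toward zero, which is why the cookie distribution and control percentage matter.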

Complication 2:

Long-running experiments

Long-Running Experiments

The ideal randomized test is instantaneous. When a test runs for weeks or months:

- A test user may sometimes be a winner, sometimes a loser.
- How do we define who is a "winner" and who is a "loser"?
- Crucial, because lift L = RTW/RCW − 1.

Our approach (details omitted):

- An ad's influence period is limited
- "refresh" a user after a suitable time period elapses
- Count "user time-spans" rather than "users"
- Identify "experiments" within a user's time-line


MediaMath’s Placebo App

- Currently in production for ~10 advertisers
- Advertisers can specify which campaigns to measure
- Lift estimation and Gibbs Sampling run on AWS using Spark
- Multiple runs of the Gibbs Sampler in parallel (with different priors)

Thank you!

pchalasani@mediamath.com