Page 1

Multistage Decision Policies via Q-learning: Bias Correction and Confidence Intervals

Bibhas Chakraborty
Department of Statistics, University of Michigan

Advisor: Prof. Susan A. Murphy

Bell Labs, Murray Hill, New Jersey

January 9, 2009


Page 2

Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 3


One-stage Decision Problem

Data on (O,A,R): observation, action, reward.

Example 1: Medical Decision Support System

O : Patient’s pre-treatment variables (e.g., age, gender, ...)

A : Treatment

R : Outcome (e.g., patient’s survival time after treatment)

Example 2: Marketing Recommender System

O : Customer behavior (e.g., past purchases, ...)

A : Product recommendation (e.g., advertisements)

R : Profit

Goal: To find (learn) a policy d : O → A to optimize E[R | O, A].

Page 4


Multistage Medical Decision Problem

Diseases like depression, schizophrenia, drug and alcohol dependence, HIV infection, ... are treated in multiple stages.

– At each stage, treatment is adapted to the available information on patient characteristics and past treatments.

Information collected over multiple stages on a single patient forms a longitudinal trajectory: $(O_1, A_1, R_1, O_2, A_2, R_2, \ldots)$.

Goal: To learn a “good” treatment policy $d \equiv (d_1, d_2, \ldots)$ from a training data set of such trajectories.


Page 5


Multistage Medical Decision Problem

The learned policy d can be employed to determine optimal treatments (actions) in the future.

As policies, here we will consider parametric functions only, e.g., $d_t(o_t) = \mathrm{sign}(\psi^T o_t)$ for treatments coded −1/1.

– Thus learning a policy essentially means estimating the “policy parameters” ψ.

This talk is about estimation of the ψ's, and also about confidence intervals for them.


Page 6


Literature on Learning Optimal d

At the interface of Statistics and Reinforcement Learning ...

Direct methods:
– Policy search (Kearns, Mansour, and Ng, NIPS, 1999)
– Weighting (Murphy, van der Laan, and Robins, JASA, 2001)

Likelihood-based methods:
– Thall et al. (2000, 2002, 2006)

Regression-based methods:
– Q-learning (Watkins, 1989; Sutton and Barto, 1998) or Fitted Q-iteration (Ernst et al., JMLR, 2005; Antos et al., NIPS, 2007)
– A-learning (Murphy, JRSS-B, 2003) or Structural Nested Mean Models (Robins, 2004)


Page 7


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 8


Our Set-up: Two Stages and Binary Actions

Single Trajectory:

$$\underbrace{O_1, A_1, R_1}_{t=1},\ \underbrace{O_2, A_2, R_2}_{t=2}$$

– $O_t$: Observation (pre-treatment variables) at the $t$-th stage
– $A_t$: Action (treatment) at the $t$-th stage, $A_t \in \{-1, 1\}$
– $H_t$: History at the $t$-th stage; $H_1 = O_1$, $H_2 = \{O_1, A_1, O_2\}$ (one cannot forget the past when making decisions)
– $R_t$: Reward at the $t$-th stage

Training Data Set: $\{O_{i1}, A_{i1}, R_{i1}, O_{i2}, A_{i2}, R_{i2}\}$, $i = 1, \ldots, n$.


Page 9


Motivation for Q-learning

The intuition comes from Dynamic Programming (Bellman, 1957) in the case of known Q-functions: move backward in time to take care of the “delayed effect” of actions.

Two Q-functions (“Q” stands for “quality”):

$$Q_2(h_2, a_2) = E\left[R_2 \,\middle|\, H_2 = h_2, A_2 = a_2\right], \quad \forall a_2$$

$$Q_1(h_1, a_1) = E\left[R_1 + \underbrace{\max_{a_2} Q_2(H_2, a_2)}_{\text{delayed effect}} \,\middle|\, H_1 = h_1, A_1 = a_1\right], \quad \forall a_1$$

Optimal policy:

$$d_t(h_t) = \arg\max_{a_t} Q_t(h_t, a_t), \quad t = 1, 2.$$


Page 10


Q-learning with Linear Regression

Model for Q-functions:

$$Q_t(H_t, A_t; \beta_t, \psi_t) = \beta_t^T H_{t0} + (\psi_t^T H_{t1})\,A_t, \quad t = 1, 2.$$

– $H_{t0}$ and $H_{t1}$ are two possibly different vector features of $H_t$.

Stage-2 Regression:

$$(\hat\beta_2, \hat\psi_2) = \arg\min_{\beta_2, \psi_2} \frac{1}{n} \sum_{i=1}^n \big(R_{2i} - Q_2(H_{2i}, A_{2i}; \beta_2, \psi_2)\big)^2.$$

Stage-1 Pseudo-outcome:

$$\hat Y_{1i} \leftarrow R_{1i} + \max_a Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2), \quad i = 1, \ldots, n.$$

Stage-1 Regression:

$$(\hat\beta_1, \hat\psi_1) = \arg\min_{\beta_1, \psi_1} \frac{1}{n} \sum_{i=1}^n \big(\hat Y_{1i} - Q_1(H_{1i}, A_{1i}; \beta_1, \psi_1)\big)^2.$$

Estimated Optimal Policy:

$$\hat d_t(h_t) = \arg\max_a Q_t(h_t, a; \hat\beta_t, \hat\psi_t) = \mathrm{sign}(\hat\psi_t^T h_{t1}), \quad \forall t.$$
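To make the two regressions concrete, here is a minimal sketch in Python/NumPy. This is our own illustrative code, not from the slides; the input names and the assumption that the feature matrices are pre-built are ours.

```python
import numpy as np

def q_learning(H10, H11, A1, R1, H20, H21, A2, R2):
    """Two-stage Q-learning with linear models
    Q_t = beta_t' H_t0 + (psi_t' H_t1) A_t, actions coded -1/+1.
    A sketch: H_t0, H_t1 are (n, p) feature matrices built from the
    histories; A_t, R_t are length-n vectors."""
    # Stage-2 regression: least squares of R2 on (H20, H21 * A2).
    X2 = np.hstack([H20, H21 * A2[:, None]])
    th2, *_ = np.linalg.lstsq(X2, R2, rcond=None)
    beta2, psi2 = th2[:H20.shape[1]], th2[H20.shape[1]:]

    # Stage-1 pseudo-outcome: with A2 in {-1, +1},
    # max_a Q2 = beta2' H20 + |psi2' H21|.
    Y1 = R1 + H20 @ beta2 + np.abs(H21 @ psi2)

    # Stage-1 regression: least squares of Y1 on (H10, H11 * A1).
    X1 = np.hstack([H10, H11 * A1[:, None]])
    th1, *_ = np.linalg.lstsq(X1, Y1, rcond=None)
    beta1, psi1 = th1[:H10.shape[1]], th1[H10.shape[1]:]

    # Estimated policy at stage t: d_t(h) = sign(psi_t' h_t1).
    return beta1, psi1, beta2, psi2
```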

Page 11


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 12


Bias in $\hat\psi_1$

$$\hat Y_{1i} = R_{1i} + \max_a Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2) = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}|$$

– Maximization is a non-smooth, non-linear operation. In general,

$$E\big[\max_a Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2)\big] \neq \max_a Q_2(H_{2i}, a; \beta_2, \psi_2),$$

even if $Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2)$ is unbiased for $Q_2(H_{2i}, a; \beta_2, \psi_2)$.

– Thus $\hat Y_{1i}$ is a biased “estimate” of $R_{1i} + \max_a Q_2(H_{2i}, a; \beta_2, \psi_2)$.

– Bias in $\hat Y_{1i}$, $i = 1, \ldots, n$, can induce bias in $\hat\psi_1$ (the estimate of the stage-1 policy parameters), and hence a possibly sub-optimal policy!


Page 13


Toy Example to Illustrate Bias

To estimate $|\mu|$ based on $X_1, \ldots, X_n \stackrel{i.i.d.}{\sim} N(\mu, 1)$.

The maximum likelihood estimator of $|\mu|$ is $|\bar X_n|$, where $\bar X_n$ is the sample mean.

At $\mu = 0$, the point of non-differentiability of $|\mu|$, $|\bar X_n|$ is biased:

$$\lim_{n \to \infty} E\big[\sqrt{n}\,(|\bar X_n| - |\mu|)\big] = \begin{cases} \sqrt{2/\pi} & \text{if } \mu = 0 \\ 0 & \text{if } \mu \neq 0 \end{cases}$$

Detailed discussion on bias in policy estimation: Robins, 2004; Moodie and Richardson, 2008.
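A quick Monte Carlo check of this limit (a sketch, our own code; since $\bar X_n \sim N(\mu, 1/n)$ exactly, the sample mean is drawn directly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10_000, 200_000

# Estimate sqrt(n) * E[|X_bar| - |mu|] by Monte Carlo.
for mu in (0.0, 1.0):
    xbar = rng.normal(mu, 1.0 / np.sqrt(n), size=reps)
    print(f"mu = {mu}: sqrt(n) * bias ~ "
          f"{np.sqrt(n) * (np.mean(np.abs(xbar)) - abs(mu)):.4f}")

print(f"sqrt(2/pi) = {np.sqrt(2.0 / np.pi):.4f}")   # ~ 0.7979
```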


Page 14


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 15


Why Confidence Intervals?

We want to construct confidence intervals (CIs) for the policy parameters ψ1, ψ2 in order to:

– reduce the number of variables to be collected for future implementations of d (i.e., variable selection)

– know when there is insufficient evidence in the training data to recommend one treatment over another – in such cases, treatment can be chosen by other considerations like cost, familiarity, burden, preference, etc.

CIs for ψ2 are fine, but not for ψ1. Why?


Page 16


Confidence Intervals and Non-regularity

$$\hat Y_{1i} = R_{1i} + \max_a Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2) = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}|$$

CIs can be wrongly centered due to the bias in $\hat\psi_1$.

Because of the lack of smoothness (non-differentiability) of $\hat Y_{1i}$:

– The approximate large-sample distribution of $\hat\psi_1$ is normal if $P[\psi_2^T H_{21} = 0] = 0$ and non-normal if $P[\psi_2^T H_{21} = 0] > 0$.
– $P[\psi_2^T H_{21} = 0]$ denotes the population proportion of zero stage-2 effects.
– The large-sample distribution does not converge uniformly over the parameter space, and the change between the two distributions is abrupt: non-regularity (Robins, 2004).


Page 17


Confidence Intervals and Non-regularity

Whenever the true $\psi_2^T H_{21} \approx 0$, CIs based on Taylor-series arguments (Wald-type CIs) show poor frequentist properties (e.g., Robins, 2004; Moodie and Richardson, 2008).

Usual bootstrap CIs also show poor frequentist properties:
– The bootstrap is inconsistent due to non-differentiability (e.g., Shao, 1994).
– We will illustrate this in the simulations to follow.
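The inconsistency can be seen in the earlier toy example: at $\mu = 0$, the bootstrap distribution of $\sqrt{n}(|\bar X_n^*| - |\bar X_n|)$ depends on the realized $\sqrt{n}\,\bar X_n$ and so does not settle down to the fixed sampling distribution of $\sqrt{n}(|\bar X_n| - |\mu|)$. A sketch (our own code; a consistent bootstrap would produce nearly identical output for the two independent samples):

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 200, 4000

def boot_dist(x):
    """Bootstrap distribution of sqrt(n) * (|xbar*| - |xbar|)."""
    idx = rng.integers(0, len(x), size=(B, len(x)))
    return np.sqrt(len(x)) * (np.abs(x[idx].mean(axis=1)) - np.abs(x.mean()))

# Two independent samples from the SAME model (mu = 0, the non-regular
# point); the bootstrap quantiles typically differ noticeably.
for rep in range(2):
    x = rng.normal(0.0, 1.0, size=n)
    d = boot_dist(x)
    print(f"sample {rep}: sqrt(n)*xbar = {np.sqrt(n) * x.mean():+.2f}, "
          f"bootstrap (2.5%, 97.5%) = "
          f"({np.percentile(d, 2.5):+.2f}, {np.percentile(d, 97.5):+.2f})")
```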

So what could be a remedy?


Page 18


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 19


Soft-threshold Estimator

$$\hat Y^{ST}_{1i} = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}| \cdot \left(1 - \frac{\lambda_i}{|\hat\psi_2^T H_{21,i}|^2}\right)^{\!+}, \quad \lambda_i > 0$$

The corresponding estimator $\hat\psi^{ST}_1$ is the soft-threshold estimator.


Page 20

Soft-threshold Estimator

The soft-threshold estimator is akin to shrinkage estimators in supervised learning:
– non-negative garrote (Breiman, Technometrics, 1995)
– adaptive lasso (Zou, JASA, 2006)

It also resembles the more classical positive-part James-Stein estimator (Baranchik, 1970; Efron and Morris, 1973).

$\hat Y^{ST}_{1i}$ is still a non-smooth function, but the problematic term $|\hat\psi_2^T H_{21,i}|$ is shrunk (thresholded) towards zero.

One “good” choice of the tuning parameters $\lambda_i$, $i = 1, \ldots, n$, can be derived from a Bayesian formulation of the problem.

– We use a Bayesian model originally used for wavelet-based image estimation by Figueiredo and Nowak (IEEE Transactions on Image Processing, 2001).

Page 21

Choice of Tuning Parameter

Inspired by the work of Figueiredo and Nowak (2001) ...

Lemma

Consider a hierarchical Bayesian model where $X \mid \mu \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, the prior on $\mu$ is $\mu \mid \phi^2 \sim N(0, \phi^2)$, and $\phi^2$ carries Jeffreys' noninformative hyper-prior $p(\phi^2) \propto 1/\phi^2$. Then an empirical Bayes estimator of $|\mu|$ is given by

$$\widehat{|\mu|}^{EB} = X\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}\left(2\,\Phi\!\left(\frac{X}{\sigma}\sqrt{\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}}\,\right) - 1\right) + \sqrt{\frac{2}{\pi}}\,\sigma\sqrt{\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}}\,\exp\left\{-\frac{X^2}{2\sigma^2}\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}\right\}.$$

Furthermore, for $|X/\sigma|$ large, $\widehat{|\mu|}^{EB} \approx |X|\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}$.

Page 22


Choice of Tuning Parameter

For fixed $H_{21,i}$, $i = 1, \ldots, n$, separately:

– Put $X = \hat\psi_2^T H_{21,i}$ and $\mu = \psi_2^T H_{21,i}$.
– Plug in $\hat\sigma^2 = H_{21,i}^T \hat\Sigma_2 H_{21,i}/n$ for $\sigma^2$.
– Apply the previous lemma.

This leads to the choice $\lambda_i = 3\,H_{21,i}^T \hat\Sigma_2 H_{21,i}/n$ for the soft-threshold pseudo-outcome $\hat Y^{ST}_{1i}$:

$$\hat Y^{ST}_{1i} = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}|\left(1 - \frac{3\,H_{21,i}^T \hat\Sigma_2 H_{21,i}}{n\,|\hat\psi_2^T H_{21,i}|^2}\right)^{\!+}.$$
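In code, the soft-threshold pseudo-outcome then needs only the stage-2 estimates (a minimal sketch with assumed inputs, not the authors' implementation):

```python
import numpy as np

def soft_threshold_pseudo_outcome(R1, H20, H21, beta2, psi2, Sigma2, n):
    """Soft-threshold pseudo-outcome with lambda_i = 3 H21_i' Sigma2 H21_i / n.
    Assumed inputs: beta2, psi2 are the stage-2 estimates and Sigma2 / n is
    the estimated covariance matrix of psi2 (a sketch)."""
    effect = H21 @ psi2                              # psi2_hat' H21_i
    lam = 3.0 * np.einsum('ij,jk,ik->i', H21, Sigma2, H21) / n
    denom = effect**2
    with np.errstate(divide='ignore', invalid='ignore'):
        shrink = np.maximum(1.0 - lam / denom, 0.0)  # (1 - lam/|effect|^2)^+
    shrink[denom == 0.0] = 0.0                       # fully thresholded
    return R1 + H20 @ beta2 + np.abs(effect) * shrink
```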


Page 23


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 24

Simulation Design

Results from 1000 Monte Carlo simulations; each simulated data set is of size n = 300.

Generative Model:

– $O_1, A_1, A_2 \in \{-1, 1\}$, each with probability 0.5.
– $O_2 \in \{-1, 1\}$ with $P[O_2 = 1 \mid O_1, A_1]$ varied across the examples.
– $R_1 = 0$.
– Generation of $R_2$ varies across the examples.

Analysis Model:

$$Q_2 = \beta_{20} + \beta_{21} O_1 + \beta_{22} A_1 + \beta_{23} O_1 A_1 + \underbrace{(\psi_{20} + \psi_{21} O_2 + \psi_{22} A_1)}_{\psi_2^T H_{21}}\,A_2$$

$$Q_1 = \beta_{10} + \beta_{11} O_1 + (\psi_{10} + \psi_{11} O_1)\,A_1$$

Two types of bootstrap CIs are used, percentile and hybrid (Efron and Tibshirani, 1993), each with 1000 bootstrap iterations.
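A minimal sketch of this design under Example 1 of the next slide: one simulated data set, the hard-max estimate of $\psi_{10}$, and a percentile bootstrap CI. This is our own illustrative code; the hybrid CI and the 1000-data-set Monte Carlo loop are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n=300):
    """One data set from Example 1: R1 = 0 and R2 = N(0, 1) noise,
    so every stage-2 treatment effect is zero (non-regular case)."""
    O1, A1, A2 = rng.choice([-1, 1], size=(3, n))
    p = 1.0 / (1.0 + np.exp(-0.5 * (O1 + A1)))      # P[O2 = 1 | O1, A1]
    O2 = np.where(rng.random(n) < p, 1, -1)
    R2 = rng.standard_normal(n)
    return O1, A1, O2, A2, R2

def psi10_hat(O1, A1, O2, A2, R2):
    """Hard-max Q-learning estimate of psi_10 under the analysis model."""
    one = np.ones(len(R2))
    # Stage 2: R2 on (1, O1, A1, O1*A1) and (1, O2, A1) * A2.
    X2 = np.column_stack([one, O1, A1, O1 * A1, A2, O2 * A2, A1 * A2])
    th2, *_ = np.linalg.lstsq(X2, R2, rcond=None)
    beta2, psi2 = th2[:4], th2[4:]
    # Pseudo-outcome (R1 = 0): max over a2 gives beta2'H20 + |psi2'H21|.
    Y1 = X2[:, :4] @ beta2 + np.abs(np.column_stack([one, O2, A1]) @ psi2)
    # Stage 1: Y1 on (1, O1) and (1, O1) * A1; psi_10 multiplies A1.
    X1 = np.column_stack([one, O1, A1, O1 * A1])
    th1, *_ = np.linalg.lstsq(X1, Y1, rcond=None)
    return th1[2]

data = simulate()
n = len(data[0])
boot = [psi10_hat(*(v[idx] for v in data))
        for idx in rng.integers(0, n, size=(1000, n))]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"psi10_hat = {psi10_hat(*data):.3f}, "
      f"95% percentile CI = ({lo:.3f}, {hi:.3f})")
```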

Page 25


Example 1 (Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.5(O_1 + A_1))}{1 + \exp(0.5(O_1 + A_1))}, \qquad R_2 = \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 1$$

Estimation of $\psi_{10}$:

Estimator      |   Bias  |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  0.0003 | 0.0045 | 96.8*                     | 93.5*
soft-threshold |  0.0009 | 0.0036 | 95.3                      | 96.1

Intended coverage rate of CIs = 95%.

Page 26


Example 2 (Nearly Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.5(O_1 + A_1))}{1 + \exp(0.5(O_1 + A_1))}, \qquad R_2 = 0.01 A_2 + \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 0 \quad \text{but} \quad \psi_2^T H_{21} \approx 0$$

Estimation of $\psi_{10}$:

Estimator      |   Bias  |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  0.0003 | 0.0045 | 96.7*                     | 93.4*
soft-threshold |  0.0008 | 0.0036 | 95.4                      | 95.9

Intended coverage rate of CIs = 95%.

Page 27


Example 3 (Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.5(O_1 + A_1))}{1 + \exp(0.5(O_1 + A_1))}, \qquad R_2 = -0.5 A_1 + 0.5 A_2 + 0.5 A_1 A_2 + \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 0.5$$

Estimation of $\psi_{10}$:

Estimator      |   Bias   |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  -0.0401 | 0.0075 | 88.4*                     | 92.7*
soft-threshold |  -0.0185 | 0.0058 | 93.4*                     | 94.9

Intended coverage rate of CIs = 95%.

Page 28


Example 4 (Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(O_1)}{1 + \exp(O_1)}, \qquad R_2 = -0.5 A_1 + A_2 + 0.5 O_2 A_2 + 0.5 A_1 A_2 + \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 0.25$$

Estimation of $\psi_{10}$:

Estimator      |   Bias   |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  -0.0209 | 0.0074 | 92.7*                     | 93.1*
soft-threshold |  -0.0065 | 0.0069 | 93.8                      | 94.6

Intended coverage rate of CIs = 95%.

Page 29


Example 5 (Regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.1(O_1 + A_1))}{1 + \exp(0.1(O_1 + A_1))}, \qquad R_2 = -0.5 A_1 + A_2 + 0.5 O_2 A_2 + 0.5 A_1 A_2 + \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 0 \quad \text{and} \quad \psi_2^T H_{21} \ \text{is not very close to } 0$$

Estimation of $\psi_{10}$:

Estimator      |   Bias  |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  0.0009 | 0.0067 | 95.0                      | 93.8
soft-threshold |  0.0052 | 0.0074 | 94.8                      | 91.7*

Intended coverage rate of CIs = 95%.

Page 30


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 31


Smoking Cessation Study (Simplified)

Randomized Study at U-Michigan (Strecher et al., 2008)

Stage-1:
– $O_1$: motivation (1-10), self-efficacy (1-10), education (≤ high school vs. > high school)
– $A_1 = (A_{11}, A_{12})$, $A_{1j} \in \{-1, 1\}$ (low vs. high): $A_{11}$ = message source, $A_{12}$ = tailoring of story
– $R_1$: quit status at 6 months (1 = quit, 0 = not quit)

Stage-2:
– $O_2$: quit status at 6 months (1 = quit, 0 = not quit)
– $A_2$: treatment vs. control, coded $\{-1, 1\}$
– $R_2$: quit status at 12 months (1 = quit, 0 = not quit)


Page 32

Data Analysis Results

No significant stage 2 treatment effect: Non-regularity!

Stage 1 Analysis Summary:

Variable             | Method         | Coefficient | 95% Bootstrap CI
motivation           | hard-max       |  0.04 | (-0.00, 0.08)
motivation           | soft-threshold |  0.04 | (0.00, 0.08)*
self-efficacy        | hard-max       |  0.03 | (0.00, 0.06)*
self-efficacy        | soft-threshold |  0.03 | (0.00, 0.06)*
education            | hard-max       | -0.01 | (-0.07, 0.06)
education            | soft-threshold | -0.01 | (-0.07, 0.06)
source               | hard-max       | -0.15 | (-0.35, 0.06)
source               | soft-threshold | -0.15 | (-0.35, 0.06)
source:self-efficacy | hard-max       |  0.03 | (0.00, 0.06)*
source:self-efficacy | soft-threshold |  0.03 | (0.00, 0.06)*
story                | hard-max       |  0.05 | (-0.01, 0.11)
story                | soft-threshold |  0.05 | (-0.01, 0.11)
story:education      | hard-max       | -0.07 | (-0.13, -0.01)*
story:education      | soft-threshold | -0.07 | (-0.13, -0.01)*

Page 33

What have we learned as a policy-maker?

Source by Self-efficacy Interaction

The “highly personalized” level of source is more effective for subjects with higher self-efficacy (7 or above on a 1-10 scale).

Page 34

What have we learned as a policy-maker?

Story by Education Interaction

The “deeply tailored” level of story is more effective for subjects with less education (≤ high school).

Page 35


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 36


Discussion

The soft-threshold estimator reduces bias and improves coverage of CIs when estimating optimal policies via Q-learning in non-regular settings.

It can be used for both randomized and observational data.

The original hard-max estimator should be preferred over the soft-threshold estimator when the stage-2 treatments are “too different” (so-called regular settings).

– However, regular settings are unlikely to occur in clinical trial data due to ethical considerations (Freedman, NEJM, 1987).


Page 37


More Sophisticated Bootstrap CIs

We tried the double bootstrap (Davison and Hinkley, 1997; Nankervis, 2005) with the hard-max estimator.

– It gave valid coverage rates of CIs.

– It is computationally much more expensive.

We did not try m-out-of-n bootstrap (Shao, 1994).

– Choice of m in the present context is not obvious.

– It gives a slower rate of convergence than $\sqrt{n}$ even in regular settings.


Page 38


Ongoing and Future Work

Does the soft-threshold estimator have lower MSE (risk) than the hard-max estimator?

– Need to derive a theoretical optimality result, similar to the results in the area of wavelet shrinkage (e.g., Donoho and Johnstone, Biometrika, 1994; Johnstone and Silverman, Annals of Statistics, 2004).

How does the method generalize to more than two treatments(actions) and more than two stages?

To what extent does the problem of non-regularity arise in the case of more sophisticated regression models (e.g., regression trees, splines, neural networks, ...)?


Page 39


Contribution

“Evidence-based” treatment policies are of great interest in medicine – we have tried to address a need of this community.

We developed a shrinkage and thresholding scheme to reduce bias and to construct valid confidence intervals for some non-regular parameters of practical importance.

Our work offers a way of improving Q-learning (Fitted Q-iteration) that can benefit the reinforcement learning community.

Our method can potentially be useful in other multistage decision problems, e.g., market intelligence.


Page 40

Main References

L. Breiman (1995). Technometrics, 37(4): 373-384.

B. Chakraborty, V. Strecher, and S. Murphy (2008). Tentatively accepted by Statistical Methods in Medical Research.

B. Chakraborty, V. Strecher, and S. Murphy (2008). NIPS Workshop on Model Uncertainty and Risk in Reinforcement Learning.

M. Figueiredo and R. Nowak (2001). IEEE Transactions on Image Processing, 10(9): 1322-1331.

E. Moodie and T. Richardson (2008). To appear in Scandinavian Journal of Statistics.

J. Robins (2004). Proceedings of the Second Seattle Symposium on Biostatistics, New York, Springer.

V. Strecher, J. McClure, G. Alexander, B. Chakraborty, V. Nair, et al. (2008). American Journal of Preventive Medicine, 34(5): 373-381.

C. Watkins (1989). PhD Thesis, Cambridge University.

Page 41

Questions? Comments?

Thank you!

Page 42

Q-learning and Structural Nested Mean Models

A bridge between the Machine Learning and Biostatistics literatures ...

Lemma

Consider linear models for the Q-functions, and assume that:

(i) the parameters in $Q_1$ and $Q_2$ are distinct;
(ii) $A_t$ has zero conditional mean given $H_t$, $t = 1, 2$; and
(iii) the covariates used in the model for $Q_1$ are nested within the covariates used in the model for $Q_2$, i.e., $(H_{10}^T, H_{11}^T A_1) \subset H_{20}^T$.

Then Q-learning is algebraically equivalent to an inefficient version of Robins' method of estimation in Structural Nested Mean Models.


Page 43

Non-regularity in Other Areas

Inference based on shrinkage estimators: Sen and Saleh (1987), Leeb and Pötscher (2006)

Post-model-selection inference: Kabaila (1995), Leeb and Pötscher (2005)

Inference in autoregressive models in time series: Stock (1991), Andrews (1993)

Inference on eigenvalues of a covariance matrix: Beran and Srivastava (1985, 1987)

And many more ...


Page 44

Hard-threshold Estimator

$$\hat Y^{HT}_{1i} = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}| \cdot 1\!\left\{|\hat\psi_2^T H_{21,i}| > \lambda_i\right\}, \quad \lambda_i > 0$$

The corresponding estimator $\hat\psi^{HT}_1$ is the hard-threshold estimator.


Page 45

Hard-threshold Estimator

The hard-threshold estimator reduces the bias of the original “hard-max” estimator (Moodie and Richardson, 2008).

The hard-threshold pseudo-outcome has two points of discontinuity!

No data-driven choice of λi is currently available.

45 / 46

Page 46

Hard-threshold Estimator

Usually $\lambda_i$ is set equal to $z_{\alpha/2}\sqrt{H_{21,i}^T \hat\Sigma_2 H_{21,i}/n}$, where $\hat\Sigma_2/n$ is the estimated covariance matrix of $\hat\psi_2$ and $\alpha$ is a tuning parameter to be specified by the user:

$$\hat Y^{HT}_{1i} = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}| \cdot 1\!\left\{\frac{\sqrt{n}\,|\hat\psi_2^T H_{21,i}|}{\sqrt{H_{21,i}^T \hat\Sigma_2 H_{21,i}}} > z_{\alpha/2}\right\}$$

So the tuning parameter is $\alpha$.

In simulations, we use α = 0.08, which we empirically found to work well.
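For comparison with the soft-threshold sketch earlier, the hard-threshold pseudo-outcome with this choice of $\lambda_i$ (again a sketch with assumed inputs, not the authors' implementation):

```python
import numpy as np
from scipy.stats import norm

def hard_threshold_pseudo_outcome(R1, H20, H21, beta2, psi2, Sigma2, n,
                                  alpha=0.08):
    """Hard-threshold pseudo-outcome: keep |psi2_hat' H21_i| only when its
    standardized value exceeds z_{alpha/2} (a sketch; Sigma2 / n is the
    assumed estimated covariance matrix of psi2)."""
    effect = H21 @ psi2                               # psi2_hat' H21_i
    se = np.sqrt(np.einsum('ij,jk,ik->i', H21, Sigma2, H21) / n)
    keep = np.abs(effect) > norm.ppf(1.0 - alpha / 2.0) * se
    return R1 + H20 @ beta2 + np.abs(effect) * keep
```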


