Page 1

Multistage Decision Policies via Q-learning: Bias Correction and Confidence Intervals

Bibhas Chakraborty
Department of Statistics, University of Michigan

Advisor: Prof. Susan A. Murphy

Bell Labs, Murray Hill, New Jersey

January 9, 2009


Page 2

Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 3


One-stage Decision Problem

Data on (O,A,R): observation, action, reward.

Example 1: Medical Decision Support System

O : Patient’s pre-treatment variables (e.g., age, gender, ...)

A : Treatment

R : Outcome (e.g., patient’s survival time after treatment)

Example 2: Marketing Recommender System

O : Customer behavior (e.g., past purchases, ...)

A : Product recommendation (e.g., advertisements)

R : Profit

Goal: To find (learn) a policy d : O → A to optimize E[R | O, A].

Page 4


Multistage Medical Decision Problem

Diseases like depression, schizophrenia, drug and alcohol dependence, HIV infection, ... are treated in multiple stages.

– At each stage, treatment is adapted to the available information on patient characteristics and past treatments.

Information collected over multiple stages on a single patient forms a longitudinal trajectory: $(O_1, A_1, R_1, O_2, A_2, R_2, \ldots)$.

Goal: To learn a “good” treatment policy $d \equiv (d_1, d_2, \ldots)$ from a training data set of such trajectories.


Page 5


Multistage Medical Decision Problem

The learned policy d can be employed to determine optimal treatments (actions) in the future.

As policies, here we will consider parametric functions only, e.g., $d_t(o_t) = \mathrm{sign}(\psi^T o_t)$ for treatments coded −1/1.

– Thus learning a policy essentially means estimating the “policy parameters” ψ.

This talk is about estimation of the ψ's, and also about confidence intervals for them.


Page 6


Literature on Learning Optimal d

At the interface of Statistics and Reinforcement Learning ...

Direct methods:
– Policy search (Kearns, Mansour, and Ng, NIPS, 1999)
– Weighting (Murphy, van der Laan, and Robins, JASA, 2001)

Likelihood-based methods:
– Thall et al. (2000, 2002, 2006)

Regression-based methods:
– Q-learning (Watkins, 1989; Sutton and Barto, 1998) or Fitted Q-iteration (Ernst et al., JMLR, 2005; Antos et al., NIPS, 2007)
– A-learning (Murphy, JRSS-B, 2003) or Structural Nested Mean Models (Robins, 2004)


Page 7


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 8


Our Set-up: Two Stages and Binary Actions

Single Trajectory:

$$\underbrace{O_1, A_1, R_1}_{t=1},\ \underbrace{O_2, A_2, R_2}_{t=2}$$

– $O_t$: Observation (pre-treatment variables) at the $t$-th stage
– $A_t$: Action (treatment) at the $t$-th stage, $A_t \in \{-1, 1\}$
– $H_t$: History at the $t$-th stage; $H_1 = O_1$, $H_2 = \{O_1, A_1, O_2\}$ (one cannot forget the past when making decisions)
– $R_t$: Reward at the $t$-th stage

Training Data Set: $\{O_{i1}, A_{i1}, R_{i1}, O_{i2}, A_{i2}, R_{i2}\}$, $i = 1, \ldots, n$.


Page 9


Motivation for Q-learning

The intuition comes from Dynamic Programming (Bellman, 1957) in the case of known Q-functions: move backward in time to take care of the “delayed effect” of actions.

Two Q-functions (“Q” stands for “quality”):

$$Q_2(h_2, a_2) = E\left[R_2 \,\middle|\, H_2 = h_2, A_2 = a_2\right], \quad \forall a_2$$

$$Q_1(h_1, a_1) = E\left[R_1 + \underbrace{\max_{a_2} Q_2(H_2, a_2)}_{\text{delayed effect}} \,\middle|\, H_1 = h_1, A_1 = a_1\right], \quad \forall a_1$$

Optimal policy:

$$d_t(h_t) = \arg\max_{a_t} Q_t(h_t, a_t), \quad t = 1, 2.$$


Page 10


Q-learning with Linear Regression

Model for Q-functions:

$$Q_t(H_t, A_t; \beta_t, \psi_t) = \beta_t^T H_{t0} + (\psi_t^T H_{t1})\,A_t, \quad t = 1, 2.$$

– $H_{t0}$ and $H_{t1}$ are two possibly different vector features of $H_t$.

Stage-2 Regression:

$$(\hat\beta_2, \hat\psi_2) = \arg\min_{\beta_2, \psi_2} \frac{1}{n} \sum_{i=1}^n \big(R_{2i} - Q_2(H_{2i}, A_{2i}; \beta_2, \psi_2)\big)^2.$$

Stage-1 Pseudo-outcome:

$$\hat Y_{1i} \leftarrow R_{1i} + \max_a Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2), \quad i = 1, \ldots, n.$$

Stage-1 Regression:

$$(\hat\beta_1, \hat\psi_1) = \arg\min_{\beta_1, \psi_1} \frac{1}{n} \sum_{i=1}^n \big(\hat Y_{1i} - Q_1(H_{1i}, A_{1i}; \beta_1, \psi_1)\big)^2.$$

Estimated Optimal Policy:

$$\hat d_t(h_t) = \arg\max_a Q_t(h_t, a; \hat\beta_t, \hat\psi_t) = \mathrm{sign}(\hat\psi_t^T h_{t1}), \quad \forall t.$$
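To make the two regressions concrete, here is a minimal sketch in Python/NumPy. This is our own illustrative code, not from the slides; the input names and the assumption that the feature matrices are pre-built are ours.

```python
import numpy as np

def q_learning(H10, H11, A1, R1, H20, H21, A2, R2):
    """Two-stage Q-learning with linear models
    Q_t = beta_t' H_t0 + (psi_t' H_t1) A_t, actions coded -1/+1.
    A sketch: H_t0, H_t1 are (n, p) feature matrices built from the
    histories; A_t, R_t are length-n vectors."""
    # Stage-2 regression: least squares of R2 on (H20, H21 * A2).
    X2 = np.hstack([H20, H21 * A2[:, None]])
    th2, *_ = np.linalg.lstsq(X2, R2, rcond=None)
    beta2, psi2 = th2[:H20.shape[1]], th2[H20.shape[1]:]

    # Stage-1 pseudo-outcome: with A2 in {-1, +1},
    # max_a Q2 = beta2' H20 + |psi2' H21|.
    Y1 = R1 + H20 @ beta2 + np.abs(H21 @ psi2)

    # Stage-1 regression: least squares of Y1 on (H10, H11 * A1).
    X1 = np.hstack([H10, H11 * A1[:, None]])
    th1, *_ = np.linalg.lstsq(X1, Y1, rcond=None)
    beta1, psi1 = th1[:H10.shape[1]], th1[H10.shape[1]:]

    # Estimated policy at stage t: d_t(h) = sign(psi_t' h_t1).
    return beta1, psi1, beta2, psi2
```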

Page 11


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 12


Bias in $\hat\psi_1$

$$\hat Y_{1i} = R_{1i} + \max_a Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2) = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}|$$

– Maximization is a non-smooth, non-linear operation. In general,

$$E\big[\max_a Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2)\big] \neq \max_a Q_2(H_{2i}, a; \beta_2, \psi_2),$$

even if $Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2)$ is unbiased for $Q_2(H_{2i}, a; \beta_2, \psi_2)$.

– Thus $\hat Y_{1i}$ is a biased “estimate” of $R_{1i} + \max_a Q_2(H_{2i}, a; \beta_2, \psi_2)$.

– Bias in $\hat Y_{1i}$, $i = 1, \ldots, n$, can induce bias in $\hat\psi_1$ (the estimate of the stage-1 policy parameters), and hence a possibly sub-optimal policy!


Page 13


Toy Example to Illustrate Bias

To estimate $|\mu|$ based on $X_1, \ldots, X_n \stackrel{i.i.d.}{\sim} N(\mu, 1)$.

The maximum likelihood estimator of $|\mu|$ is $|\bar X_n|$, where $\bar X_n$ is the sample mean.

At $\mu = 0$, the point of non-differentiability of $|\mu|$, $|\bar X_n|$ is biased:

$$\lim_{n \to \infty} E\big[\sqrt{n}\,(|\bar X_n| - |\mu|)\big] = \begin{cases} \sqrt{2/\pi} & \text{if } \mu = 0 \\ 0 & \text{if } \mu \neq 0 \end{cases}$$

Detailed discussion on bias in policy estimation: Robins, 2004; Moodie and Richardson, 2008.
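A quick Monte Carlo check of this limit (a sketch, our own code; since $\bar X_n \sim N(\mu, 1/n)$ exactly, the sample mean is drawn directly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10_000, 200_000

# Estimate sqrt(n) * E[|X_bar| - |mu|] by Monte Carlo.
for mu in (0.0, 1.0):
    xbar = rng.normal(mu, 1.0 / np.sqrt(n), size=reps)
    print(f"mu = {mu}: sqrt(n) * bias ~ "
          f"{np.sqrt(n) * (np.mean(np.abs(xbar)) - abs(mu)):.4f}")

print(f"sqrt(2/pi) = {np.sqrt(2.0 / np.pi):.4f}")   # ~ 0.7979
```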


Page 14


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 15


Why Confidence Intervals?

We want to construct confidence intervals (CIs) for the policy parameters ψ1, ψ2 in order to:

– reduce the number of variables to be collected for future implementations of d (i.e., variable selection)

– know when there is insufficient evidence in the training data to recommend one treatment over another – in such cases, treatment can be chosen by other considerations like cost, familiarity, burden, preference, etc.

CIs for ψ2 are fine, but not for ψ1. Why?


Page 16


Confidence Intervals and Non-regularity

$$\hat Y_{1i} = R_{1i} + \max_a Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2) = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}|$$

CIs can be wrongly centered due to the bias in $\hat\psi_1$.

Because of the lack of smoothness (non-differentiability) of $\hat Y_{1i}$:

– The approximate large-sample distribution of $\hat\psi_1$ is normal if $P[\psi_2^T H_{21} = 0] = 0$ and non-normal if $P[\psi_2^T H_{21} = 0] > 0$.
– $P[\psi_2^T H_{21} = 0]$ denotes the population proportion of zero stage-2 effects.
– The large-sample distribution does not converge uniformly over the parameter space, and the change between the two distributions is abrupt: non-regularity (Robins, 2004).


Page 17


Confidence Intervals and Non-regularity

Whenever the true $\psi_2^T H_{21} \approx 0$, CIs based on Taylor-series arguments (Wald-type CIs) show poor frequentist properties (e.g., Robins, 2004; Moodie and Richardson, 2008).

Usual bootstrap CIs also show poor frequentist properties:
– The bootstrap is inconsistent due to non-differentiability (e.g., Shao, 1994).
– We will illustrate this in the simulations to follow.
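The inconsistency can be seen in the earlier toy example: at $\mu = 0$, the bootstrap distribution of $\sqrt{n}(|\bar X_n^*| - |\bar X_n|)$ depends on the realized $\sqrt{n}\,\bar X_n$ and so does not settle down to the fixed sampling distribution of $\sqrt{n}(|\bar X_n| - |\mu|)$. A sketch (our own code; a consistent bootstrap would produce nearly identical output for the two independent samples):

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 200, 4000

def boot_dist(x):
    """Bootstrap distribution of sqrt(n) * (|xbar*| - |xbar|)."""
    idx = rng.integers(0, len(x), size=(B, len(x)))
    return np.sqrt(len(x)) * (np.abs(x[idx].mean(axis=1)) - np.abs(x.mean()))

# Two independent samples from the SAME model (mu = 0, the non-regular
# point); the bootstrap quantiles typically differ noticeably.
for rep in range(2):
    x = rng.normal(0.0, 1.0, size=n)
    d = boot_dist(x)
    print(f"sample {rep}: sqrt(n)*xbar = {np.sqrt(n) * x.mean():+.2f}, "
          f"bootstrap (2.5%, 97.5%) = "
          f"({np.percentile(d, 2.5):+.2f}, {np.percentile(d, 97.5):+.2f})")
```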

So what could be a remedy?


Page 18


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 19


Soft-threshold Estimator

$$\hat Y^{ST}_{1i} = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}| \cdot \left(1 - \frac{\lambda_i}{|\hat\psi_2^T H_{21,i}|^2}\right)^{\!+}, \quad \lambda_i > 0$$

The corresponding estimator $\hat\psi^{ST}_1$ is the soft-threshold estimator.


Page 20

Soft-threshold Estimator

The soft-threshold estimator is akin to shrinkage estimators in supervised learning:
– non-negative garrote (Breiman, Technometrics, 1995)
– adaptive lasso (Zou, JASA, 2006)

It also resembles the more classical positive-part James-Stein estimator (Baranchik, 1970; Efron and Morris, 1973).

$\hat Y^{ST}_{1i}$ is still a non-smooth function, but the problematic term $|\hat\psi_2^T H_{21,i}|$ is shrunk (thresholded) towards zero.

One “good” choice of the tuning parameters $\lambda_i$, $i = 1, \ldots, n$, can be derived from a Bayesian formulation of the problem.

– We use a Bayesian model originally used for wavelet-based image estimation by Figueiredo and Nowak (IEEE Transactions on Image Processing, 2001).

Page 21

Choice of Tuning Parameter

Inspired by the work of Figueiredo and Nowak (2001) ...

Lemma

Consider a hierarchical Bayesian model where $X \mid \mu \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, the prior on $\mu$ is $\mu \mid \phi^2 \sim N(0, \phi^2)$, and $\phi^2$ carries Jeffreys' noninformative hyper-prior $p(\phi^2) \propto 1/\phi^2$. Then an empirical Bayes estimator of $|\mu|$ is given by

$$\widehat{|\mu|}^{EB} = X\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}\left(2\,\Phi\!\left(\frac{X}{\sigma}\sqrt{\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}}\,\right) - 1\right) + \sqrt{\frac{2}{\pi}}\,\sigma\sqrt{\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}}\,\exp\left\{-\frac{X^2}{2\sigma^2}\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}\right\}.$$

Furthermore, for $|X/\sigma|$ large, $\widehat{|\mu|}^{EB} \approx |X|\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}$.

Page 22


Choice of Tuning Parameter

For fixed $H_{21,i}$, $i = 1, \ldots, n$, separately:

– Put $X = \hat\psi_2^T H_{21,i}$ and $\mu = \psi_2^T H_{21,i}$.
– Plug in $\hat\sigma^2 = H_{21,i}^T \hat\Sigma_2 H_{21,i}/n$ for $\sigma^2$.
– Apply the previous lemma.

This leads to the choice $\lambda_i = 3\,H_{21,i}^T \hat\Sigma_2 H_{21,i}/n$ for the soft-threshold pseudo-outcome $\hat Y^{ST}_{1i}$:

$$\hat Y^{ST}_{1i} = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}|\left(1 - \frac{3\,H_{21,i}^T \hat\Sigma_2 H_{21,i}}{n\,|\hat\psi_2^T H_{21,i}|^2}\right)^{\!+}.$$
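In code, the soft-threshold pseudo-outcome then needs only the stage-2 estimates (a minimal sketch with assumed inputs, not the authors' implementation):

```python
import numpy as np

def soft_threshold_pseudo_outcome(R1, H20, H21, beta2, psi2, Sigma2, n):
    """Soft-threshold pseudo-outcome with lambda_i = 3 H21_i' Sigma2 H21_i / n.
    Assumed inputs: beta2, psi2 are the stage-2 estimates and Sigma2 / n is
    the estimated covariance matrix of psi2 (a sketch)."""
    effect = H21 @ psi2                              # psi2_hat' H21_i
    lam = 3.0 * np.einsum('ij,jk,ik->i', H21, Sigma2, H21) / n
    denom = effect**2
    with np.errstate(divide='ignore', invalid='ignore'):
        shrink = np.maximum(1.0 - lam / denom, 0.0)  # (1 - lam/|effect|^2)^+
    shrink[denom == 0.0] = 0.0                       # fully thresholded
    return R1 + H20 @ beta2 + np.abs(effect) * shrink
```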


Page 23


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 24

Simulation Design

Results from 1000 Monte Carlo simulations; each simulated data set is of size n = 300.

Generative Model:

– $O_1, A_1, A_2 \in \{-1, 1\}$, each with probability 0.5.
– $O_2 \in \{-1, 1\}$ with $P[O_2 = 1 \mid O_1, A_1]$ varied across the examples.
– $R_1 = 0$.
– Generation of $R_2$ varies across the examples.

Analysis Model:

$$Q_2 = \beta_{20} + \beta_{21} O_1 + \beta_{22} A_1 + \beta_{23} O_1 A_1 + \underbrace{(\psi_{20} + \psi_{21} O_2 + \psi_{22} A_1)}_{\psi_2^T H_{21}}\,A_2$$

$$Q_1 = \beta_{10} + \beta_{11} O_1 + (\psi_{10} + \psi_{11} O_1)\,A_1$$

Two types of bootstrap CIs are used, percentile and hybrid (Efron and Tibshirani, 1993), each with 1000 bootstrap iterations.
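A minimal sketch of this design under Example 1 of the next slide: one simulated data set, the hard-max estimate of $\psi_{10}$, and a percentile bootstrap CI. This is our own illustrative code; the hybrid CI and the 1000-data-set Monte Carlo loop are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n=300):
    """One data set from Example 1: R1 = 0 and R2 = N(0, 1) noise,
    so every stage-2 treatment effect is zero (non-regular case)."""
    O1, A1, A2 = rng.choice([-1, 1], size=(3, n))
    p = 1.0 / (1.0 + np.exp(-0.5 * (O1 + A1)))      # P[O2 = 1 | O1, A1]
    O2 = np.where(rng.random(n) < p, 1, -1)
    R2 = rng.standard_normal(n)
    return O1, A1, O2, A2, R2

def psi10_hat(O1, A1, O2, A2, R2):
    """Hard-max Q-learning estimate of psi_10 under the analysis model."""
    one = np.ones(len(R2))
    # Stage 2: R2 on (1, O1, A1, O1*A1) and (1, O2, A1) * A2.
    X2 = np.column_stack([one, O1, A1, O1 * A1, A2, O2 * A2, A1 * A2])
    th2, *_ = np.linalg.lstsq(X2, R2, rcond=None)
    beta2, psi2 = th2[:4], th2[4:]
    # Pseudo-outcome (R1 = 0): max over a2 gives beta2'H20 + |psi2'H21|.
    Y1 = X2[:, :4] @ beta2 + np.abs(np.column_stack([one, O2, A1]) @ psi2)
    # Stage 1: Y1 on (1, O1) and (1, O1) * A1; psi_10 multiplies A1.
    X1 = np.column_stack([one, O1, A1, O1 * A1])
    th1, *_ = np.linalg.lstsq(X1, Y1, rcond=None)
    return th1[2]

data = simulate()
n = len(data[0])
boot = [psi10_hat(*(v[idx] for v in data))
        for idx in rng.integers(0, n, size=(1000, n))]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"psi10_hat = {psi10_hat(*data):.3f}, "
      f"95% percentile CI = ({lo:.3f}, {hi:.3f})")
```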

Page 25


Example 1 (Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.5(O_1 + A_1))}{1 + \exp(0.5(O_1 + A_1))}, \qquad R_2 = \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 1$$

Estimation of $\psi_{10}$:

Estimator      |   Bias  |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  0.0003 | 0.0045 | 96.8*                     | 93.5*
soft-threshold |  0.0009 | 0.0036 | 95.3                      | 96.1

Intended coverage rate of CIs = 95%.

Page 26


Example 2 (Nearly Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.5(O_1 + A_1))}{1 + \exp(0.5(O_1 + A_1))}, \qquad R_2 = 0.01 A_2 + \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 0 \quad \text{but} \quad \psi_2^T H_{21} \approx 0$$

Estimation of $\psi_{10}$:

Estimator      |   Bias  |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  0.0003 | 0.0045 | 96.7*                     | 93.4*
soft-threshold |  0.0008 | 0.0036 | 95.4                      | 95.9

Intended coverage rate of CIs = 95%.

Page 27


Example 3 (Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.5(O_1 + A_1))}{1 + \exp(0.5(O_1 + A_1))}, \qquad R_2 = -0.5 A_1 + 0.5 A_2 + 0.5 A_1 A_2 + \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 0.5$$

Estimation of $\psi_{10}$:

Estimator      |   Bias   |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  -0.0401 | 0.0075 | 88.4*                     | 92.7*
soft-threshold |  -0.0185 | 0.0058 | 93.4*                     | 94.9

Intended coverage rate of CIs = 95%.

Page 28


Example 4 (Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(O_1)}{1 + \exp(O_1)}, \qquad R_2 = -0.5 A_1 + A_2 + 0.5 O_2 A_2 + 0.5 A_1 A_2 + \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 0.25$$

Estimation of $\psi_{10}$:

Estimator      |   Bias   |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  -0.0209 | 0.0074 | 92.7*                     | 93.1*
soft-threshold |  -0.0065 | 0.0069 | 93.8                      | 94.6

Intended coverage rate of CIs = 95%.

Page 29


Example 5 (Regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.1(O_1 + A_1))}{1 + \exp(0.1(O_1 + A_1))}, \qquad R_2 = -0.5 A_1 + A_2 + 0.5 O_2 A_2 + 0.5 A_1 A_2 + \varepsilon, \quad \varepsilon \sim N(0, 1)$$

$$P[\psi_2^T H_{21} = 0] = 0 \quad \text{and} \quad \psi_2^T H_{21} \ \text{is not very close to } 0$$

Estimation of $\psi_{10}$:

Estimator      |   Bias  |  MSE   | Coverage of Percentile CI | Coverage of Hybrid CI
hard-max       |  0.0009 | 0.0067 | 95.0                      | 93.8
soft-threshold |  0.0052 | 0.0074 | 94.8                      | 91.7*

Intended coverage rate of CIs = 95%.

Page 30


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 31


Smoking Cessation Study (Simplified)

Randomized Study at U-Michigan (Strecher et al., 2008)

Stage-1:
– $O_1$: motivation (1-10), self-efficacy (1-10), education (≤ high school vs. > high school)
– $A_1 = (A_{11}, A_{12})$, $A_{1j} \in \{-1, 1\}$ (low vs. high): $A_{11}$ = message source, $A_{12}$ = tailoring of story
– $R_1$: quit status at 6 months (1 = quit, 0 = not quit)

Stage-2:
– $O_2$: quit status at 6 months (1 = quit, 0 = not quit)
– $A_2$: treatment vs. control, coded $\{-1, 1\}$
– $R_2$: quit status at 12 months (1 = quit, 0 = not quit)


Page 32

Data Analysis Results

No significant stage 2 treatment effect: Non-regularity!

Stage 1 Analysis Summary:

Variable             | Method         | Coefficient | 95% Bootstrap CI
motivation           | hard-max       |  0.04 | (-0.00, 0.08)
motivation           | soft-threshold |  0.04 | (0.00, 0.08)*
self-efficacy        | hard-max       |  0.03 | (0.00, 0.06)*
self-efficacy        | soft-threshold |  0.03 | (0.00, 0.06)*
education            | hard-max       | -0.01 | (-0.07, 0.06)
education            | soft-threshold | -0.01 | (-0.07, 0.06)
source               | hard-max       | -0.15 | (-0.35, 0.06)
source               | soft-threshold | -0.15 | (-0.35, 0.06)
source:self-efficacy | hard-max       |  0.03 | (0.00, 0.06)*
source:self-efficacy | soft-threshold |  0.03 | (0.00, 0.06)*
story                | hard-max       |  0.05 | (-0.01, 0.11)
story                | soft-threshold |  0.05 | (-0.01, 0.11)
story:education      | hard-max       | -0.07 | (-0.13, -0.01)*
story:education      | soft-threshold | -0.07 | (-0.13, -0.01)*

Page 33

What have we learned as a policy-maker?

Source by Self-efficacy Interaction

The “highly personalized” level of source is more effective for subjects with higher self-efficacy (7 or above on a 1-10 scale).

Page 34

What have we learned as a policy-maker?

Story by Education Interaction

The “deeply tailored” level of story is more effective for subjects with less education (≤ high school).

Page 35


Outline

1 Introduction

2 Q-learning
    The Algorithm
    Bias
    Confidence Intervals
    Soft-threshold Estimator

3 Simulation Experiments

4 Analysis of Smoking Cessation Data

5 Discussion


Page 36


Discussion

The soft-threshold estimator reduces bias and improves coverage of CIs when estimating optimal policies via Q-learning in non-regular settings.

It can be used for both randomized and observational data.

The original hard-max estimator should be preferred over the soft-threshold estimator when the stage-2 treatments are “too different” (so-called regular settings).

– However, regular settings are unlikely to occur in clinical trial data due to ethical considerations (Freedman, NEJM, 1987).


Page 37


More Sophisticated Bootstrap CIs

We tried the double bootstrap (Davison and Hinkley, 1997; Nankervis, 2005) with the hard-max estimator.

– It gave valid coverage rates of CIs.

– It is computationally much more expensive.

We did not try m-out-of-n bootstrap (Shao, 1994).

– Choice of m in the present context is not obvious.

– It gives a slower rate of convergence than $\sqrt{n}$ even in regular settings.


Page 38


Ongoing and Future Work

Does the soft-threshold estimator have lower MSE (risk) than the hard-max estimator?

– Need to derive a theoretical optimality result, similar to the results in the area of wavelet shrinkage (e.g., Donoho and Johnstone, Biometrika, 1994; Johnstone and Silverman, Annals of Statistics, 2004).

How does the method generalize to more than two treatments(actions) and more than two stages?

To what extent does the problem of non-regularity arise in the case of more sophisticated regression models (e.g., regression trees, splines, neural networks, ...)?


Page 39


Contribution

“Evidence-based” treatment policies are of great interest in medicine – we have tried to address a need of this community.

We developed a shrinkage and thresholding scheme to reduce bias and to construct valid confidence intervals for some non-regular parameters of practical importance.

Our work offers a way of improving Q-learning (Fitted Q-iteration) that can benefit the reinforcement learning community.

Our method can potentially be useful in other multistage decision problems, e.g., market intelligence.


Page 40

Main References

L. Breiman (1995). Technometrics, 37(4): 373-384.

B. Chakraborty, V. Strecher, and S. Murphy (2008). Tentatively accepted by Statistical Methods in Medical Research.

B. Chakraborty, V. Strecher, and S. Murphy (2008). NIPS Workshop on Model Uncertainty and Risk in Reinforcement Learning.

M. Figueiredo and R. Nowak (2001). IEEE Transactions on Image Processing, 10(9): 1322-1331.

E. Moodie and T. Richardson (2008). To appear in Scandinavian Journal of Statistics.

J. Robins (2004). Proceedings of the Second Seattle Symposium on Biostatistics, New York, Springer.

V. Strecher, J. McClure, G. Alexander, B. Chakraborty, V. Nair, et al. (2008). American Journal of Preventive Medicine, 34(5): 373-381.

C. Watkins (1989). PhD Thesis, Cambridge University.

Page 41

Questions? Comments?

Thank you!

Page 42

Q-learning and Structural Nested Mean Models

A bridge between the Machine Learning and Biostatistics literatures ...

Lemma

Consider linear models for the Q-functions, and assume that:

(i) the parameters in $Q_1$ and $Q_2$ are distinct;
(ii) $A_t$ has zero conditional mean given $H_t$, $t = 1, 2$; and
(iii) the covariates used in the model for $Q_1$ are nested within the covariates used in the model for $Q_2$, i.e., $(H_{10}^T, H_{11}^T A_1) \subset H_{20}^T$.

Then Q-learning is algebraically equivalent to an inefficient version of Robins' method of estimation in Structural Nested Mean Models.


Page 43

Non-regularity in Other Areas

Inference based on shrinkage estimators: Sen and Saleh (1987), Leeb and Pötscher (2006)

Post-model-selection inference: Kabaila (1995), Leeb and Pötscher (2005)

Inference in autoregressive models in time series: Stock (1991), Andrews (1993)

Inference on eigenvalues of a covariance matrix: Beran and Srivastava (1985, 1987)

And many more ...


Page 44

Hard-threshold Estimator

$$\hat Y^{HT}_{1i} = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}| \cdot 1\!\left\{|\hat\psi_2^T H_{21,i}| > \lambda_i\right\}, \quad \lambda_i > 0$$

The corresponding estimator $\hat\psi^{HT}_1$ is the hard-threshold estimator.


Page 45

Hard-threshold Estimator

The hard-threshold estimator reduces the bias of the original “hard-max” estimator (Moodie and Richardson, 2008).

The hard-threshold pseudo-outcome has two points of discontinuity!

No data-driven choice of λi is currently available.

45 / 46

Page 46

Hard-threshold Estimator

Usually $\lambda_i$ is set equal to $z_{\alpha/2}\sqrt{H_{21,i}^T \hat\Sigma_2 H_{21,i}/n}$, where $\hat\Sigma_2/n$ is the estimated covariance matrix of $\hat\psi_2$ and $\alpha$ is a tuning parameter to be specified by the user:

$$\hat Y^{HT}_{1i} = R_{1i} + \hat\beta_2^T H_{20,i} + |\hat\psi_2^T H_{21,i}| \cdot 1\!\left\{\frac{\sqrt{n}\,|\hat\psi_2^T H_{21,i}|}{\sqrt{H_{21,i}^T \hat\Sigma_2 H_{21,i}}} > z_{\alpha/2}\right\}$$

So the tuning parameter is $\alpha$.

In simulations, we use α = 0.08, which we empirically found to work well.
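For comparison with the soft-threshold sketch earlier, the hard-threshold pseudo-outcome with this choice of $\lambda_i$ (again a sketch with assumed inputs, not the authors' implementation):

```python
import numpy as np
from scipy.stats import norm

def hard_threshold_pseudo_outcome(R1, H20, H21, beta2, psi2, Sigma2, n,
                                  alpha=0.08):
    """Hard-threshold pseudo-outcome: keep |psi2_hat' H21_i| only when its
    standardized value exceeds z_{alpha/2} (a sketch; Sigma2 / n is the
    assumed estimated covariance matrix of psi2)."""
    effect = H21 @ psi2                               # psi2_hat' H21_i
    se = np.sqrt(np.einsum('ij,jk,ik->i', H21, Sigma2, H21) / n)
    keep = np.abs(effect) > norm.ppf(1.0 - alpha / 2.0) * se
    return R1 + H20 @ beta2 + np.abs(effect) * keep
```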


