
Lecture 5: Sequential Multiple Assignment Randomized Trials (SMARTs) for DTR

Donglin Zeng, Department of Biostatistics, University of North Carolina

Introduction



I Consider simple DTRs: D = (D1, ..., DK)

Dk(Hk) = 1 or −1 (Ak = {−1, 1}).

That is, a fixed treatment is assigned to all patients at stage k.

I How shall we evaluate V(D)? Conduct a single-arm randomized trial.

I How shall we compare two or more different DTRs, D1 vs. D2 (not necessarily simple ones)? Conduct a multiple-arm randomized trial.


Randomization Is Key

I Why is randomization needed?
  I It provides representative coverage of the patient population.
  I It controls all potential confounders that may bias the comparison of treatment effects.

I One essential mathematical relationship links the observed outcome to the potential outcome:

\[
\begin{aligned}
E[R(d) \mid H] &= E[R(d) \mid A = d, H] && \text{(by randomization)} \\
&= E[R \mid A = d, H] && \text{(by the consistency assumption).}
\end{aligned}
\]

Consistency/SUTVA: R = R(d) if A = d, and R is not affected by the treatment assignments of other patients.
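To see why the first equality needs randomization, here is a minimal simulation sketch of the marginal version of the identity (averaging over H). Everything in it is an assumption made for illustration and not part of the lecture: the function name `simulate`, the outcome model R(1) = 2 − H + noise, and the 0.8/0.2 assignment probabilities in the non-randomized case.

```python
import random

def simulate(n, randomized, seed=0):
    """Return (mean of R(1) over everyone, mean of observed R among A = 1)."""
    rng = random.Random(seed)
    potential_sum = 0.0   # running sum of the potential outcome R(1)
    treated_sum = 0.0     # running sum of observed R among patients with A = 1
    treated_n = 0
    for _ in range(n):
        h = rng.random()                       # baseline severity H in [0, 1)
        r1 = 2.0 - h + rng.gauss(0.0, 0.1)     # hypothetical model for R(1)
        if randomized:
            p = 0.5                            # coin flip, independent of H
        else:
            p = 0.8 if h > 0.5 else 0.2        # sicker patients treated more often
        a = 1 if rng.random() < p else -1
        potential_sum += r1
        if a == 1:                             # consistency: R = R(1) when A = 1
            treated_sum += r1
            treated_n += 1
    return potential_sum / n, treated_sum / treated_n

print(simulate(200000, randomized=True))    # the two means roughly agree
print(simulate(200000, randomized=False))   # the treated mean is biased downward
```

Under randomization the treated patients are a random subset, so their average outcome tracks E[R(1)]. When assignment depends on H, the same average mixes the treatment effect with the effect of severity.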


Difficulty in Design Studies for Optimal DTRs

I Even for simple DTRs with only two treatment options at each stage, the total number of DTRs is 2^K.

I The optimal DTRs permit treatment decisions to depend on patient characteristics, Hk, at each stage k, so the total number of DTRs is infinite.

I Some smart design is needed.
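As a small illustration of the counting argument, the sketch below enumerates every simple K-stage regime with treatments coded as −1 and 1. The helper name `simple_dtrs` is hypothetical.

```python
from itertools import product

def simple_dtrs(K):
    """All simple K-stage regimes: one fixed treatment in {-1, 1} per stage."""
    return list(product((-1, 1), repeat=K))

for K in (1, 2, 3, 10):
    print(K, len(simple_dtrs(K)))   # the count equals 2**K
```

Ten stages already give 1024 simple regimes, so running one trial arm per regime quickly becomes impractical.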


Learn from Artificial Intelligence


What is Reinforcement Learning in AI

I Reinforcement learning (RL) is a machine learning method in artificial intelligence.

I Its history dates back decades, to when AI engineers tried to mimic the learning process in human brains.

I Unlike more commonly known machine learning methods, RL learns through interactions between an agent and an unknown environment, with a feedback loop.

I It consists of exploitation and exploration.
  Exploitation: apply decisions already learnt.
  Exploration: try new actions and learn from the new environment.


Link between RL and DTR

I They both involve sequential decisions.

I They share the same goal: find optimal decisions to maximize some value.

I Equivalence between DTR and RL terminologies:
  treatment == action
  state/covariate == state
  reward outcome == reward
  treatment rule == policy


How RL Designs Studies in AI

I In artificial intelligence (AI), a single agent (robot) trains and learns by itself for a particular task, for example, playing chess to win.

I A typical design for this learning is a trial-and-error learning process.

I The single agent tries many different episodes, each with a given policy.

I It learns from these episodes and updates its policies in new episodes.


How RL Should Be Used Differently in Medical Trials

I Instead of training one single agent with many policies, we "train" many patients, each given potentially different treatments.

I Randomization is necessary to ensure sufficient representation and to avoid selection in treatment assignment.

I This is essentially the idea of the Sequential Multiple Assignment Randomized Trial (SMART).


SMART



SMART: Sequential Multiple Assignment Randomized Trial (Lavori & Dawson 2000, 2004; Murphy 2005)

I Patients are sequentially randomized at each critical decision stage.

I Randomization probability may depend on current states of patients.

I Practical SMARTs:
  – Adaptive Pharmacological and Behavioral Treatments for ADHD;
  – Sequenced Treatment Alternatives to Relieve Depression (STAR*D);
  – CATIE for schizophrenia;
  – ExTENd for alcohol dependence;
  – Adaptive therapy for androgen-independent prostate cancer.


Example: Two-Stage SMART Study

I The study (Kasari et al., 2014) was designed to study a communication intervention for minimally verbal children with autism.

I The study aimed to test the effect of SGD, with each stage lasting 12 weeks.

I SGD: speech-generating device; JASP+EMT: blended developmental/behavioral intervention.

I The second stage had another 12-week follow-up.

I The study started with 61 eligible children, and 46 completed both stages.


Diagram of The Autism Study

... 12 weeks in duration. At the beginning of Stage 1 (baseline), all children meeting inclusion criteria were randomized with equal probability to JASP+EMT versus JASP+EMT+SGD. At the end of 12 weeks, children were assessed for early response versus slow response (defined in section "Stage 2 Treatments: Weeks 13 to 24") to Stage 1 treatment. At the beginning of Stage 2 (i.e., beginning of week 13), the subsequent treatments were adapted based on response status. All early responders continued with the same treatment for another 12 weeks. For slow responders to JASP+EMT+SGD, treatment was intensified (3 sessions per week). Slow responders to JASP+EMT were re-randomized with equal probability to intensified JASP+EMT or augmented JASP+EMT+SGD (Figure 1). The institutional review board at each site approved the study protocol. Randomization was conducted by an independent data-coordinating center.

Participants

Inclusion criteria were as follows: previous clinical diagnosis of ASD, confirmed by research-reliable staff using the Autism Diagnostic Observational Schedule (ADOS-Generic) Module 1, appropriate for children without phrase speech; chronological age between 5 and 8 years; evidence of being minimally verbal, with fewer than 20 spontaneous different words used during the 20-minute NLS; at least 2 years of prior intervention, per parent-report; and receptive language age of at least 24 months (based on performance of 2 of 3 assessments, given the potential difficulty complying with standardized test conditions). Exclusion criteria were as follows: major medical conditions other than ASD; sensory disabilities (e.g., deafness); motor disabilities (e.g., cerebral palsy); uncontrolled seizure disorders; and proficient use of an SGD based on parent-report and observation during study administration of the NLS.

FIGURE 1 Participant flow through trial. Note: JASP+EMT = spoken mode of JASPER plus Enhanced Milieu Teaching; JASP+EMT+SGD = spoken mode of JASPER plus Enhanced Milieu Teaching plus Speech Generating Device.

Source: "SMART for Minimally Verbal Children with ASD", Journal of the American Academy of Child & Adolescent Psychiatry, Volume 53, Number 6, June 2014, p. 637 (www.jaacap.org).

Figure: SMART Design of Autism Study (Kasari et al. 2014)
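The assignment rules in the design description above can be sketched as a short simulation. This is only an illustration: the function name `assign`, the seed, and the early-response probabilities in `p_early` are assumptions and are not taken from the study.

```python
import random

def assign(rng, p_early):
    """Simulate one child's path through the two-stage design.

    p_early maps each stage-1 arm to a hypothetical early-response probability.
    """
    stage1 = rng.choice(["JASP+EMT", "JASP+EMT+SGD"])   # equal-probability randomization
    early = rng.random() < p_early[stage1]
    if early:
        stage2 = "continue " + stage1                    # early responders keep their treatment
    elif stage1 == "JASP+EMT+SGD":
        stage2 = "intensified JASP+EMT+SGD"              # slow responders: no re-randomization
    else:
        stage2 = rng.choice(["intensified JASP+EMT",     # slow responders: re-randomized
                             "augmented JASP+EMT+SGD"])
    return stage1, early, stage2

rng = random.Random(2014)
p_early = {"JASP+EMT": 0.4, "JASP+EMT+SGD": 0.6}   # assumed values, not from the study
paths = [assign(rng, p_early) for _ in range(61)]
for path in paths[:5]:
    print(path)
```

Only slow responders to JASP+EMT are randomized twice, so the stage-2 randomization probability depends on the child's history, as a SMART allows.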


Advantages of SMART

Research questions to be answered from a SMART:

I Main effects of treatments
  I What is the better initial treatment, JASP+EMT or JASP+EMT+SGD?
  I What about the slow responders: intensify or not?

I Effects of embedded DTRs
  I JASP+EMT→intensify vs. JASP+EMT+SGD→intensify vs. JASP+EMT→JASP+EMT+SGD

I Exploring optimal treatment strategies (deep tailoring)
  I Should the decision to intensify in the second stage depend on additional intermediate outcomes?


General Advantages

I Valid comparisons of different treatment options at different stages, by virtue of randomization.

I Discover adaptive treatment strategies that are embedded in the SMART.

I Inform the development of adaptive and deeply tailored treatments (using potentially high-dimensional biomarkers).


Unbiased Value Estimation under SMART

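Recall the value function associated with $D = (D_1, \dots, D_K)$:

\[
V(D) = E[R(D)] = E[R(D_1, \dots, D_K)].
\]

Let $\pi_k(a, h)$ be the randomization probability $P(A_k = a \mid H_k = h)$.

I Start from stage $K$:

\[
\begin{aligned}
E[R(D_1, \dots, D_{K-1}, D_K)]
&= E\Big\{ E\big[ R(D_1, \dots, D_K) \,\big|\, H_K, A_K = D_K(H_K) \big] \Big\} \\
&= E\Big\{ E\Big[ \frac{R(D_1, \dots, D_{K-1}, A_K)\, I(A_K = D_K(H_K))}{\pi_K(A_K, H_K)} \,\Big|\, H_K \Big] \Big\} \\
&= E\Big[ R(D_1, \dots, D_{K-1}, A_K)\, \frac{I(A_K = D_K(H_K))}{\pi_K(A_K, H_K)} \Big].
\end{aligned}
\]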


Continue to Prior Stages

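I Now at stage $K-1$, we continue and repeat the same derivation to obtain

\[
E[R(D_1, \dots, D_{K-1}, D_K)]
= E\Big[ R(D_1, \dots, D_{K-2}, A_{K-1}, A_K)\, \frac{I(A_{K-1} = D_{K-1}(H_{K-1}),\, A_K = D_K(H_K))}{\pi_{K-1}(A_{K-1}, H_{K-1})\, \pi_K(A_K, H_K)} \Big].
\]

I Continue backwards until stage 1:

\[
\begin{aligned}
V(D) &= E\Big[ R(A_1, \dots, A_{K-1}, A_K)\, \frac{I(A_1 = D_1(H_1), \dots, A_K = D_K(H_K))}{\prod_{k=1}^K \pi_k(A_k, H_k)} \Big] \\
&= E\Big[ R\, \frac{\prod_{k=1}^K I(A_k = D_k(H_k))}{\prod_{k=1}^K \pi_k(A_k, H_k)} \Big].
\end{aligned}
\]

I The value function for $D$ can be estimated using the average reward outcome of the patients whose treatments follow $D$, weighted by the inverse of their randomization probabilities.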
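The weighted average described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `ipw_value`, the data layout, and the simulated two-stage model (binary states, equal randomization, reward 1 + 0.5·A1 + A2·H2 + noise) are all hypothetical and chosen so that the true value of the regime is known.

```python
import random

def ipw_value(data, regime, probs):
    """Inverse-probability-weighted estimate of V(D).

    data   : list of (histories, actions, reward), one tuple per patient,
             where histories[k] and actions[k] belong to stage k + 1
    regime : list of functions, regime[k](h) -> treatment in {-1, 1}
    probs  : list of functions, probs[k](a, h) -> P(A_k = a | H_k = h)
    """
    total = 0.0
    for histories, actions, reward in data:
        weight = 1.0
        for k, (h, a) in enumerate(zip(histories, actions)):
            if a != regime[k](h):
                weight = 0.0           # patient did not follow D at this stage
                break
            weight /= probs[k](a, h)   # divide by the randomization probability
        total += reward * weight
    return total / len(data)

# Simulated SMART with K = 2 and equal randomization (assumed model).
rng = random.Random(0)
data = []
for _ in range(20000):
    h1, h2 = rng.choice((-1, 1)), rng.choice((-1, 1))
    a1, a2 = rng.choice((-1, 1)), rng.choice((-1, 1))
    r = 1.0 + 0.5 * a1 + a2 * h2 + rng.gauss(0.0, 0.5)
    data.append(((h1, h2), (a1, a2), r))

regime = [lambda h: 1, lambda h: h]            # D1 = 1 always; D2 matches the stage-2 state
probs = [lambda a, h: 0.5, lambda a, h: 0.5]   # pi_k = 1/2 at both stages
print(ipw_value(data, regime, probs))          # true value under this model is 2.5
```

Patients whose treatments deviate from D at any stage get weight zero, and each follower is weighted up by the inverse of the probability of having been randomized onto D.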


An Alternative Interpretation

I In a SMART, a particular treatment assignment occurs with probability

\[
\prod_{k=1}^K \pi_k(A_k, H_k).
\]

I If we run a single-arm trial for a given $D$, the treatment assignment occurs with probability

\[
\prod_{k=1}^K I(A_k = D_k(H_k)).
\]

I To use the SMART to estimate the expected reward in the $D$-specified trial, importance sampling theory gives

\[
V(D) = E\Big[ R\, \frac{\prod_{k=1}^K I(A_k = D_k(H_k))}{\prod_{k=1}^K \pi_k(A_k, H_k)} \Big].
\]

I In conclusion, a SMART provides unbiased estimation of, and comparisons between, DTRs, and so potentially leads to valid estimation of optimal DTRs.
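As a worked special case (an illustration only, assuming K = 2 stages, equal randomization with both probabilities equal to 1/2, and n patients indexed by i, none of which the lecture fixes), the sample version of the formula reduces to a rescaled sum over the patients who followed D:

```latex
\[
\widehat{V}(D)
= \frac{1}{n} \sum_{i=1}^{n} R_i\,
  \frac{I(A_{1i} = D_1(H_{1i}))\, I(A_{2i} = D_2(H_{2i}))}{(1/2)(1/2)}
= \frac{4}{n} \sum_{i:\ \text{patient } i \text{ follows } D} R_i .
\]
```

About n/4 patients are expected to follow D in this setting, so the estimate is close to the plain sample mean of the reward among followers.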


Demo of SMART Trial


NSCLC trial

In treating advanced non-small cell lung cancer (NSCLC), patients typically experience two or more lines of treatment, and many studies demonstrate that three lines of treatment can improve survival for patients.

Timeline: 1st-line → 2nd-line → 3rd-line


1st-line therapy

Scagliotti et al., 2008:
  Cisplatin + Gemcitabine
  Cisplatin + Pemetrexed

Sandler et al., 2006:
  Paclitaxel + Carboplatin
  Paclitaxel + Carboplatin + Bevacizumab (Avastin)

Pirker et al., 2008:
  Cisplatin + Vinorelbine
  Cisplatin + Vinorelbine + Cetuximab (Erbitux)


2nd-line therapy

Approved regimens

Docetaxel
Pemetrexed
Erlotinib (approved for third-line, too)

Timing issue

I Immediate or delayed docetaxel? (Fidias et al., 2007; Ciuleanu et al., 2008)

I Comparison of PFS favors immediate docetaxel, but the difference in OS did not reach significance.

I Merits of pemetrexed and erlotinib remain unclear.


Important clinical questions

1. Among many approved 1st-line treatments, what treatment to administer?

2. Then, at the end of the 1st-line treatment:
  I Among approved 2nd-line treatments, what treatment to administer?
  I When to begin the 2nd line of treatment?

Figure: timeline with markers Immediate, Progression, Death. 1st-line: possible treatments. 2nd-line: possible treatments and initial timings.


Reinforcement Learning

I Has a well-known framework for optimizing a sequence of treatments in an evolving, time-varying system.

I Discovers individualized treatment regimens for cancer.

I Selects treatments that improve outcomes even when the relationship between treatments and outcomes is not fully known.

I Can evaluate treatments based on immediate and long-term effects.

Applications:

I Behavioral disorders (Pineau et al., 2007).

I Sequential multiple assignment randomized trial (SMART) designs for developing dynamic treatment regimes (Murphy et al., 2007).


Clinical Reinforcement Trial

Drafted Protocol

1. A finite, reasonably small set of decision times is identified.

2. For each decision time, a set of possible treatments to be randomized is identified.

3. A utility function is identified that can be assessed at each time point.

4. Patients are then recruited into the study and randomized to the treatment set under the protocol restrictions at each decision point.

5. The patient data are collected, and Q-learning, in combination with support vector regression (SVR), is applied to estimate the optimal treatment rule as a function of patient variables and biomarkers at each decision time.


Reinforcement Learning Trial

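Figure: timelines for the four possible observation patterns, with $t_1$, $t_2$, and $(t_2 + T_M) \wedge T_P$ marked on the time axis:

I Event at $T_1$, before $t_2$: $Y_D = 0$, $\delta = 1$, $T_0 = T_1$
I Censored at $C$, before $t_2$: $Y_D = 0$, $\delta = 0$, $T_0 = C$
I Reaches $t_2$, event after a further $T_2$: $Y_D = 1$, $\delta = 1$, $T_0 = t_2 + T_2$
I Reaches $t_2$, then censored at $C$: $Y_D = 1$, $\delta = 0$, $T_0 = C$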

I $T_1 = T_D \wedge t_2$, $Y_D = I(T_D \wedge C \ge t_2)$

I $T_2 = (T_D - t_2)\, I(T_D \ge t_2) = (T_D - t_2)\, I(T_1 = t_2)$

I $C_2 = (C - t_2)\, I(C \ge t_2)$

I $T_D = T_1 + T_2$, $T_0 = T_D \wedge C = T_1 \wedge C + Y_D (T_2 \wedge C_2)$
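The decomposition of the observed time in the last line can be checked numerically. The sketch below is an illustration only: the function name `observed_time`, the uniform draws for the event and censoring times, and the value 4.0 for t2 are assumptions.

```python
import random

def observed_time(t_d, c, t2):
    """Return (T0 computed directly, T0 rebuilt from the stage-wise pieces)."""
    t1 = min(t_d, t2)                               # T1 = TD ∧ t2
    y_d = 1 if min(t_d, c) >= t2 else 0             # YD = I(TD ∧ C >= t2)
    stage2 = (t_d - t2) if t_d >= t2 else 0.0       # T2 = (TD - t2) I(TD >= t2)
    c2 = (c - t2) if c >= t2 else 0.0               # C2 = (C - t2) I(C >= t2)
    direct = min(t_d, c)                            # T0 = TD ∧ C
    pieces = min(t1, c) + y_d * min(stage2, c2)     # T1 ∧ C + YD (T2 ∧ C2)
    return direct, pieces

rng = random.Random(0)
for _ in range(1000):
    t_d, c = rng.uniform(0, 30), rng.uniform(0, 30)   # hypothetical times
    direct, pieces = observed_time(t_d, c, t2=4.0)
    assert abs(direct - pieces) < 1e-9                # the two expressions agree
```

The identity is what lets the total observed time be split into a stage-1 reward and a stage-2 reward, which is the form the learning algorithm needs.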


Learning Algorithm

1. Data: at $t_1$, $\{(w_1, m_1, d_1, T_1 \wedge C, \Delta)\}_{i=1}^{n}$; at $t_2$, $\{(w_2, m_2, d_2, T_M, T_2 \wedge C_2, \Delta)\}_{j=1}^{n'}$, where $n' \le n$.

2. $Q_2$, the expected reward at stage 2, is estimated using regression methods with outcome

\[
T_2 \wedge C_2.
\]

3. $Q_1$, the expected optimal reward at stage 1, is estimated using regression methods with outcome

\[
T_1 \wedge C + I(T_1 = t_2) \times \max_{d_2, T_M} Q_2.
\]

4. In steps 2 and 3, the regression methods can include machine learning algorithms (SVR).

5. Given $Q_1(\hat\theta_1)$ and $Q_2(\hat\theta_2)$, compute $\hat D_1$ and $\hat D_2$ by maximizing $Q_1$ and $Q_2$, respectively.
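Steps 2, 3, and 5 can be sketched as a backward induction. This is a deliberately simplified illustration and not the lecture's method: cell averages over discrete states stand in for SVR, censoring and the timing variable TM are ignored, and all names (`cell_means`, `greedy`, `q_learning`) and the toy data are hypothetical.

```python
from collections import defaultdict

def cell_means(rows):
    """Average outcome per (state, action) cell: a saturated regression."""
    sums = defaultdict(lambda: [0.0, 0])
    for state, action, y in rows:
        sums[(state, action)][0] += y
        sums[(state, action)][1] += 1
    return {cell: s / n for cell, (s, n) in sums.items()}

def greedy(q, actions):
    """Rule mapping a state to the action with the largest estimated Q."""
    return lambda state: max(actions, key=lambda d: q.get((state, d), float("-inf")))

def q_learning(stage1, stage2, actions1, actions2):
    """Two-stage backward induction.

    stage1: list of (w1, d1, r1, idx2); idx2 indexes stage2, or None if the
            patient never reached stage 2
    stage2: list of (w2, d2, r2)
    """
    q2 = cell_means(stage2)                       # step 2: fit Q2
    rule2 = greedy(q2, actions2)
    pseudo = []
    for w1, d1, r1, idx2 in stage1:               # step 3: pseudo-outcome for Q1
        future = 0.0
        if idx2 is not None:
            w2 = stage2[idx2][0]
            future = q2[(w2, rule2(w2))]          # max over d2 of Q2
        pseudo.append((w1, d1, r1 + future))
    q1 = cell_means(pseudo)
    return greedy(q1, actions1), rule2            # step 5: maximize Q1 and Q2

# Toy data with one observation per cell (made-up numbers).
stage2 = [(0, "A3", 5.0), (0, "A4", 3.0), (1, "A3", 2.0), (1, "A4", 6.0)]
stage1 = [(0, "A1", 4.0, 0), (0, "A2", 4.0, 2), (1, "A1", 1.0, None), (1, "A2", 3.0, 1)]
rule1, rule2 = q_learning(stage1, stage2, ("A1", "A2"), ("A3", "A4"))
print(rule1(0), rule1(1), rule2(0), rule2(1))
```

The sketch assumes every state observed at stage 2 has at least one treated cell. With continuous covariates, the cell averages would be replaced by a regression fit, which is where a method such as SVR enters.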


Simulation Settings

I 1st-line treatment regimens: A1 or A2

I 2nd-line treatment regimens: A3 or A4

I initiation time for 2nd-line treatment: random timebetween 2.8 and 4.8 months from 1st-line treatment

I prognostic factors: Wt (quality of life), Mt (tumor size)

I four groups are generated along with their optimal treatment regimens


Simulation scenario table

Table 1: The scenarios studied in the simulation. Sample size = 100/group.

Group   State Variables                        Status        Timing      Optimal Regimen
1       W1 ∼ N(0.25, σ²), M1 ∼ N(0.75, σ²)     W1 ↓, M1 ↑    (graphic)   A1A32
2       W1 ∼ N(0.75, σ²), M1 ∼ N(0.75, σ²)     W1 ↑, M1 ↑    (graphic)   A1A41
3       W1 ∼ N(0.25, σ²), M1 ∼ N(0.25, σ²)     W1 ↓, M1 ↓    (graphic)   A2A33
4       W1 ∼ N(0.75, σ²), M1 ∼ N(0.25, σ²)     W1 ↑, M1 ↓    (graphic)   A2A42


Performance of the optimal regimen

Figure: bar chart of Overall Survival (y-axis, 0 to 25) by regimen, with the bar labels:

Regimen   Overall Survival
A1A31      9.23
A1A32     10.39
A1A33      9.04
A1A41      9.59
A1A42     10.25
A1A43      9.12
A2A31     10.53
A2A32     11.29
A2A33     10.31
A2A41      9.15
A2A42      9.75
A2A43      8.90
optimal   17.48


Performance of the optimal regimen

Table: Comparisons between true optimal regimens and estimated optimal regimens for overall survival (months). Each reinforcement trial is of size 100/group with 10 simulation runs. The confirmatory trial is of size 100/group.

          Optimal   True       Predicted survival
Group     regimen   survival   Min     Mean    Max
1         A1A32     16.00      15.83   15.93   16.00
2         A1A41     15.33      14.96   15.13   15.28
3         A2A33     18.37      17.75   17.99   18.27
4         A2A42     20.75      20.60   20.86   20.97
Average             17.61      17.28   17.48   17.63


Figure: histograms (y-axis: Frequency, 0 to 100) in four panels:

I A1A32, Group 1 (92%, 100%)
I A1A41, Group 2 (100%, 100%)
I A2A33, Group 3 (100%, 100%)
I A2A42, Group 4 (100%, 81%)


Sensitivity of the predicted survival to the sample size

Figure: Overall Survival (y-axis, 13 to 17) versus Sample Size for Each Group (x-axis: 2, 3, 6, 10, 20, 40, 60, 80, 100, 150, 200, 300, 400, 500, 600).


ε-SVR-C performance

Figure: Predicted Overall Survival (y-axis, 15.0 to 18.0) in three panels: (a) None vs. 25% delete; (b) None vs. 50% delete; (c) None vs. 75% delete.


Summary

I Identified optimal treatment strategies tailored to the proper subpopulations of NSCLC patients.

I Solved the timing problem of initiating second-line therapy in NSCLC.

I Handled right-censored data.


Design Limitations


Practical limitations of SMART

I The operational cost of administering multiple-stage studies with multiple treatments is high.

I The trial period is long (March et al. 2010).

I Study dropout and noncompliance are common even in regular RCTs:
  I In the CATIE study, 705 of 1460 patients stayed for the entire 18 months of the study.
  I In ExTENd, the drop-out rate was 17% (52 out of 302) in the first-stage treatment, with an additional 13% (41 out of 302) during the second stage.

