
Test, Learn, Adapt: Developing Public Policy with Randomised Controlled Trials

Laura Haynes

Owain Service

Ben Goldacre

David Torgerson


Dr Laura Haynes is Head of Policy Research at the Behavioural Insights Team. Laura leads the development of randomised controlled trials in a range of policy areas to identify cost-effective policy solutions grounded in behavioural science. Laura has a PhD in Experimental Psychology from the University of Cambridge and is a Visiting Researcher at King's College London.

Owain Service is the Deputy Director of the Behavioural Insights Team. He studied social and political sciences at Cambridge, and has spent most of his career working on public policy in the Prime Minister's Strategy Unit. He has also worked for the Foreign Office in Brussels and for the National Security Secretariat in the UK.

Dr Ben Goldacre trained in medicine and epidemiology and is now a Research Fellow at the London School of Hygiene and Tropical Medicine, working on problems in clinical trials. He is the author of Bad Science (4th Estate), a freelance journalist, and has made various documentaries on science, medicine, and policy for BBC Radio 4.

Professor David Torgerson is Director of the York Trials Unit. He has a wide interest in randomised trials, including those in policy. He has undertaken substantial numbers of clinical trials, but also non-medical trials in education, criminal justice, and general policy. He has published widely on the methods and methodology of randomised controlled trials.

Acknowledgements

We would like to thank government departments for sharing their recent research involving field trials with us. We'd also like to thank Professor Peter John, Professor Rachel Glennerster, Professor Don Green, Dr David Halpern and other members of the Behavioural Insights Team for their feedback on this paper, and also Michael Sanders for editing the document.

Contents

Executive Summary
Introduction
Part I - What is an RCT and why are they important?
  What is a randomised controlled trial?
  The case for RCTs - debunking some myths:
    1. We don't necessarily know 'what works'
    2. RCTs don't have to cost a lot of money
    3. There are ethical advantages to using RCTs
    4. RCTs do not have to be complicated or difficult to run
Part II - Conducting an RCT: 9 key steps
  Test
    Step 1: Identify two or more policy interventions to compare
    Step 2: Define the outcome that the policy is intended to influence
    Step 3: Decide on the randomisation unit
    Step 4: Determine how many units are required for robust results
    Step 5: Assign each unit to one of the policy interventions using a robustly random method
    Step 6: Introduce the policy interventions to the assigned groups
  Learn
    Step 7: Measure the results and determine the impact of the policy interventions
  Adapt
    Step 8: Adapt your policy intervention to reflect your findings
    Step 9: Return to Step 1

Executive Summary

Randomised controlled trials (RCTs) are the best way of determining whether a policy is working. They are now used extensively in international development, medicine, and business to identify which policy, drug or sales method is most effective. They are also at the heart of the Behavioural Insights Team's methodology. However, RCTs are not routinely used to test the effectiveness of public policy interventions in the UK. We think that they should be.

What makes RCTs different from other types of evaluation is the introduction of a randomly assigned control group, which enables you to compare the effectiveness of a new intervention against what would have happened if you had changed nothing.

The introduction of a control group eliminates a whole host of biases that normally complicate the evaluation process - for example, if you introduce a new "back to work" scheme, how will you know whether those receiving the extra support might not have found a job anyway?

In the fictitious example below in Figure 1, we can see that those who received the back to work intervention were much more likely to find a job than those who did not.

Figure 1. The basic design of a randomised controlled trial (RCT), illustrated with a test of a new 'back to work' programme.

Because we have a control group, we know that it is the intervention that achieves the effect and not some other factor (such as generally improving economic conditions).

With the right academic and policy support, RCTs can be much cheaper and simpler to put in place than is often supposed. By enabling us to demonstrate just how well a policy is working, RCTs can save money in the long term - they are a powerful tool to help policymakers and practitioners decide which of several policies is the most cost effective, and also which interventions are not as effective as might have been supposed. It is especially important in times of shrinking public sector budgets to be confident that public money is spent on policies shown to deliver value for money.

We have identified nine separate steps that are required to set up any RCT. Many of these steps will be familiar to anyone putting in place a well-designed policy evaluation - for example, the need to be clear, from the outset, about what the policy is seeking to achieve. Some - in particular the need to randomly allocate individuals or institutions to different groups which receive different treatment - are what lend RCTs their power. The nine steps are at the heart of the Behavioural Insights Team's 'test, learn, adapt' methodology, which focuses on understanding better what works and continually improving policy interventions to reflect what we have learnt. They are described in the box below.


Test
1. Identify two or more policy interventions to compare (e.g. old vs new policy; different variations of a policy).
2. Determine the outcome that the policy is intended to influence and how it will be measured in the trial.
3. Decide on the randomisation unit: whether to randomise to intervention and control groups at the level of individuals, institutions (e.g. schools), or geographical areas (e.g. local authorities).
4. Determine how many units (people, institutions, or areas) are required for robust results.
5. Assign each unit to one of the policy interventions, using a robust randomisation method.
6. Introduce the policy interventions to the assigned groups.

Learn
7. Measure the results and determine the impact of the policy interventions.

Adapt
8. Adapt your policy intervention to reflect your findings.
9. Return to Step 1 to continually improve your understanding of what works.

Introduction

Randomised controlled trials (RCTs) are the best way of determining whether a policy is working. They have been used for over 60 years to compare the effectiveness of new medicines.1 RCTs are increasingly used in international development to compare the cost effectiveness of different interventions for tackling poverty.2,3 And they are also employed extensively by companies, who want to know which website layout generates more sales. However, they are not yet common practice in most areas of public policy (see Figure 2).

This paper argues that we should and could use RCTs much more extensively in domestic public policy to test the effectiveness of new and existing interventions and variations thereof; to learn what is working and what is not; and to adapt our policies so that they steadily improve and evolve both in terms of quality and effectiveness.

Part I of this paper sets out what an RCT is and why they are important. It addresses many of the common arguments against using RCTs in public policy and argues that trials are not as challenging to put in place as is often assumed, and that they can be highly cost-effective ways of evaluating policy outcomes and assessing value for money.

Figure 2. 20th century RCTs in health and in social welfare, education, crime and justice.4

Part II of the paper outlines 9 key steps that any RCT needs to have in place. Many of these steps should be fundamental to any policy initiative; others will require support from academics or centres of expertise within government.

The 'test, learn, adapt' philosophy set out in this paper is at the heart of the way that the Behavioural Insights Team works. We believe that a 'test, learn, adapt' approach has the potential to be used in almost all aspects of public policy:

- Testing an intervention means ensuring that you have put in place robust measures that enable you to evaluate the effectiveness or otherwise of the intervention.

- Learning is about analysing the outcome of the intervention, so that you can identify 'what works' and whether or not the effect size is great enough to offer good value for money.

- Adapting means using this learning to modify the intervention (if necessary), so that we are continually refining the way in which the policy is designed and implemented.


Part I - What is an RCT and why are they important?

What is a randomised controlled trial?

Often we want to know which of two or more interventions is the most effective at attaining a specific, measurable outcome. For example, we might want to compare a new intervention against normal current practice, or compare different levels of "dosage" (e.g. home visits to a teenage expectant mother once a week, or twice a week) against each other.

Conventionally, if you want to evaluate whether an intervention has a benefit, you simply implement it, and then try to observe the outcomes. For example, you might establish a high intensity "back to work" assistance programme, and monitor whether participants come off benefits faster than before the programme was introduced. However, this approach suffers from a range of drawbacks which make it difficult to identify whether it was the intervention that had the effect or some other factor. Principal amongst these are uncontrolled, external factors. If there is strong economic growth, for example, we might expect more people to find employment regardless of our new intervention.

Another, trickier analytical challenge is dealing with so-called "selection bias": the very people who want to participate in a back to work programme are systematically different to those who do not. They may be more motivated to find work, meaning that any benefits of the new intervention will be exaggerated. There are statistical techniques which people use to try and account for any pre-existing differences between the groups who receive different interventions, but these are always imperfect and can introduce more bias.

Randomised controlled trials get around this problem by ensuring that the individuals or groups of people receiving both interventions are as closely matched as possible. In our "back to work programme" example, this might involve identifying 2000 people who would all be eligible for the new programme and randomly dividing them into two groups of 1000, of which one would get the normal, current intervention and the other would get the new intervention.

By randomly assigning people to groups we can eliminate the possibility of external factors affecting the results, and demonstrate that any differences between the two groups are solely a result of differences in the interventions they receive.

Part II of this paper describes in more detail how to run a randomised controlled trial, but at the heart of any RCT are a number of key elements. RCTs work by dividing a population into two or more groups by random lot, giving one intervention to one group, the other to another, and measuring the pre-specified outcome for each group. This process is summarised in Figure 3.

Let us imagine that we are testing a new "back to work" programme which aims to help job seekers find work. The population being evaluated is divided into two groups by random lot. But only one of these groups is given the new intervention ('the intervention group'), in this case the "back to work" programme. The other group (the 'control group') is given the usual support that a jobseeker would currently be eligible for. In this case, the control group is akin to a placebo condition in a clinical drug trial.

Figure 3. Illustration of a randomised controlled trial (RCT) to test a new 'back to work' programme (positive outcome).

Figure 4. Illustration of a randomised controlled trial (RCT) to test a new 'back to work' programme (neutral outcome).

Box 1: Demonstrating the impact of text messaging on fine repayments

The Courts Service and the Behavioural Insights Team wanted to test whether or not sending text messages to people who had failed to pay their court fines would encourage them to pay prior to a bailiff being sent to their homes. The way this question was answered is a clear example of the "test, learn, adapt" approach, and the concurrent testing of multiple variations to find out what works best.

In the initial trial, individuals were randomly allocated to five different groups. Some were sent no text message (control group), while others (intervention groups) were sent either a standard reminder text or a more personalised message (including the name of the recipient, the amount owed, or both). The trial showed that text message prompts can be highly effective (Figure 5).

A second trial was conducted using a larger sample (N=3,633) to determine which aspects of personalised messages were instrumental to increasing payment rates. The pattern of results was very similar to the first trial. However, the second trial enabled us to be confident not only that people were more likely to make a payment on their overdue fine if they received a text message containing their name, but that the average value of fine repayments went up by over 30%.

The two trials were conducted at very low cost: as the outcome data was already being collected by the Courts Service, the only cost was the time for team members to set up the trial. If rolled out nationally, personalised text message reminders would improve collection of unpaid fines; simply sending a personalised rather than a standard text is estimated to bring in over £3 million annually. The savings from personalised texts are many times higher than from not sending any text reminder at all. In addition to these financial savings, the Courts Service estimates that sending personalised text reminders could reduce the need for up to 150,000 bailiff interventions annually.

Figure 5. Initial trial: repayment rates by individuals (N=1,054)


In the example in Figure 3, jobseekers who have found full time work 6 months into the trial are coloured green. The trial shows that many more of the individuals in the new "back to work" programme are now in work compared to those in the control group. It is important to note that two stick figures in the control group have also found work, perhaps having benefited from the normal jobseeker support provided to all those on benefits.

If the new "back to work" programme was no better than the current service provided to jobseekers, we would have seen a similar pattern in both the intervention group and the control group receiving the current service. This is illustrated best in Figure 4, which shows a different set of results for our programme. Here, the results of the trial demonstrate that the new, expensive "back to work" programme is no better than current practice. If there had been no control group, we might have seen people getting jobs after taking part in the new "back to work" programme, and wrongly concluded that they had done so because of the programme itself. This might have led us to roll out the new, expensive (and ineffective) intervention. A mistake like this was avoided by the DWP in a real-life RCT looking at the cost-effectiveness of different types of interventions (see Box 2).

Wherever there is the potential for external factors to affect the outcomes of a policy, it is always worth considering using an RCT to test the effectiveness of the intervention before implementing it in the whole population. When we do not, it is easy to confuse changes that might have occurred anyway with the impact of a particular intervention.

Our fictitious "back to work" example assumes that we are interested principally in understanding which of two large-scale interventions is working most effectively.

Box 2. Using RCTs to know what really works to help people get back into employment

In 2003, the Department for Work and Pensions (DWP) conducted an RCT to examine the impact of three new programmes on Incapacity Benefit claimants: support at work, support focused on their individual health needs, or both.5,6 The extra support cost £1400 on average, but the trial found no benefit over the standard support that was already available. The RCT ultimately saved the taxpayer many millions of pounds, as it provided unambiguous evidence that the costly additional support was not having the intended effect.

More recently, the DWP was keen to explore whether the intensity of the signing-on process required of jobseekers on benefits could be reduced without worsening outcomes. In a trial involving over 60,000 people, the usual fortnightly signing-on process was compared against several others which were less resource intensive (e.g. signing on by telephone, or less frequently). All of the alternatives to the status quo tested in trials large enough to show reliable effects were found to increase the time people took to find work.7 As a result, despite other changes to the benefits system, DWP policy continues to require people to sign on fortnightly.


In many cases, an RCT is not just interested in the headline policy issue. Instead, it may be used to compare several different ways of implementing smaller aspects of the policy. As many of the other examples set out in this paper show, one of the great things about randomised controlled trials is that they also allow you to test the effectiveness of particular aspects of a wider programme. Testing small parts of a programme enables policy makers to continually refine policy, homing in on the particular aspect of the intervention which is having the greatest impact.

Regardless of whether we are comparing two large scale interventions or smaller aspects of a single policy, the same basic principles of an RCT hold true: by comparing two identical groups, chosen at random, we can control for a whole range of factors and understand what is working and what is not.
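To make the mechanics concrete, here is a minimal sketch of the random allocation step in Python. It is illustrative only, not taken from the paper: the participant identifiers, the seed, and the group labels are invented, and a real trial would randomise under the guidance of a trials statistician.

```python
import random

def randomise(participants, seed=20120614):
    """Randomly split a recruited pool into intervention and control groups.

    Seeding the generator makes the allocation reproducible and auditable.
    Participants should already be recruited before this step is run.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}

# Illustrative use: 2,000 eligible jobseekers split into two groups of 1,000.
groups = randomise([f"jobseeker_{i:04d}" for i in range(2000)])
assert len(groups["intervention"]) == len(groups["control"]) == 1000
```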


Box 3: The link between theories of growth, innovation, and RCTs

The growing interest in the use of RCTs as an important tool of policymaking and practice resonates with broader currents of thinking. When money is short it is essential to make sure that it is being spent on approaches that work, and even small marginal improvements in cost effectiveness are precious. RCTs are an extremely powerful tool to pinpoint cost-effectiveness - and flush out low-value spend.

These methods also resonate strongly with emerging views on social and economic progress. Many leading thinkers have concluded that in complex systems, from biological ecosystems to modern economies, much progress - if not most - occurs through a process of trial and error.8 Economies and ecosystems that become dominated by too narrow a range of practices, species or companies are more vulnerable to failure than more diverse systems.9,10 Similarly, such thinkers tend to be sceptical about the ability of even the wisest experts and leaders to offer a comprehensive strategy or masterplan detailing 'the' best practice or answer on the ground (certainly on a universal basis). Instead they urge the deliberate nurturing of variation, coupled with systems, or dynamics, that squeeze out less effective variations and reward and expand those variations that seem to work better.

The practical expression of this thinking includes the drive for greater devolution of policy-making, and the harnessing of markets to deliver goods and services. Encouraging variation needs to be matched by mechanisms that identify and nurture successful innovations. This includes sharpening transparency and feedback loops in consumer markets and public services, noting that these lead to the selective expansion of better provision and often the growth of smaller, independent provision.11 In public services, and where markets and payment by results may be inappropriate, RCTs and multi-arm trials may play a powerful role, especially where their results are widely reported and applied.

The case for RCTs: debunking some myths

There are many fields in which randomised trials are now common practice, and where failing to do them would be regarded as bizarre, or even reckless. RCTs are the universal means of assessing which of two medical treatments works best, whether it is a new drug compared with the current best treatment, two different forms of cancer surgery, or even two different compression stockings. This was not always the case: when trials were first introduced in medicine, they were strongly resisted by some clinicians, many of whom believed that their personal expert judgement was sufficient to decide whether a particular treatment was effective.

RCTs are also increasingly being used to investigate the effectiveness and value for money of various different international development programmes (see Box 4). In business, when companies want to find out which of two webpage designs will encourage the most "click-throughs" and sales, it is common to randomly assign website visitors to one of several website designs, and then track their clicks and purchasing behaviour (see Box 5).

But while there are some good examples of policymakers using RCTs in the UK, they are still not in widespread use. This may partly be due to a lack of awareness, but there are also many misunderstandings about RCTs, which lead to them being inappropriately rejected. Here we go through each of these myths in turn, addressing the incorrect assumption that RCTs are always difficult, costly, unethical, or unnecessary. We argue that it is dangerous to be overconfident in assuming that interventions are effective, and that RCTs play a vital role in demonstrating not only the effectiveness of an intervention, but also value for money.

Box 4. Using RCTs to improve educational outcomes in India

One of the areas of rapid growth in the use of RCTs in recent years has been in international development. Numerous trials have been conducted to determine how best to tackle poverty in the developing world: how to tackle low crop yields, how to encourage use of mosquito nets, ensure teachers turn up at class, foster entrepreneurship, and increase vaccination rates.

For example, efforts in recent decades to make education universally available in developing countries led to improved school enrolment and attendance. However, the quality of education available to children from poor backgrounds remains an issue: one 2005 survey across India indicated that over 40% of children under 12 could not read a simple paragraph, and 50% couldn't perform a simple subtraction.

In partnership with an education NGO, US researchers conducted an RCT to determine whether a low cost, in-school remedial education programme could improve school outcomes in India. Almost 200 schools were randomly allocated to receive a tutor for either their third or fourth grade. The impact of the programme was ascertained by comparing grade 3 outcomes for those schools with and without grade 3 tutors.

The tutors were women from the local community who were paid a fraction of a teacher's salary, and they worked separately with groups of children who were falling behind their peers for half of the school day. Results indicated that the remedial programme significantly improved test scores, particularly in maths.12 The programme was judged so successful (and cost effective relative to other programmes to improve school performance) that it has been scaled up across India.



Box 5. Using RCTs to improve business performance

Many companies are increasingly using RCT designs to test consumer responses to different presentations of their products online. Little of this information is publicly available, but it is well known that companies such as Amazon and eBay use routine web traffic on their sites to test out what works best to drive purchases. For example, some customers might view a particular configuration of a webpage, while others will view a different one. By tracking the "click-throughs" and purchasing behaviour of customers who view the different versions of the website, companies can tweak web page designs to maximise profits. A few examples are provided below.

During the recent Wikipedia fund-raising drive, a picture of the founder, Jimmy Wales, appeared in the donations advert at the top of the page: this was the result of a series of trials comparing different designs of advert, delivering them randomly to website visitors, and monitoring whether or not they donated.

Netflix is a company that offers online movie streaming, and typically runs several user experience experiments simultaneously. When they were trialling the "Netflix Screening Room", a new way to preview movies, they produced four different versions of the service. These were each rolled out to four groups of 20,000 subscribers, and a control group received the normal Netflix service. Users were then monitored to see if they watched more films as a result.13

Delta Airlines have also used experimentation to improve their website design. In 2006, while increasing numbers of people were booking their travel online, web traffic to Delta Airlines' website was failing to generate the anticipated number of bookings. Almost 50% of the visitors to their website were dropping off before completing the booking process: after selecting their flight, potential customers often abandoned the booking when they reached the web page requiring input of their personal information (name, address, card details). Rather than changing the entire website, Delta focused on making changes to the specific pages which failed to convert potential customers into sales. Numerous variations were tested online, randomly assigning customers to different versions of the webpages. Delta discovered that by removing detailed instructions at the top of the page requesting personal information, customers were much more likely to complete the booking process. As a result of implementing this and other subtle design changes identified during the testing process, conversion rates to ticket sales improved by 5%,14 a small but highly valuable change.
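The companies above do not publish their testing code, but the underlying mechanism is simple to sketch. The illustrative Python below shows one common way a site might deterministically assign visitors to page variants; the function name, identifiers, and the five-arm setup (loosely echoing the Netflix example) are assumptions, not any company's actual system.

```python
import hashlib

def assign_variant(visitor_id, variants, experiment):
    """Deterministically assign a visitor to a page variant.

    Hashing the visitor ID together with the experiment name gives a
    stable, effectively random bucket: returning visitors always see the
    same version, and different experiments assign independently.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Illustrative use: four new versions plus the normal service as control.
arms = ["control", "version_a", "version_b", "version_c", "version_d"]
print(assign_variant("visitor-123", arms, "screening_room"))
```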

1. We don't necessarily know 'what works'

Policymakers and practitioners often feel they have a good understanding of what interventions are likely to work, and use these beliefs to devise policy. Even if there are good grounds for believing a policy will be effective, an RCT is still worthwhile to quantify the benefit as accurately as possible. A trial can also help to demonstrate which aspects of a programme are having the greatest effect, and how it could be further improved. For example, if we were implementing a new programme for entrepreneurs based on start-up funding, it would be useful to know whether doubling the amount of available funding has a significant effect on success or makes no difference.

We should also recognise that confident predictions about policy made by experts often turn out to be incorrect. RCTs have demonstrated that interventions which were designed to be effective were in fact not (see Box 2). They have also shown that interventions about which there was initial scepticism were ultimately worthwhile. For example, when the Behavioural Insights Team and the Courts Service looked at whether text messaging might encourage people to pay their court fines, few predicted at the outset that a personalised text would increase repayment rates and amounts so significantly (see Box 1).

But there are also countless examples of RCTs that have overturned traditional assumptions about what works, and showed us that interventions believed to be effective were, in reality, harmful. The steroid injection case (see Box 6) is a powerful example of how apparently sound assumptions do not necessarily hold true when finally tested. Similarly, the Scared Straight programme, which exposes young people to the realities of a life of crime, is a good example of a well-intentioned policy intervention with an apparently sound evidence base, but which RCTs have shown to have adverse effects (see Box 7). RCTs are the best method we have for avoiding these mistakes, by giving policymakers and practitioners robust evidence of the effectiveness of a policy intervention, and ensuring that we know what would have happened in the absence of the intervention.

2. RCTs don't have to cost a lot of money

The costs of an RCT depend on how it is designed: with planning, they can be cheaper than other forms of evaluation. This is especially true when a service is already being delivered, and when outcome data is already being collected from routine monitoring systems, as in many parts of the public sector. In contrast to trials in medicine, a public policy trial will not necessarily require us to recruit participants outside of normal practice or to put new systems in place to deliver interventions or monitor outcomes.

The Behavioural Insights Team has worked with a range of different government departments to run trials at little additional cost beyond the time of team members. For example, in trials the team has run with local authorities, HMRC, DVLA, and the Courts Service (and is about to run with Job Centre Plus), the data is already being routinely collected and processes are already in place to deliver interventions, whether it is a letter, or a fine, or an advisory service for unemployed people.


When considering the additional resources that might be required to run an RCT, we should remember that they are often the best way to establish if a programme offers good value for money. In some cases, a trial may lead us to conclude that a programme is too expensive to roll out, if the extra benefits of the intervention are negligible. In others, a trial may demonstrate that a programme delivers excellent value for money, and so should be rolled out more widely. By demonstrating how much more or less effective the intervention was than the status quo, policymakers can determine whether the cost of the intervention justifies the benefits.

Rather than considering how much an RCT costs to run, then, it might be more appropriate to ask: what are the costs of not doing an RCT?16

3. There are ethical advantages to using RCTs

Sometimes people object to RCTs in public policy on the grounds that it is unethical to withhold a new intervention from people who could benefit from it. This is particularly the case where additional money is being spent on programmes which might improve the health, wealth, or educational attainment of one group. It is true to say that it can be challenging to withhold a treatment or intervention from someone that we believe might benefit from it.

Box 6. Steroids for head injury: saving lives, or killing people?

For several decades, adults with severe head injury were treated using steroid injections. This made perfect sense in principle: steroids reduce swelling, and it was believed that swelling inside the skull killed people with head injuries, by crushing their brain. However, these assumptions were not subject to proper tests for some time.

Then, a decade ago, this assumption was tested in a randomised trial. The study was controversial, and many opposed it, because they thought they already knew that steroids were effective. In fact, when the results were published in 2005,15 they showed that people receiving steroid injections were more likely to die: this routine treatment had been killing people, and in large numbers, because head injuries are so common. These results were so extreme that the trial had to be stopped early, to avoid any additional harm being caused.

This is a particularly dramatic example of why fair tests of new and existing interventions are important: without them, we can inflict harm unintentionally, without ever knowing it; and when new interventions become common practice without good evidence, then there can be resistance to testing them in the future.

This paper does not argue that we should do so when we know that an intervention is already proven to be beneficial. However, we do argue that we need to be clear about the limits of our knowledge, and that we will not be certain of the effectiveness of an intervention until it is tested robustly. Sometimes interventions which were believed to be effective turned out to be ineffective or even actively harmful (see Boxes 6 and 7). This can even be the case with policies that we might intuitively think will be guaranteed to work. For example, incentives have been used to encourage adult learners to attend literacy classes, but when an RCT of this policy was conducted, it was found that participants receiving incentives attended approximately 2 fewer classes per term than the non-incentive group.20

In this trial, using small incentives not only wasted resources, it actively reduced class attendance. Withholding the intervention was better than giving it out, and if a trial had never been conducted, we could have done harm to adult learners, with the best intentions, and without ever knowing that we were doing so.

It is also worth noting that policies are often rolled out slowly, on a staggered basis, with some regions "going early", and these phased introductions are not generally regarded as unethical. The delivery of the Sure Start programme is an example of this. If anything, a phased introduction in the context of an RCT is more ethical, because it generates new high quality information that may help to demonstrate that an intervention is cost effective.

Box 7: The Scared Straight Programme: deterring juvenile offenders, or encouraging them?

"Scared Straight" is a programme developed in the US to deter juvenile delinquents and at-risk children from criminal behaviour. The programme exposed children to the frightening realities of leading a life of crime, through interactions with serious criminals in custody. The theory was that these children would be less likely to engage in criminal behaviour if they were made aware of the serious consequences. Several early studies, which looked at the criminal behaviours of participants before and after the programme, seemed to support these assumptions.17 Success rates were reported as being as high as 94%, and the programme was adopted in several countries, including the UK.

None of these evaluations had a control group showing what would have happened to these participants if they had not participated in the programme. Several RCTs set out to rectify this problem. A meta-analysis of 7 US trials, which randomly assigned half of the sample of at-risk children to the programme, found that "Scared Straight" in fact led to higher rates of offending behaviour: "doing nothing would have been better than exposing juveniles to the program".18 Recent analyses suggest that the costs associated with the programme (largely related to the increase in reoffending rates) were over 30 times higher than the benefits, meaning that "Scared Straight" programmes cost the taxpayer a significant amount of money and actively increased crime.19


4. RCTs do not have to be complicated or difficult to run

RCTs in their simplest form are very straightforward to run. However, there are some hidden pitfalls which mean that some expert support is advisable at the outset. Some of these pitfalls are set out in the next chapter, but they are no greater than those faced in any other form of outcome evaluation, and can be overcome with the right support. This might involve, for example, making contact with the Behavioural Insights Team. We can advise on trial design, put policy makers in touch with academics who have experience of running RCTs, and help to guide the design of a trial. Very often, academics will be happy to assist in a project which will provide them with new evidence in an area of interest to their research, or the prospect of a published academic paper.

The initial effort to build in randomisation, and clearly define outcomes before a pilot is initiated, is often time well spent. If an RCT is not run, then any attempt to evaluate the impact of an intervention will be difficult, expensive, and biased: complex models will be required to try to disentangle observed effects which could have multiple external causes. It is much more efficient to put a smaller amount of effort into the design of an RCT before a policy is implemented.


Box 8: Family Nurse Partnership: building in rigorous evaluation to a wider roll-out

The Family Nurse Partnership (FNP) is a preventative programme for vulnerable first time mothers. Developed in the US, it involves structured, intensive home visits by specially trained nurses from early pregnancy until the child is two years of age. Several US RCTs21 have shown significant benefits for disadvantaged young families and substantial cost savings. For example, FNP children have better socio-emotional development and educational achievement and are less likely to be involved in crime. Mothers have fewer subsequent pregnancies and greater intervals between births, are more likely to be employed and less likely to be involved in crime.

FNP has been offered in the UK since 2007, often through Sure Start Children's Centres, and the Department of Health has committed to doubling the number of young mothers receiving support through this programme to 13,000 (at any one time) in 2015. Meanwhile, the Department is funding an RCT evaluation of the programme, to assess whether FNP benefits families over and above universal services and offers value for money. It involves 18 sites across the UK and approximately 1650 women, the largest trial to date of FNP. Reporting in 2013, its outcome measures include smoking during pregnancy, breastfeeding, admissions to hospital for injuries and ingestions, further pregnancies, and child development at age 2.

PART II - Conducting an RCT: 9 key steps

How do you conduct a randomised controlled trial?

Part I of this paper makes the case for using RCTs in public policy. Part II of this paper is about how to conduct an RCT. It does not attempt to be comprehensive. Rather, it outlines the necessary steps that any RCT should go through and points to those areas in which a policy maker may wish to seek out more expert advice.

We have identified nine separate steps that any RCT will need to put in place. Many of these nine steps will be familiar to anyone putting in place a well-designed policy evaluation - for example, the need to be clear, from the outset, what the policy is seeking to achieve. Several, however, will be less familiar, in particular the need to randomly allocate the intervention being tested to different intervention groups. These are summarised below and set out in more detail in the sections that follow.

Test
1. Identify two or more policy interventions to compare (e.g. old vs new policy; different variations of a policy).
2. Determine the outcome that the policy is intended to influence and how it will be measured in the trial.
3. Decide on the randomisation unit: whether to randomise to intervention and control groups at the level of individuals, institutions (e.g. schools), or geographical areas (e.g. local authorities).
4. Determine how many units (people, institutions, or areas) are required for robust results.
5. Assign each unit to one of the policy interventions, using a robust randomisation method.
6. Introduce the policy interventions to the assigned groups.

Learn
7. Measure the results and determine the impact of the policy interventions.

Adapt
8. Adapt your policy intervention to reflect your findings.
9. Return to Step 1 to continually improve your understanding of what works.


Step 1: Identify two or more policy interventions to compare

RCTs are conducted when there is uncertainty about which is the best of two or more interventions, and they work by comparing these interventions against each other. Often, trials are conducted to compare a new intervention against current practice. The new intervention might be a small change, or a set of small changes to current practice; or it could be a whole new approach which is proving to be successful in a different country or context, or that has sound theoretical backing.

Before designing an RCT, it is important to consider what is currently known about the effectiveness of the intervention you are proposing to test. It may be, for example, that RCTs have already been conducted in similar contexts showing the measure to be effective, or ineffective. Existing research may also help to develop the policy intervention itself. A good starting point is the Campbell Collaboration archives,22 which support policymakers and practitioners by summarising existing evidence on social policy interventions.

It is also important that trials are conducted on the very same intervention that would be rolled out if the trial was successful. Often there is a temptation to run an RCT using an ideal, perfect policy intervention, which is so expensive that it could never be rolled out nationwide. Even if we did have the money, such an RCT would be uninformative, because the results would not generalise to the real-world policy implementation. We need to be sure that the results of our trial will reflect what can be achieved should the policy be found to be effective and then rolled out more widely. In order for findings to be generalisable, and relevant to the whole country, the intervention must be representative, as should the eagerness with which practitioners deliver it, and the way data is collected.

The Behavioural Insights Team, in conducting public policy RCTs, will usually spend a period of time working with front-line organisations to both understand what is likely to be feasible, and to learn from staff who themselves might have developed potentially effective but untested new methods for achieving public policy outcomes.

Box 9. Comparing different policy options & testing small variations on a policy

An RCT is not necessarily a test between doing something and doing nothing. Many interventions might be expected to do better than nothing at all. Instead, trials can be used to establish which of a number of policy intervention options is best.

In some cases, we might be interested in answering big questions about which policy option is most appropriate. For example, imagine we had the money to upgrade the IT facilities in all secondary schools, or pay for more teachers, but not both. We might run a 3-arm trial (see Figure 6), with a control group (a number of schools continuing with current IT and the same number of teachers) and two intervention groups (schools who received an IT upgrade or more teachers). This would enable us to determine whether either new policy option was effective, and which offered the best value for money.

In other cases, we might be interested in answering more subtle questions about a particular policy, such as which minor variation in delivery leads to the best outcomes. For example, imagine that we are planning on making some changes to the food served in school canteens. We might already be bringing in some healthier food options with the intention of improving children's diets. However, we know that presentation matters, and we aren't sure how best to lay out the food options to encourage the healthiest eating. We might run a multi-arm trial, varying the way in which the food is laid out (the order of salads and hot foods, the size of the ladles and plates, etc). Opportunities to fine tune policies often arise when we are about to make changes - it is an ideal time to test out a few minor variations to ensure that the changes we finally institute are made to best effect.

Figure 6. The design of a hypothetical multi-arm RCT testing whether upgrading schools' IT facilities (intervention 1) or employing more teachers (intervention 2) improves schools' academic performance.

Step 2: Define the outcome that the policy is intended to influence and how it will be measured in the trial

It is critical in any trial that we define at the outset exactly what outcome we are trying to achieve and how we will measure it. For example, in the context of educational policy, an outcome measure might be examination results. For policies related to domestic energy efficiency, an outcome measure could be household energy consumption.

It is important to be specific about how and when the outcomes will be measured at the design stage of the trial, and to stick with these pre-specified outcomes at the analysis stage. It is also critical to ensure that the way outcomes are measured for all the groups is exactly the same - both in terms of the process of measurement and the standards applied.

Pre-specifying outcome measures does not just make good practical sense. There are also good scientific reasons why it is crucial to the success of a well-run RCT. This is because, over the course of time, there will always be random fluctuations in routinely collected data. At the end of the trial, there may be a lot of data on a lot of different things, and when there is so much data, it is inevitable that some numbers will improve - or worsen - simply through random variation over time. Whenever such random variation occurs, it may be tempting to pick out some numbers that have improved, simply by chance, and view those as evidence of success. However, doing this breaks the assumptions of the statistical tests used to analyse data, because we are giving ourselves too many chances to find a positive result. The temptation to over-interpret data, and ascribe meaning to random variation, is avoided by pre-specifying outcomes. Statistical tests can then be meaningfully used to analyse how much of the variation is simply due to chance.
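A small simulation makes this point vivid. The illustrative Python sketch below (not from the paper; all numbers are invented) generates trials in which the intervention has no effect at all, yet measures 20 different outcomes per trial; most such trials still yield at least one "significant" result purely by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, n_outcomes, group_size = 1000, 20, 100

false_positive_trials = 0
for _ in range(n_trials):
    # Both groups are drawn from the SAME distribution: no real effect exists.
    control = rng.normal(size=(n_outcomes, group_size))
    intervention = rng.normal(size=(n_outcomes, group_size))
    pvals = [stats.ttest_ind(c, i).pvalue for c, i in zip(control, intervention)]
    if min(pvals) < 0.05:
        false_positive_trials += 1

# With 20 outcomes, roughly 1 - 0.95**20, about 64%, of trials show at least
# one spuriously "significant" result; pre-specifying one outcome avoids this.
print(f"Trials with a false positive: {false_positive_trials / n_trials:.0%}")
```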


Box 10: Taking advantage of natural opportunities for RCTs

Sometimes constraints on policy delivery provide the ideal context for a policy trial. For example, financial constraints and/or practicalities may mean that a staggered roll-out is the preferred option. As long as there is a facility to monitor outcomes in all the areas which will eventually receive the policy intervention, and there is a willingness to randomly decide which area goes first, a staggered policy roll-out can be exploited to run a 'stepped-wedge' design trial.

For example, the probation service in the Durham area wanted to test out a new approach to delivering the probation service. Resource constraints precluded all 6 probation centres receiving the new guidance and training at the same time. The fairest, and scientifically most robust, approach was to randomly assign the 6 centres to positions in a waiting list. All centres eventually received the training, but because random allocation rather than administrative convenience determined when each centre received the training, a robust evaluation of the effects of the new service on reoffending rates could be conducted.23
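In code, a stepped-wedge allocation like the Durham one reduces to a random ordering of the units onto a waiting list. The sketch below is illustrative only; the centre names and seed are invented.

```python
import random

def stepped_wedge_order(units, seed=2012):
    """Randomly order units onto a waiting list for a staggered roll-out.

    Every unit eventually receives the intervention; random ordering
    (rather than administrative convenience) is what makes the
    evaluation robust.
    """
    rng = random.Random(seed)
    return rng.sample(list(units), k=len(units))

# Illustrative use: six probation centres assigned to roll-out positions.
centres = [f"centre_{c}" for c in "ABCDEF"]
for position, centre in enumerate(stepped_wedge_order(centres), start=1):
    print(f"Roll-out slot {position}: {centre}")
```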

When deciding on an outcome measure, it is also important to identify an outcome that you really care about, or as close as you can get to it, rather than a procedural measure that is halfway there. For example, in a trial to see whether probation officers referring to alcohol services can reduce re-offending, you might measure: alcohol service referrals, alcohol service attendances, alcohol intake by questionnaire, or re-offending. In this case re-offending is the outcome we care about the most, but that data might be harder to collect and any benefit on offending might take years to become apparent. Because of this, you could consider measuring alcohol service attendance, as a "surrogate outcome" for the real outcome of offending behaviour. Alternatively, you might measure both: service attendance, to give interim findings; and then long-term follow-up results on offending 24 months later. "Referrals by probation officers" would be the easiest thing to measure, and although immediate, it is not ultimately very informative if re-offending is what we really care about. See Box 11 for an example.

The question of which outcome measure to use often benefits from collaborative discussion between academics (who know what would work best technically in a trial) and policymakers (who know what kind of data is conveniently available, and what it might cost to collect).

Box 11: The case for (and against) using surrogate outcomes

A surrogate outcome is one which is a proxy for the true outcome of interest; for example, reconviction rates are used as a surrogate for reoffending rates, because they are far easier to measure (as people might never be caught for the crimes they commit). The case for using a surrogate outcome is strongest where there is good evidence that it is a strong predictor of the ultimate outcome of interest. Unfortunately, using self-reported measures of behaviour change, while easy to measure, can be a poor index of actual behavioural change. Due to "social desirability" biases, people may be motivated to over-report, for example, the amount of exercise they do, after they have taken part in a "get fit" programme.

If surrogate outcomes are needed because the final outcomes are very long term, it is always worthwhile following up these long term outcomes to verify the interim results. There are numerous cases in clinical medicine where initial trials using surrogate outcomes were misleading. For example, offering patients with osteoporosis fluoride treatment was thought to be effective as it led to increased bone density. As one of the key clinical indicators of osteoporosis, bone density was judged an appropriate surrogate outcome. However, it has been demonstrated that fluoride treatment in fact leads to an increase in some types of fractures, the ultimate outcome osteoporotic patients are keen to avoid.24

Step 3: Decide on the randomisation unit

After deciding what outcome we are going to measure (Step 2), we need to decide who or what we are going to randomise. This is known as the randomisation unit.

The randomisation unit is most often individual people, for example when individuals are randomly assigned to receive one of two medical treatments, or one of two educational programmes. However, the randomisation unit can also be a group of people centred around an institution, especially if the intervention is something that is best delivered to a group. For example, whole schools might be randomly assigned to deliver a new teaching method, or the current one; whole Job Centres might be randomly assigned to offer a new training programme, or the current one. Lastly, the randomisation unit could be a whole geographical area: for example, local authorities might be randomly assigned to deliver one of two new health prevention programmes or different methods of waste recycling (see Box 12).

At the end of the trial, outcomes can be measured in individuals, or for the whole randomisation unit, depending on what is practical, and most accurate. For example, although whole classes might be randomly assigned to receive different teaching methods, the learning outcomes of individual students can be assessed when calculating the results, for greater accuracy.

The question as to whether the randomisation unit should be individuals, institutions or areas will usually depend upon practical considerations. In clinical trials, for example, it will usually be possible to give different individuals either a placebo or the drug which is being tested. But in public policy trials, it may not always be possible to do so.

Box 12. Capitalising on local variations in policy

Local authorities are well placed to test new policies in the field. By collaborating with other local authorities to trial different policies, or by randomly assigning different streets or regions to different interventions, local authorities can use the RCT methodology to determine what works.

An example of this approach is the trial conducted by the local authority of North Trafford to compare different methods of promoting waste recycling. The randomisation unit in this trial was "whole streets". Half of the streets in one part of the local authority were randomly assigned to be canvassed to encourage them to recycle their waste. Recycling rates were higher in this group, compared with the almost 3000 households who did not receive the canvassing. The increase over the short term was 5%, and the academic partners judged that the canvassing campaign cost around £24 for every additional household that started recycling.25

Based on this information, the local authority was then in a position to determine whether the reduced landfill costs associated with the canvassing campaign could justify the costs of offering it more widely.

Below we consider two examples of different ways in which the Behavioural Insights Team has decided upon what the randomisation unit should be:

- Individual: When considering different messages in tax letters, it is obviously possible to send different letters out to different individuals, so the randomisation unit was individual tax debtors.

- Institution: When running a trial on supporting people to get into work in Job Centres, it is not possible to randomly assign different interventions to different job seekers, so the randomisation unit will be Job Centre teams (i.e. teams of the advisors who help the job seekers).

As with other steps, it will be useful to discuss the randomisation unit with an academic advisor. It will also be important to consider how the decision to choose a particular unit interacts with other considerations. Most importantly, it will affect how many people will need to be involved in the trial: having institutions or areas as your unit of study will nearly always mean that a larger sample of individuals is required, and special methods of analysis are also needed.

There can also be other considerations: for example, in an evaluation of attendance incentives for adult education classes, the researchers chose to randomise whole classes, even though it would have been possible to randomise individual attendees. This was to avoid resentful demoralisation among those in the group without incentives, who would see that other learners in their class were receiving an incentive and they were not. This may have negatively affected their attendance rate, and we might have seen an effect due to this problem rather than due to the incentive.

Box 13. When the randomisationunit should be groups rather thanindividualsWorms like hookworm infect almostone quarter of the worldʼs population,mostly in developing countries. It is acommon cause of school absence,and US researchers collaboratedwith the US Ministry of Health todetermine whether offering childrendeworming treatment would reduceschool absenteeism.An RCT was conducted in whichentire schools either received massdeworming treatment or continued asusual without it. In this case,individual randomisation would havebeen inappropriate – if randomisationhad occurred within schools, suchthat some pupils were dewormedand others were not, the likelihoodthat the control participants wouldcontract an infection may have beenartificially reduced by the fact theirpeers were worm-free.Seventy-five primary schools in ruralKenya took part in the study, whichdemonstrated that the dewormingprogramme reduced absenteeism byone quarter.26 Increases in schoolattendance were particularly markedin the youngest children. This studydemonstrated that an additional yearof school attendance could beachieved by deworming at cost of$3.50 per student, representing ahighly cost-effective method toincrease school participation (otherprogrammes, such as free schooluniforms, cost over $100 per studentto deliver similar effects).3



In addition, it is crucially important that individuals are recruited to the study before randomisation takes place; otherwise the trial ceases to be robust. For example, if the people running a trial know which group a potential participant would be allocated to before that participant is formally recruited into the study, this may affect the decision to recruit them at all. A researcher or frontline staff member who believes passionately in the new intervention may choose, perhaps unconsciously, not to recruit participants who they believe are "no hopers" into the new intervention group. This would mean that the participants in each "random" group were no longer representative. This kind of problem can be avoided simply by ensuring that participants are formally recruited into the trial first, and randomised afterwards.

Step 4: Determine how many units are required for robust results

To draw policy conclusions from an RCT, the trial must be conducted with a sufficient sample size. If the sample size is large enough, we can be confident that any difference we observe between the groups is unlikely to be due to chance.

If we have decided that the randomisation unit will be institutions or areas, it is very likely that we will need a larger number of people in the trial than if we had decided to randomise by individual. Simple preliminary "power calculations" will help determine how many units (individuals, institutions, etc.) should be included in the policy intervention and control groups. We recommend working with academics who have experience of RCTs to ensure this key technical calculation is done correctly.

If your policy intervention delivers a huge benefit (a large "effect size"), you will be able to detect this using a trial with a relatively small sample. Detecting more subtle differences (small effect sizes) between interventions will require larger numbers of participants, so it is important from the outset not to be overly optimistic about the likely success of an intervention. Many interventions, if not most, have relatively small effects.

As an example of how many participants are needed for a trial: if we randomly allocated 800 people into two groups of 400 each, this would give us about an 8 in 10 chance of detecting a difference of 10 percentage points, if such a difference existed.

For example, imagine that the government wants to encourage people to vote, and wants to test the effectiveness of sending text messages to registered voters on the morning of an election to remind them. It chooses 800 voters to observe: 400 in the control group, who will receive no extra reminder, and 400 in the treatment group, who will receive text messages. If turnout is 50% in the control group, a sample of this size gives us an 80% chance of detecting a change from 50% to 60% (a 10 percentage point change). If we wanted to detect a smaller difference, we would need a larger sample.
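The voting example above can be reproduced with standard statistical software. Below is a minimal sketch in Python, assuming the statsmodels library is available; the inputs (a 50% baseline, a 60% target, 80% power, and the conventional 5% significance level) come from the text, while everything else is illustrative.

```python
# Minimal sketch: sample size per group needed to detect a change in a
# proportion from 50% to 60% with 80% power, using statsmodels.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.60, 0.50)  # Cohen's h for 50% -> 60%

n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    power=0.80,
    alpha=0.05,
    alternative="two-sided",
)
print(round(n_per_group))  # ~388 per group, close to the 400 used in the example
```

A calculation like this is only as good as its inputs, which is why we recommend agreeing the expected effect size and outcome measure with an experienced trialist before the trial begins.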



Some consideration should be given to how much it costs to recruit each additional person, and to the impact (effect size and potential cost savings) of the intervention being measured. Sometimes detecting even a modest difference is very useful, particularly if the intervention itself costs little or nothing. For example, if we are changing the style or content of a letter to encourage the prompt payment of tax, the additional cost is very small: postage costs are incurred anyway, and we are already collecting the outcome data (in this case, payment dates). In contrast, if we wanted to increase the proportion of people on Jobseeker's Allowance getting a full-time job by giving them one-to-one job-related counselling, this is relatively expensive, and we would hope to see a commensurately larger effect for a trial to be worthwhile. However, even for expensive interventions, if hypothesised impacts are small in terms of effect size but potentially large in terms of savings (e.g. reductions in the number of people claiming benefits), there may be a strong case for conducting an RCT.

Step 5: Assign each unit to one of the policy interventions, using a robust randomisation method

Random allocation of the units of study into policy intervention and control groups is the key step that makes the RCT superior to other types of policy evaluation: it enables us to be confident that the policy intervention group and control group are equivalent with respect to all key factors. In the context of education policy, for example, these might include socioeconomic status, gender, and previous educational attainment.

There are various ways that bias can creep in during the randomisation process, so it is important to ensure this step is done correctly from the outset, to avoid problems further down the line.

There is a lot of evidence that people who have a vested interest in a study may try to allocate people in a non-random manner, albeit unconsciously. For example, if a trial is allocating people to a "back-to-work" intervention on the basis of their National Insurance numbers, with odd numbers receiving the new intervention, then the person recruiting participants may consciously or unconsciously exclude certain people with an odd NI number from the trial altogether, if they suspect those people will not do well, in their desire to make the new intervention look good. This introduces bias into the trial, so the method of randomisation must be resistant to such interference.

There are many independent organisations, such as clinical trials units, who can help to set up a secure randomisation service to avoid this problem of "poor allocation concealment". Typically this will involve a random number generator that determines which group a participant is allocated to, and only after they have been formally recruited into the trial (for the reasons described above).
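As a concrete illustration of allocation concealment, the sketch below assigns participants only after they appear on the recruited list, using an unpredictable random source so that no one can anticipate the next assignment. The function and identifiers are hypothetical, not part of any particular randomisation service.

```python
# Minimal sketch: concealed random allocation, performed only after
# formal recruitment. The `secrets` module is cryptographically strong,
# so recruiters cannot predict the next assignment.
import secrets

def allocate(recruited_ids):
    """Assign each formally recruited participant to a trial arm."""
    return {pid: secrets.choice(["intervention", "control"])
            for pid in recruited_ids}

recruited = ["P001", "P002", "P003", "P004"]  # illustrative participant IDs
print(allocate(recruited))
```

A production randomisation service would typically add block randomisation to keep group sizes equal, but the principle is the same: recruitment first, allocation second, and no way for those recruiting to foresee the outcome.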



At the time of randomisation, if it is felt to be important, steps can also be taken to ensure that the groups are evenly balanced with respect to various characteristics: for example, to make sure that there is roughly the same age and sex distribution in each group. This is particularly important in smaller trials, as these have less power.
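Where such balance matters, one common approach is stratified (blocked) randomisation: participants are grouped by the characteristics to be balanced, then allocated at random within each group. The sketch below is illustrative only; the age bands and field names are assumptions, not part of the guidance.

```python
# Minimal sketch: stratified randomisation to balance age band and sex.
# A fixed seed is used here only to make the example reproducible.
import random
from collections import defaultdict

def stratified_allocate(participants, seed=2012):
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in participants:                       # group by characteristics
        strata[(p["age_band"], p["sex"])].append(p)
    allocation = {}
    for members in strata.values():              # randomise within each stratum
        rng.shuffle(members)
        for i, p in enumerate(members):
            allocation[p["id"]] = "intervention" if i % 2 == 0 else "control"
    return allocation

people = [
    {"id": "P1", "age_band": "18-34", "sex": "F"},
    {"id": "P2", "age_band": "18-34", "sex": "F"},
    {"id": "P3", "age_band": "35-64", "sex": "M"},
    {"id": "P4", "age_band": "35-64", "sex": "M"},
]
print(stratified_allocate(people))
```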

Step 6: Introduce the policy interventions to the assigned groups

Once individuals, institutions or geographical areas have been randomly allocated to either a treatment group or a control group, it is time to introduce the policy intervention.

This might involve, for example, introducing a new type of education policy to a group of schools, while not making the corresponding changes elsewhere. When the Behavioural Insights Team ran a trial looking at whether text messages might improve people's propensity to pay their court fines, for example, individuals in the intervention groups received one of several different types of text message, whereas those in the control group received no text.

One important consideration at this stage is to have a system in place for monitoring the intervention, to ensure that it is being introduced in the way that was originally intended. In the text message example, for instance, it was useful to check that the right texts were going to the right people. Using a process evaluation to monitor that the intervention is introduced as intended will ensure that results are as meaningful as possible and that early hiccups can be rectified.
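A monitoring system need not be elaborate. The sketch below shows one simple form it could take: cross-checking the allocation table against a delivery log so that mismatches surface early. The data structures are invented for illustration.

```python
# Minimal sketch: checking that the intervention reached the right people.
allocation = {"P1": "text_A", "P2": "control", "P3": "text_B"}   # assigned
sent_log = {"P1": "text_A", "P3": "text_A"}                      # delivered

for pid, assigned in allocation.items():
    delivered = sent_log.get(pid)
    if assigned == "control":
        if delivered is not None:
            print(f"{pid}: control participant wrongly received {delivered}")
    elif delivered != assigned:
        print(f"{pid}: assigned {assigned}, but delivery log shows {delivered}")
```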


Box 15. Building in variations to enable testing

Testing involves comparing the effect of one intervention (e.g. a possible new policy) against another (e.g. the present policy). A sound test obviously requires that variations on the policy (e.g. new and present) can be delivered simultaneously. In some cases this is quite simple: some schools might continue to serve the usual school meals, while others could provide meals adhering to new nutritional standards, and the effect on classroom behaviour could be measured. In other cases, the systems in place may make it difficult to offer different policy variations at the same time.

For example, although a local authority may wish to test the effectiveness of simplifying a claim form, its letter systems may be outsourced, and/or incapable of printing more than one letter template. For this reason, we strongly suggest that when developing new systems or procuring new contracts from service providers, policy makers ensure that they will be able to deliver policy variations in the future. Although this may come at a slight upfront cost, the ability to test different versions of policies in the future is likely to more than justify it. With precisely this in mind, DWP legislation specifically allows the IT systems which deliver Universal Credit to include the facility to provide variations, ensuring that the department is capable of testing to find out what works, and of adapting its services to reflect this.


As with other steps, however, it will be important to ensure that the trial is evaluated in a way that reflects how the intervention is likely to be rolled out if and when it is scaled up. For example, in the text message trial, it emerged that we did not always have the correct mobile numbers for everyone in the groups. It would have been tempting to spend additional time and money checking and chasing these telephone numbers, but doing so would not have reflected how the intervention would be introduced at scale, and it would therefore have made the results appear more successful than they would be in "real life".



Learn

Step 7: Measure the results and determine the impact of the policy interventions

Once the intervention has been introduced, we need to measure outcomes. The timing and method of outcome assessment should have been decided before randomisation, and will depend upon how quickly we think the intervention will work, which differs for each intervention. A trial of different letters encouraging people to pay their fines may only need several weeks' follow-up, whilst a curriculum intervention may need a school term or even several years.

In addition to the main outcome, it may be useful to collect process measures. For example, in a study of differing probation services, one might collect data on referrals to different agencies to help explain the results. In this instance, a reduction in reoffending might be accompanied by a corresponding increase in referrals to anger management classes, which might help explain the results. These secondary findings cannot be interpreted with the same certainty as the main results of the trial, but they can be used to develop new hypotheses for further trials (see Box 16).

In addition, many trials involve the collection of qualitative data to help explain the findings, support future implementation, and act as a guide for further research or for improving the intervention. This is not necessary, but if qualitative research is planned anyway, it is ideal to conduct it with the same participants as those in the trial, since more information will then be available.
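When the outcome is binary (a fine paid or not, a vote cast or not), the headline analysis can be as simple as a two-proportion test. The sketch below assumes the statsmodels library is available; the counts are invented for illustration.

```python
# Minimal sketch: comparing payment rates between two trial arms.
from statsmodels.stats.proportion import proportions_ztest

paid = [248, 201]   # number who paid: intervention, control (illustrative)
nobs = [400, 400]   # participants in each arm

z_stat, p_value = proportions_ztest(paid, nobs)
diff = paid[0] / nobs[0] - paid[1] / nobs[1]
print(f"difference: {diff * 100:.1f} percentage points, p = {p_value:.3f}")
```

In practice the analysis plan, including any adjustment for clustering or baseline covariates, should be fixed in the protocol before the data are seen.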


Box 16. Smarter use of data

We are often interested in whether a policy intervention is broadly effective for a representative sample of the general population. In some cases, however, we might want to find out whether some groups (e.g. men and women, young and elderly people) respond differently from others. It is important to decide at the outset whether we are interested in segmenting the sample in this way: if we do so after the data has been collected, sub-group analyses run a high risk of lacking both statistical power and validity. However, should a sub-group trend arise unexpectedly (e.g. men might seem more responsive than women to text message reminders to attend GP appointments), we might consider conducting a future trial to find out whether this is a robust result. It is usually worthwhile collecting additional data (e.g. age, gender) that will help you to segment your sample and inform future research.

Sometimes an unanticipated trend may emerge from your trial data. For example, you might notice large fluctuations over time in the effectiveness of an incentive to take up loft insulation, and discover that this relates to temperature variations. This trend might suggest that people are more receptive to messages about home insulation when the weather is cold. As it comes from an unplanned analysis, the result cannot be considered definitive; however, no information should be wasted, and it could be valuable in informing future research.


Adapt

Step 8: Adapt your policy intervention to reflect your findings

Implementing positive results is often easier than convincing people to stop policies that have been demonstrated to be ineffective. Any trial that is conducted, completed, and analysed should be deemed successful: an RCT that shows no effect, or a harmful effect, from the new policy is just as valuable as one that shows a benefit.

The DWP trial of support for people receiving sickness benefit was a "null" study, in that it did not demonstrate effectiveness (see Box 2). However, if we can be confident that this was a fair test of whether the intervention works (which should be established before a trial commences), and that the sample size was large enough to detect any benefit of interest (which, again, should be established before commencing), then we have learnt useful information from the trial.

Where interventions have been shown to be ineffective, "rational disinvestment" can be considered, and the money saved can be spent elsewhere, on interventions that are effective. Such results should also act as catalysts to find other interventions that work: for example, other ways of helping people on sickness benefits.

When any RCT of a policy is completed, it is good practice to publish the findings, with full information about the methods of the trial so that others can assess whether it was a "fair test" of the intervention. It is also important to include a full description of the intervention and the participants, so that others can implement the programme with confidence in other areas if they wish to.

A useful document that can guide the writing of an RCT report is the CONSORT statement27, which is used in medical trials and, increasingly, in non-medical trials. Following the CONSORT guidance will ensure that the key parts of the trial and the interventions are described accurately enough to allow reproduction of the trial, or implementation of the intervention in a different area.

Ideally, the protocol of the trial should be published before the trial commences, so that people can offer criticisms or improvements before the trial is running. Publishing the protocol also makes it clear that the main outcome reported in the results was indeed the outcome chosen before the trial began.



Step 9: Return to Step 1 to continually improve your understanding of what works

Rather than seeing an RCT as a tool to evaluate a single programme at a given point in time, it is useful to think of RCTs as part of a continual process of policy innovation and improvement. Replication of the results of a trial is particularly important if the intervention is to be offered to a different population segment from the one involved in the original RCT. It is also useful to build on trial findings to identify new ways of improving outcomes. This will be particularly pertinent when RCTs are used to identify which aspects of a policy are having the greatest impact. In recent work with HMRC, for example, the Behavioural Insights Team has been attempting to understand which messages are most effective at helping people to comply with the tax system.

Several early lessons have been learnt about what works best: for example, keeping forms and letters as simple as possible, and informing debtors that most others in their area have already paid their tax.

However, rather than banking these lessons and assuming that perfection has been achieved, it is more useful to think of the potential for further refinement: are there, for example, other ways to simplify forms and make it easier for taxpayers to comply, or other messages that might resonate with different types of taxpayer?

The same type of thinking can apply to all areas of policy, from improving examination results to helping people into sustainable employment. Continual improvement, in this sense, is the final, but arguably most important, aspect of the 'test, learn, adapt' methodology, as it assumes that we never know as much as we could about any given area of policy.


Box 17. Reducing patient mortality in nursing homes

Flu vaccinations are routinely offered to at-risk groups, including the elderly, as the flu season approaches. In nursing homes, however, the flu virus is likely to be introduced into the home via staff. An RCT was conducted in 2003 to determine whether a drive to vaccinate staff would (a) increase staff vaccination rates, and (b) have positive effects on patient health.

Over 40 nursing homes were randomly allocated either to continue as usual (without a staff vaccination drive) or to put in place a campaign to raise staff awareness of flu vaccines and offer appointments for inoculation. Over two flu seasons, staff uptake of vaccines was, perhaps unsurprisingly, significantly higher in nursing homes that instituted the campaign. Most importantly, the all-cause mortality of residents was also lower, with five fewer deaths for every 100 residents.28 This research contributed to a national recommendation to vaccinate staff in care home settings, and is cited as part of the justification for continued recommendations to vaccinate healthcare workers internationally.


References

1. The first published RCT in medicine is credited to Sir A. Bradford Hill, an epidemiologist at England's Medical Research Council. The trial, published in the British Medical Journal in 1948, tested whether streptomycin is effective in treating tuberculosis.

2. Banerjee, A. & Duflo, E. (2011). Poor economics: A radical rethinking of the way to fight global poverty. PublicAffairs: New York.

3. Karlan, D. & Appel, J. (2011). More than good intentions: How a new economics is helping to solve global poverty. Dutton: New York.

4. Shepherd, J. (2007). The production and management of evidence for public service reform. Evidence and Policy, 3(2), 231-251.

5. Department for Work and Pensions, Research Report 342, 2006. Impacts of the Job Retention and Rehabilitation Pilot. http://research.dwp.gov.uk/asd/asd5/rports2005-2006/rrep342.pdf

6. This is an interaction design, which allows the determination of the separate and combined effects of two interventions. Such designs are especially useful where questions exist about the additional effect of one or more features of a complex programme.

7. Department for Work and Pensions, Research Report 382, 2006. Jobseekers Allowance intervention pilots quantitative evaluation. http://research.dwp.gov.uk/asd/asd5/rports2005-2006/rrep382.pdf

8. Harford, T. (2011). Adapt: Why success always starts with failure. Little, Brown: London.

9. Taleb, N. N. (2007). The Black Swan: The impact of the highly improbable. Allen Lane: London.

10. Christensen, C. (2003). The Innovator's Dilemma: The revolutionary book that will change the way you do business. HarperBusiness: New York.

11. Luca, M. (2011). Reviews, reputation, and revenue: The case of Yelp.com. Harvard Business School Working Paper, No. 12-016.

12. Banerjee, A. V., Cole, S., Duflo, E. & Linden, L. (2007). Remedying education: Evidence from two randomised experiments in India. Quarterly Journal of Economics, 122(3), 1235-1264.

13. Davenport, T. H., & Harris, J. G. (2007). Competing on analytics: The new science of winning. Harvard Business School Press: Boston.

14. Delta Airlines Magazine (2007), 0915, 22.

15. Edwards, P. et al. (2005). Final results of MRC CRASH, a randomised placebo-controlled trial of intravenous corticosteroid in adults with head injury: outcomes at 6 months. Lancet, 365, 1957-1959.

16. This can be estimated formally: by comparing the cost of a trial, for example, against an estimate of the money that would be wasted if the intervention was implemented but had no benefit.

17. Finckenauer, J. O. (1982). Scared Straight and the Panacea Phenomenon. Englewood Cliffs, NJ: Prentice-Hall.

18. Petrosino, A., Turpin-Petrosino, C., & Buehler, J. (2003). Scared Straight and other juvenile awareness programs for preventing juvenile delinquency. Campbell Review Update I. The Campbell Collaboration Reviews of Intervention and Policy Evaluations (C2-RIPE). Philadelphia, Pennsylvania: Campbell Collaboration.

19. The Social Research Unit (2012). Youth justice: Cost and benefits. Investing in Children, 2.1 (April). Dartington: The Social Research Unit. Retrieved from http://www.dartington.org.uk/investinginchildren

20. Brooks, G., Burton, M., Cole, P., Miles, J., Torgerson, C. & Torgerson, D. (2008). Randomised controlled trial of incentives to improve attendance at adult literacy classes. Oxford Review of Education, 34(5), 493-504.

21. For a summary of the US research on the Family Nurse Partnership, see: MacMillan, H. L. et al. (2009). Interventions to prevent child maltreatment and associated impairment. Lancet, 373(9659), 250-266.

22. http://www.campbellcollaboration.org/library.php

23. Final results yet to be published. For details of the study design, see: Pearson, D., Torgerson, D., McDougall, C., & Bowles, R. (2010). A parable of two agencies, one of which randomises. Annals of the American Academy of Political & Social Science, 628, 11-29.

24. Riggs, B. L., Hodgson, S. F., O'Fallon, W. M. et al. (1990). Effect of fluoride treatment on the fracture rate in postmenopausal women with osteoporosis. New England Journal of Medicine, 322, 802-809; Rothwell, P. M. (2005). External validity of randomised controlled trials: "To whom do the results of this trial apply?" Lancet, 365, 82-93.

25. Cotterill, S., John, P., Liu, H., & Nomura, H. (2009). How to get those recycling boxes out: A randomised controlled trial of a door to door recycling service; John, P., Cotterill, S., Richardson, L., Moseley, A., Smith, G., Stoker, G., & Wales, C. (2011). Nudge, nudge, think, think: Using experiments to change civic behaviour. London: Bloomsbury Academic.

26. Miguel, E. & Kremer, M. (2004). Worms: Identifying impacts on education and health in the presence of treatment externalities. Econometrica, 72, 159-217.

27. http://www.consort-statement.org/consort-statement

28. Hayward, A. et al. (2006). Effectiveness of an influenza vaccine programme for care home staff to prevent death, morbidity and health service use among residents: cluster randomised controlled trial. British Medical Journal, 333(7581), 1241-1247.



Published by the Cabinet Office Behavioural Insights Team

Publication date: June 2012

© Crown copyright June 2012

You may reuse this information (not including logos) free of charge in any format or medium, under the terms of the Open Government Licence.

To view this licence, visit www.nationalarchives.gov.uk/doc/open-government-licence/

or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email [email protected]

This document can also be viewed on our website at www.cabinetoffice.gov.uk

