
Cancer Epidemiology 38 (2014) 193–199

Determining disease prevalence from incidence and survival using simulation techniques

Simon Crouch *, Alex Smith, Dan Painter, Jinlei Li, Eve Roman

Epidemiology & Cancer Statistics Group, Department of Health Sciences, University of York, YO10 5DD, UK

ARTICLE INFO

Article history:
Received 21 August 2013
Received in revised form 14 January 2014
Accepted 17 February 2014
Available online 18 March 2014

Keywords: Prevalence; Incidence; Survival; Simulation; Monte-Carlo; Prevalence distribution

ABSTRACT

Objectives: We present a new method for determining prevalence estimates, together with estimates of their precision, from incidence and survival data using Monte-Carlo simulation techniques. The algorithm also provides for the incidence process to be marked with the values of subject level covariates, facilitating calculation of the distribution of these variables in prevalent cases.

Methods: Disease incidence is modelled as a marked stochastic process and simulations are made from this process. For each simulated incident case, the probability of remaining in the prevalent sub-population is calculated from bootstrapped survival curves. This algorithm is used to determine the distribution of prevalence estimates and of the ancillary data associated with the marks of the incidence process. This is then used to determine prevalence estimates and estimates of the precision of these estimates, together with estimates of the distribution of ancillary variables in the prevalent sub-population. This technique is illustrated by determining the prevalence of acute myeloid leukaemia from data held in the Haematological Malignancy Research Network (HMRN). In addition, the precision of these estimates is determined and the age distribution of prevalent cases diagnosed within twenty years of the prevalence index date is calculated.

Conclusion: Determining prevalence estimates by using Monte-Carlo simulation techniques provides a means of calculation more flexible than traditional techniques. In addition to automatically providing precision estimates for the prevalence estimates, the distribution of any measured subject level variables can be calculated for the prevalent sub-population. Temporal changes in incidence and in survival offer no difficulties for the method.

© 2014 Elsevier Ltd. All rights reserved.

Contents lists available at ScienceDirect

Cancer Epidemiology: The International Journal of Cancer Epidemiology, Detection, and Prevention

journal homepage: www.cancerepidemiology.net

1. Introduction

Estimation of disease prevalence is of fundamental interest in epidemiology [1]. As observed by Gigli et al. [2], three main methods of estimation of prevalence are commonly employed: cross-sectional population survey; direct count of cases in a disease register; and mathematical modelling based on incidence and survival rates. Gigli et al. [2] illustrate a method combining the latter two approaches in three steps: step 1 counts surviving cases at an index date from incident cases in a registry; step 2 estimates the number of prevalent cases lost-to-follow-up from the registry count; and step 3 estimates the number of prevalent cases at the index date that were incident before the start of the registry. Steps 1 and 2 together are often referred to as the “counting method” and step 3 as the “completeness index method”.

* Corresponding author at: Epidemiology & Cancer Statistics Group, Department of Health Sciences, Seebohm-Rowntree Building, University of York, YO10 5DD, UK. Tel.: +44 01904 321938. E-mail address: [email protected] (S. Crouch).

http://dx.doi.org/10.1016/j.canep.2014.02.005

1877-7821/© 2014 Elsevier Ltd. All rights reserved.

Previous approaches to steps 2 and 3 of this schema have focussed on various analytic techniques of estimation [3–6], themselves based on the relationships between the various measurable quantities [7–9], or on direct modelling [10]. These techniques have found wide application in the literature [11–13]. Consideration of the precision of prevalence estimates has focussed on the variation implied by considering the incidence process as Poisson [2,14].

In this paper we will consider techniques of estimation for steps 2 and 3 based entirely on simulation. We will illustrate our techniques using data drawn from a population-based cohort of patients diagnosed with haematological malignancies; in particular we will provide prevalence estimates for acute myeloid leukaemia (AML).

We first define what we mean by “prevalence”. Broadly speaking, the prevalence of a disease in a population is the number or proportion of the population alive at some index date, previously diagnosed with the disease and not removed from the prevalent disease sub-population between diagnosis and the index date (by death or complete cure, for example); this is referred to as “complete prevalence”. We also define “n-year prevalence” and “nth-year prevalence” to refer to those in the prevalent sub-population at the index date having received a diagnosis of the disease in the previous n years or during the nth year before the index date respectively. Therefore n-year prevalence is the sum of kth-year prevalence for k between 1 and n. So, for example, 3-year prevalence refers to all those in the prevalent sub-population on the index date diagnosed in the three years before the index date; 3rd-year prevalence refers to all those in the prevalent population on the index date diagnosed during the third year before the index date (Fig. 1). For simplicity of presentation in this paper, we assume that the only removal mechanism from the prevalent sub-population is death.

Fig. 1. Types of prevalence. Complete prevalence includes diagnosed cases from any time before the index date still in the prevalent population. n-year prevalence includes cases diagnosed within the last n years. nth year prevalence includes cases diagnosed during the single year between (n − 1) and n years before the index date.

Table 1. Characteristics of the AML patient cohort.

| | Incidence dataset | Survival dataset |
|---|---|---|
| Incidence dates | 01/09/2005–31/08/2012 | 01/09/2005–31/08/2011 |
| Number of subjects | 1079 | 934 |
| Male | 592 (55%) | 517 (55%) |
| Female | 487 (45%) | 417 (45%) |
| Age range (years) | 18.7–97.8 | 19.0–97.8 |
| Median age (IQR) | 71.9 (60.0–79.8) | 71.7 (60.0–79.3) |
| Maximum follow-up (days) | N/A | 2720 |
| Median survival (95% CI) | N/A | 132 (109–159) |
| Total follow-up (years) | N/A | 1390 |

IQR, inter quartile range.

The relevance of n-year and nth-year prevalence for different values of n depends upon the use made of the estimates and upon the disease under consideration. n-year and nth-year prevalence estimates for small values of n will typically correspond to periods of intense treatment for acute diseases; for nth-year prevalence, larger values of n will correspond to periods of long term monitoring. For chronic diseases, all values of n are typically of interest.

In this paper we present a new method of determining prevalence based on the computationally intensive method of simulation. The advantages of this new method over existing methodology are that it naturally allows for the estimation of the precision of prevalence estimates and also allows for the estimation of ancillary information about the prevalent sub-population (for example, it allows for the estimation of the age distribution of the prevalent sub-population). In addition, more complex modelling of incidence and survival functions than is usually allowed for in current techniques provides no additional obstacle to simulation techniques.

2. Materials

The determination by simulation of prevalence from incidence and survival estimates derived from a patient cohort is illustrated with data on patients diagnosed with acute myeloid leukaemia (AML) drawn from the UK’s population-based Haematological Malignancy Research Network (HMRN) [15]. Initiated in 2004, and covering a population of 3.6 million, this unique patient cohort was established to provide robust generalizable data to inform clinical practice and research. Comprehensive information about HMRN is available elsewhere [15] but briefly, all patients newly diagnosed with a haematological malignancy residing in the HMRN region (>2000 patients a year) have full treatment, response and outcome data collected to clinical trial standards. HMRN has Section 251 support under the NHS Act 2006, enabling the Health and Social Care Information Centre (HSCIC) to routinely link to and release nationwide information on deaths, subsequent cancer registrations, and Hospital Episode Statistics (HES). Loss-to-follow-up rates are very low in this registry, thanks to this comprehensive data linkage. In fact, for the small number of subjects that are lost-to-follow-up (by emigration from the UK, for example), the actual date of loss is known with precision in this registry. The demographic structure of the region is similar to the demographic structure of the UK as a whole, allowing for reliable generalization from this population to the population of the UK.

Incidence data on patients, 18 years and older, diagnosed with AML were available from the HMRN registry for seven years from 01/09/2005 to 31/08/2012. The index date for the calculations of prevalence was taken to be 31/08/2011 and years are taken to run from the first of September to the thirty-first of August. Survival outcome data were available up until 26/03/2013 for patients diagnosed between 01/09/2005 and 31/08/2011. Characteristics of these patients are shown in Table 1.

2.1. Methods

We can estimate the number of prevalent cases of a disease at a particular index date by combining information on incidence and survival. An incident case at time $t$ before the index date, characterized by a vector of explanatory variables for survival $x$, will contribute $S(x, t)$ to the expected number of prevalent cases at the index date, where $S(x, t)$ is the survival function conditional on explanatory variables $x$. Therefore, if there are $n$ cases incident at times $\{t_1, t_2, \ldots, t_n\}$, each with corresponding survival explanatory variables $\{x_1, x_2, \ldots, x_n\}$, then the expected number of prevalent cases at the index date $T_0$ is given by

$$
P = \sum_{i=1}^{n} S(x_i, T_0 - t_i)
$$

In this paper we take prevalent cases to be those that have ever been diagnosed with the disease under consideration. This can easily be generalized so that prevalence refers to subjects that have not been removed from the prevalent sub-population by other means (such as cure) by taking the end-point for the survival function $S$ to be time to removal from the prevalent population rather than simply time to death. Complete prevalence takes the sum over all time before the index date; this generalizes to estimation of n-year and nth-year prevalence by restricting the sum to cover incident cases from the corresponding time period.
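The sum over incident cases can be illustrated with a minimal Python sketch. The Weibull survival function, its parameter values and the three incident cases below are hypothetical placeholders chosen only to make the example run; they are not estimates from the HMRN data. The optional `window` argument restricts the sum to a period before the index date, which gives n-year or nth-year prevalence.

```python
import math


def weibull_survival(age, elapsed_years, shape=0.5, base_scale=2.0, age_effect=0.03):
    """Hypothetical S(x, t): Weibull survival whose scale shrinks with age x."""
    scale = base_scale * math.exp(-age_effect * (age - 60.0))
    return math.exp(-((elapsed_years / scale) ** shape))


def expected_prevalence(cases, index_time, survival, window=None):
    """Sum S(x_i, T0 - t_i) over incident cases given as (t_i, x_i) pairs.

    window=(lo, hi) keeps only cases with lo <= T0 - t_i < hi:
    (0, n) gives n-year prevalence, (n - 1, n) gives nth-year prevalence.
    """
    total = 0.0
    for t_i, x_i in cases:
        elapsed = index_time - t_i
        if elapsed < 0:
            continue  # diagnosed after the index date, so not prevalent
        if window is not None and not (window[0] <= elapsed < window[1]):
            continue
        total += survival(x_i, elapsed)
    return total


# Illustrative cases: (diagnosis time in calendar years, age at diagnosis).
cases = [(2005.2, 72.0), (2008.7, 55.0), (2010.9, 81.0)]
index_time = 2011.0

complete = expected_prevalence(cases, index_time, weibull_survival)
three_year = expected_prevalence(cases, index_time, weibull_survival, window=(0.0, 3.0))
third_year = expected_prevalence(cases, index_time, weibull_survival, window=(2.0, 3.0))
```

Because each case contributes a probability rather than a 0/1 outcome, this gives the expected number of prevalent cases; the simulation approach described next replaces the probability with a random survival indicator.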

The value of $P$ can be calculated by simulation. If the times and associated survival explanatory variables of incident cases can be appropriately modelled, and if survival conditional on the explanatory variables can be estimated, then simulation from the incidence model, together with the survival function, will provide an estimate of $P$. What is more, sources of variation can be taken into account, so that calculations of $P$ from repeated random draws from the incidence model will provide an estimate of the precision of the estimate of $P$. Additionally, the distribution of survival explanatory variables, and other ancillary data, in the prevalent sub-population can also be estimated by “marking” each simulated incident case with the values of these data so that the distribution of these data corresponds to their distribution in the observed incident cases. The simulation process identifies those simulated incident cases considered to be alive at the index date and the distribution of the variables of interest for such survivors is calculated.

For example, suppose we are interested in calculating the prevalence, and the age and sex distributions of prevalent cases, for a non-communicable disease in which survival depends on age but not sex. Suppose, in addition, that incidence data, by year, for this disease are available from a registry for $k$ years before a fixed index date. Incidence times for such a disease might reasonably be modelled as arising from a Poisson process. If time is broken up by year before the index date then we can simulate the number of incident cases in any particular year by making a single draw from a Poisson distribution with appropriate rate parameter. This rate parameter is itself the value of a random draw from a distribution that represents the uncertainty in the estimation of the overall incidence rate per year. For a Poisson process this distribution is approximately normal with mean given by the observed incidence rate $R$ and variance $R/k$. Suppose that $R_i$ cases are simulated in year $i$; then, as the distribution of incidence times for a Poisson process is the uniform distribution [20], the incident case times can be simulated by making $R_i$ random draws from a uniform distribution over the year. Each simulated incident case can now be “marked” with values for age and sex by making a single random draw from the joint age and sex distribution of the observed cases. (Alternatively, the joint age and sex distribution can be modelled from the observed distribution and a draw made from this model.) As the incidence time of each simulated incident case is known, the probability of survival for this incident case can be calculated from a survival model estimated from the observed registry. The binary survival (yes/no) for each simulated incident case can be simulated as a draw from a Bernoulli distribution with probability equal to the probability of survival. Therefore the survivors from the simulated incident cases can themselves be simulated and their age and sex distributions calculated.
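The worked example in this paragraph can be sketched for a single year as follows. This is a minimal illustration using only the Python standard library, not the authors' R code: `poisson_draw`, `toy_survival`, the survival parameters and the list of observed marks are all hypothetical, and drawing a fresh rate for each year is one possible reading of the text.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw from Poisson(rate) by counting unit-rate exponential arrivals before `rate`."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < rate:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def toy_survival(age, elapsed_years):
    """Hypothetical survival depending on age but not sex (illustrative parameters)."""
    scale = 2.0 * math.exp(-0.03 * (age - 60.0))
    return math.exp(-((elapsed_years / scale) ** 0.5))


def simulate_year(year_before_index, observed_rate, k_years, observed_marks, survival, rng):
    """Simulate incident cases in one year and return (number incident, survivors)."""
    # Uncertainty in the rate: approximately normal with mean R and variance R / k.
    rate = max(0.0, rng.gauss(observed_rate, math.sqrt(observed_rate / k_years)))
    n_cases = poisson_draw(rate, rng)
    survivors = []
    for _ in range(n_cases):
        # Time since diagnosis at the index date, uniform within the chosen year.
        elapsed = year_before_index - rng.random()
        # Mark the case by a draw from the observed joint (age, sex) distribution.
        age, sex = rng.choice(observed_marks)
        # Bernoulli draw with probability equal to the survival probability.
        if rng.random() < survival(age, elapsed):
            survivors.append((age, sex, elapsed))
    return n_cases, survivors


observed_marks = [(45.0, "F"), (63.0, "M"), (71.0, "M"), (78.0, "F"), (84.0, "M")]
n_incident, alive = simulate_year(3, 84.7, 7, observed_marks, toy_survival, random.Random(1))
```

The list `alive` holds the marks of the simulated survivors, so the age and sex distributions of prevalent cases diagnosed in that year can be tabulated directly from it.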

As survival is estimated from a finite quantity of data, the simulation should take into account the variability in the survival function estimate. This can be done by using a bootstrap sampling strategy [21]. In the simple bootstrap a sample of size equal to the number of cases is drawn with replacement and the survival function is calculated for this sample. This is then repeated many times; the variation in the repeated estimates of the survival function captures the uncertainty in the estimate of the survival function. For the prevalence simulation algorithm each simulated incident case is paired at random with one of the bootstrap survival function estimates in order to calculate its probability of survival at the index date.

One particular difficulty in the estimation of prevalence (especially if we are interested in surviving subjects who have ever had the disease) is that long term survival may be poorly estimated. Longer registration periods are clearly most desirable but if these are not available then we suggest an approach to this problem that provides reasonable approximate upper and lower bounds for survival estimates. For the lower bound, survival is modelled from an observed cohort using parametric survival modelling, allowing for extrapolation beyond observed survival. For an upper bound it is assumed that the disease in question has a “cure time” after diagnosis, such that survival characteristics after the cure time revert to the general population survival. It is possible that the extrapolation in the lower bound may be too optimistic, and that a “healthy survivor” effect renders the upper bound too pessimistic, but in the absence of data beyond the registry, this is the best that can be hoped for.

So, if we let $S_O$ denote estimated survival from the cohort and $S_P$ denote population survival then the survival function $S$ for the “cure” model with cure time $T_C$ is

$$
S(x, t) =
\begin{cases}
S_O(x, t) & t < T_C \\
S_O(x, T_C) \times S_P(x', t) & t > T_C
\end{cases}
$$

where $x'$ denotes that the covariates (which may include, for example, age) are evaluated at time $T_C$. The effect of this equation is to “glue” the population survival curve onto the survival curve estimated from the cohort (Fig. 2).

Fig. 2. Modelled survival by age. Blue lines show predicted survival from a Weibull survival model according to the ages shown. Red lines show the reversion to population survival after the cure time. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)
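The gluing of population survival onto cohort survival in the cure model can be sketched as below. Two assumptions are made for the illustration: the population term is read as the probability that a person with covariates evaluated at the cure time survives from the cure time to time t, and both survival functions (a Weibull cohort curve and a Gompertz population curve) use hypothetical parameter values rather than anything fitted to HMRN or life-table data.

```python
import math


def cohort_survival(age, t, shape=0.5, base_scale=2.0, age_effect=0.03):
    """Hypothetical cohort survival S_O(x, t): Weibull with an age effect."""
    scale = base_scale * math.exp(-age_effect * (age - 60.0))
    return math.exp(-((t / scale) ** shape))


def population_survival(age_at_start, duration, a=5e-5, b=0.09):
    """Hypothetical population survival S_P: Gompertz hazard a * exp(b * age).

    Returns the probability of surviving `duration` years from `age_at_start`.
    """
    cumulative_hazard = (a / b) * math.exp(b * age_at_start) * (math.exp(b * duration) - 1.0)
    return math.exp(-cumulative_hazard)


def cure_model_survival(age, t, cure_time):
    """Cohort survival up to the cure time, population survival afterwards."""
    if t <= cure_time:
        return cohort_survival(age, t)
    # Covariates are evaluated at the cure time: the subject is now age + cure_time.
    return cohort_survival(age, cure_time) * population_survival(age + cure_time, t - cure_time)


# Survival of a hypothetical 70-year-old at 10 years under each assumption.
no_cure = cohort_survival(70.0, 10.0)
five_year_cure = cure_model_survival(70.0, 10.0, cure_time=5.0)
```

With these illustrative parameters the population hazard after the cure time is lower than the extrapolated cohort hazard, so the cure model gives the higher survival, in line with its use as an upper bound.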

Table 2 (a). Measured incidence (b), prevalence (b) and model fit (c).

| Quantity | Sex | 2005–2006 | 2006–2007 | 2007–2008 | 2008–2009 | 2009–2010 | 2010–2011 | 2011–2012 |
|---|---|---|---|---|---|---|---|---|
| Incidence (d) | Male | 70 (4.05) | 80 (4.62) | 78 (4.51) | 82 (4.74) | 95 (5.49) | 90 (5.20) | 97 (5.61) |
| Incidence (d) | Female | 61 (3.31) | 76 (4.13) | 66 (3.58) | 56 (3.04) | 70 (3.80) | 77 (4.18) | 81 (4.40) |
| Measured prevalence | Male | 14 (0.81) | 12 (0.69) | 12 (0.69) | 23 (1.33) | 27 (1.56) | 42 (2.43) | N/A |
| Measured prevalence | Female | 8 (0.43) | 16 (0.87) | 11 (0.60) | 14 (0.76) | 17 (0.92) | 33 (1.79) | N/A |
| Predicted (no cure) | Male | 8.4 (0.49) | 11.1 (0.64) | 12.9 (0.75) | 17.0 (0.98) | 26.8 (1.55) | 45.5 (2.63) | N/A |
| Predicted (no cure) | Female | 8.1 (0.44) | 11.6 (0.63) | 11.8 (0.64) | 12.3 (0.67) | 20.5 (1.11) | 39.2 (2.13) | N/A |
| Predicted (5 year cure) | Male | 8.9 (0.51) | 11.1 (0.64) | 12.9 (0.75) | 17.0 (0.98) | 26.8 (1.55) | 45.5 (2.63) | N/A |
| Predicted (5 year cure) | Female | 8.6 (0.47) | 11.6 (0.63) | 11.8 (0.64) | 12.3 (0.67) | 20.5 (1.11) | 39.2 (2.13) | N/A |
| Predicted (3 year cure) | Male | 12.3 (0.71) | 14.3 (0.83) | 14.2 (0.82) | 17.0 (0.98) | 26.8 (1.55) | 45.5 (2.63) | N/A |
| Predicted (3 year cure) | Female | 11.6 (0.63) | 14.6 (0.79) | 12.9 (0.70) | 12.3 (0.67) | 20.5 (1.11) | 39.2 (2.13) | N/A |

a Years run from 1st September to 31st August.
b Figures given are: number observed or predicted (number per 100,000).
c Model fit: no cure model chi-squared = 7.6 on 11 d.f. (p = 0.75); 5 year cure model chi-squared = 7.3 on 11 d.f. (p = 0.78); 3 year cure model chi-squared = 7.1 on 11 d.f. (p = 0.79).
d Incidence cases amongst males = 84.7 per year (= 4.90 per 100,000 per year); females = 69.6 per year (= 3.78 per 100,000 per year). Incidence rates are consistent with a homogeneous Poisson process with these rates (p = 0.34 for males, p = 0.30 for females; p values by simulation).

These two approaches may be considered to give us upper and lower bounds on survival and therefore prevalence estimates.

We note that this simulation approach deals in terms of numbers of cases rather than population proportions. It is straightforward to generalize from numbers estimated from a particular sample to proportions for populations. Results are shown in both forms in Tables 2 and 3.
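The conversion from numbers to proportions is a simple rescaling, sketched below. The population denominators are not reported in the text; here they are inferred from the rounded incidence figures in the footnote to Table 2 (84.7 male cases per year corresponding to 4.90 per 100,000, and 69.6 female cases per year corresponding to 3.78 per 100,000), so they should be read as approximations. The counts 116 and 91 are the 5 year prevalence numbers from Table 3.

```python
def per_100k(count, population):
    """Express a number of cases as a rate per 100,000 of the population."""
    return 100000.0 * count / population


# Denominators implied by the rounded figures in Table 2 (an inference, not reported values).
male_population = 84.7 / 4.90 * 100000.0
female_population = 69.6 / 3.78 * 100000.0

# 5 year prevalence counts taken from Table 3.
male_5yr_rate = per_100k(116, male_population)
female_5yr_rate = per_100k(91, female_population)
```

Under these inferred denominators the two rates come out close to the proportions tabulated for 5 year prevalence in Table 3.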

We summarize this description in the following algorithm and then illustrate the procedure for the cohort of AML patients.

2.2. Algorithm

Preliminary stage (a). Using the registry data, determine an appropriate model for the incidence process for the disease. Estimate the parameters required to specify this model so that simulations of incident case times can be made from it. For example, a Poisson model for the number of yearly incident cases, together with a uniform distribution for the incident case times, models the incidence process as a Poisson process.

Preliminary stage (b). Use survival data from the registry to model disease survival using an appropriate model with explanatory variables suitable for the disease. Calculate survival functions for simple bootstrap samples and store these for use in the prevalence simulation.

Step 1. Divide the time before the index date into $k$ suitable time periods, such as years, counting time backwards from the index date. For each time period, simulate the number and times of incident cases during that time period using the model determined in Preliminary stage (a). For each simulated incident case, simulate the associated survival explanatory variables and other ancillary variables by making a random draw from the observed joint distribution of these variables or from a joint model determined from the observed data.

Step 2. For each simulated incident case $m$, calculate its expected probability of survival $p_m$ at the index date, using a random survival function (with appropriate values of explanatory variables) from the bootstrap distribution of survival curves prepared in Preliminary stage (b).

Step 3. For each simulated incident case $m$, make a random draw from a Bernoulli distribution with probability $p_m$ as parameter. This gives a binary variable $B_m$ with value 1 corresponding to the simulated case $m$ being a prevalent case at the index date and value 0 corresponding to a case that has died before the index date.

Step 4. Sum the values of $B_m$ over all the simulated incident cases $m$ for each of the $k$ time periods to give the values $b_i$ ($1 \le i \le k$). Each of the $b_i$ corresponds to the number of simulated prevalent cases at the index date that were incident during the time period $i$. So if years were chosen as the time periods in step 1, then each $b_n$ estimates nth-year prevalence and the sum of the $b_i$ ($1 \le i \le n$) estimates n-year prevalence.

Step 5. For each incident case $m$ with $B_m = 1$, collect the values of the survival explanatory variables and other ancillary variables. These can either simply be collected variable by variable over all the $k$ time periods or they can be collected variable by variable for each of the $k$ time periods (if estimates of the distributions of these variables are required for each time period before the index date).

Table 3. Predicted prevalence (a).

| Model | Group | 5 year prevalence | 10 year prevalence | 15 year prevalence | 20 year prevalence |
|---|---|---|---|---|---|
| No cure model | Male numbers (b) | 116 (95, 137) | 161 (137, 185) | 188 (162, 215) | 209 (181, 237) |
| No cure model | Male proportions (b) | 6.71 (5.46, 7.95) | 9.30 (7.90, 10.7) | 10.9 (9.36, 12.4) | 12.1 (10.4, 13.7) |
| No cure model | Female numbers | 91 (70, 112) | 128 (104, 152) | 155 (128, 181) | 175 (147, 203) |
| No cure model | Female proportions | 4.94 (3.81, 6.08) | 6.96 (5.66, 8.27) | 8.41 (6.97, 9.84) | 9.52 (8.00, 11.0) |
| 5 year cure model | Male numbers | 116 (95, 137) | 172 (147, 198) | 222 (192, 251) | 266 (234, 298) |
| 5 year cure model | Male proportions | 6.71 (5.49, 7.92) | 9.95 (8.47, 11.4) | 12.8 (11.1, 14.5) | 15.4 (13.5, 17.2) |
| 5 year cure model | Female numbers | 91 (70, 112) | 138 (112, 163) | 183 (156, 211) | 226 (195, 257) |
| 5 year cure model | Female proportions | 4.94 (3.78, 6.10) | 7.47 (6.10, 8.84) | 9.96 (8.47, 11.5) | 12.3 (10.6, 14.0) |
| 3 year cure model | Male numbers | 116 (94, 138) | 188 (160, 215) | 254 (222, 285) | 312 (276, 347) |
| 3 year cure model | Male proportions | 6.71 (5.45, 7.96) | 10.9 (9.26, 12.4) | 14.7 (12.9, 16.5) | 18.0 (16.0, 20.0) |
| 3 year cure model | Female numbers | 91 (68, 114) | 151 (123, 179) | 211 (179, 242) | 265 (231, 299) |
| 3 year cure model | Female proportions | 4.94 (3.72, 6.16) | 8.18 (6.66, 9.70) | 11.4 (9.74, 13.1) | 14.4 (12.5, 16.3) |

a Index date is 31st August 2011.
b Figures given are: estimate (95% confidence interval). Proportions are given as per 100,000 of the population.

Fig. 3. (Upper) Overall survival in the sample. (Lower) Modelled survival for 70-year-old men (blue lines) and women (red lines) compared with observed survival (green solid line, dashed line 95% confidence interval). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)

So far, this algorithm has calculated a single estimate for each of nth-year and n-year prevalence (if the $k$ time periods were chosen as years) and of the distribution of the explanatory and ancillary variables. The algorithm is now repeated N_sim times in order to obtain the sampling distribution of these estimates. For example, in the case of 20-year prevalence $P_{20}$ we would have calculated N_sim estimates of $P_{20}$.

The mean of the distribution of these estimates provides an overall estimate of $P_{20}$ and the 2.5% and 97.5% quantiles of the distribution provide a 95% confidence interval for the estimate of $P_{20}$. Estimates and confidence intervals for estimates of the other quantities calculated by this algorithm can be determined in similar fashion.
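Steps 1 to 5 and the repetition over N_sim replicates can be put together in a compact sketch. It is written in Python with the standard library only and is not the authors' R implementation. The survival model, the age marks and the list `boot_params` are hypothetical: in particular `boot_params` is filled with randomly perturbed parameters as a stand-in for the coefficients that Preliminary stage (b) would obtain from real bootstrap fits.

```python
import math
import random
import statistics


def poisson_draw(rate, rng):
    """Draw from Poisson(rate) by counting unit-rate exponential arrivals before `rate`."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < rate:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def survival(params, age, elapsed):
    """Hypothetical Weibull survival; params = (shape, base_scale, age_effect)."""
    shape, base_scale, age_effect = params
    scale = base_scale * math.exp(-age_effect * (age - 60.0))
    return math.exp(-((elapsed / scale) ** shape))


def one_replicate(k, yearly_rate, n_registry_years, observed_ages, boot_params, rng):
    """Run Steps 1-5 once; b[i - 1] is the simulated ith-year prevalence."""
    b, prevalent_ages = [0] * k, []
    sd = math.sqrt(yearly_rate / n_registry_years)
    for i in range(1, k + 1):
        rate = max(0.0, rng.gauss(yearly_rate, sd))       # uncertainty in the rate
        for _ in range(poisson_draw(rate, rng)):          # Step 1: number of cases
            elapsed = i - rng.random()                    # Step 1: uniform time in year i
            age = rng.choice(observed_ages)               # Step 1: mark the case
            p = survival(rng.choice(boot_params), age, elapsed)  # Step 2: random bootstrap curve
            if rng.random() < p:                          # Step 3: Bernoulli(p) draw
                b[i - 1] += 1                             # Step 4: count by period
                prevalent_ages.append(age)                # Step 5: collect the marks
    return b, prevalent_ages


def simulate_prevalence(n_years, n_sim, **model):
    """Repeat the algorithm n_sim times and summarise n-year prevalence."""
    totals = [sum(one_replicate(n_years, **model)[0]) for _ in range(n_sim)]
    cuts = statistics.quantiles(totals, n=40)  # cuts[0] is the 2.5% point, cuts[-1] the 97.5% point
    return statistics.mean(totals), cuts[0], cuts[-1]


rng = random.Random(2011)
# Stand-in for stored bootstrap fits: perturbed illustrative parameters.
boot_params = [(0.5 + rng.gauss(0, 0.02), 2.0 + rng.gauss(0, 0.1), 0.03) for _ in range(200)]
estimate, lower, upper = simulate_prevalence(
    5, 200, yearly_rate=84.7, n_registry_years=7,
    observed_ages=[35.0, 52.0, 64.0, 71.0, 78.0, 86.0],
    boot_params=boot_params, rng=rng,
)
```

The same loop can return the collected ages for each replicate, from which the age distribution of prevalent cases and its variability can be summarised in the same way as the counts.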

It is clear that this algorithm facilitates step 3 of the prevalence calculation schema of Gigli et al. [2]: estimates of incidence and survival are made from the registry data and these are used to generate prevalence estimates for the cases incident before the start of the registry. What is perhaps less clear is that the algorithm also provides estimates for the years covered by the disease registry that are unbiased with respect to loss to follow-up, provided that survival estimates are based on survival times that are censored at the time of loss to follow-up. Lost-to-follow-up subjects are simulated using an analogue of the above algorithm and the resulting estimates of prevalent lost-to-follow-up cases are added to the observed totals.

2.3. AML example

For the AML data, incidence was modelled as a homogeneous Poisson process, consistent with the observed incidence in the registry (Table 2, note d). Observed incidence rates were calculated for both sexes separately for each of the seven years for which complete data were available and the mean of these yearly rates was taken as the yearly Poisson rate. For each year before the index date, the number of incident cases was simulated by a draw from a normal distribution with mean equal to the Poisson rate and variance given by the Poisson rate divided by seven. The times of these incident cases in each year were then drawn from a uniform distribution, corresponding to the incidence times of a homogeneous Poisson process.

AML cohort survival was modelled using a parametric Weibull model with age and sex as explanatory variables. Coefficients from fitting N_sim bootstrap samples were pre-calculated and stored for the main simulation. Population survival by age and sex was estimated using the LSHTM Tables [16] for the same region as the AML cohort, averaged over the final five years of data (2005–2009).
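The pre-calculation of bootstrap fits can be illustrated with a deliberately simplified sketch. To keep the example self-contained it substitutes a one-parameter exponential survival model, whose maximum likelihood estimate with right censoring is the number of deaths divided by the total follow-up time, for the Weibull model with covariates used in the paper. The survival records are synthetic and the function names are hypothetical.

```python
import random
import statistics


def fit_exponential(records):
    """Constant-hazard estimate from (follow_up_time, died) records: deaths / total follow-up."""
    deaths = sum(1 for _, died in records if died)
    follow_up = sum(time for time, _ in records)
    return deaths / follow_up


def bootstrap_fits(records, n_boot, rng):
    """Refit the model to n_boot samples drawn with replacement, each of the original size."""
    n = len(records)
    return [fit_exponential(rng.choices(records, k=n)) for _ in range(n_boot)]


# Synthetic cohort: true hazard 1.5 per year, follow-up censored at 2 years.
rng = random.Random(42)
records = []
for _ in range(300):
    death_time = rng.expovariate(1.5)
    records.append((min(death_time, 2.0), death_time <= 2.0))

point_estimate = fit_exponential(records)
stored_fits = bootstrap_fits(records, 1000, rng)   # kept for use in the main simulation
spread = statistics.stdev(stored_fits)             # reflects uncertainty in the fitted curve
```

In the main simulation each simulated incident case would then be paired with one randomly chosen element of `stored_fits`, so that uncertainty in the fitted survival curve carries through to the prevalence estimate.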

We have calculated prevalence estimates for AML under the assumptions of no cure, 3-year cure and 5-year cure.

In the simulation, the age distribution of the simulated incident cases was simulated by drawing, for each year, a sample with size equal to the Poisson rate for that year from the actual age distribution of cases. For each prevalent simulated case at the index date, age was recorded. Finally, means of the number of male and female prevalent cases were taken together with appropriate quantiles to estimate 95% confidence intervals. For this exercise, N_sim was set to 1000. In general N_sim may be chosen to ensure stable repeatable results to a chosen degree of precision.

All calculations were performed in R version 3.0.1 [17] using the “survival” [18] and “rms” [19] libraries.

3. Results

Incidence (see Table 2) amongst males was estimated at 84.7 cases per year (4.90 per 100,000 per year); amongst females at 69.6 cases per year (3.78 per 100,000 per year). The observed incidence by year was consistent with being drawn from a homogeneous Poisson process with these rates (p = 0.34 for males, p = 0.30 for females; p values by simulation).
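A p value by simulation for consistency with a homogeneous Poisson process could be obtained along the following lines. The text does not state which test statistic was used, so the dispersion statistic below (the sum of squared deviations of the yearly counts from their mean, divided by the mean) is an assumption made for illustration. The yearly male counts are those listed in Table 2.

```python
import random


def poisson_draw(rate, rng):
    """Draw from Poisson(rate) by counting unit-rate exponential arrivals before `rate`."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < rate:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def dispersion(counts):
    """Sum of squared deviations from the mean count, divided by the mean."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    return sum((c - mean) ** 2 for c in counts) / mean


def simulated_p_value(observed, n_sim, rng):
    """Proportion of simulated Poisson series at least as dispersed as the observed one."""
    rate = sum(observed) / len(observed)
    target = dispersion(observed)
    hits = 0
    for _ in range(n_sim):
        sample = [poisson_draw(rate, rng) for _ in observed]
        if dispersion(sample) >= target:
            hits += 1
    return hits / n_sim


male_counts = [70, 80, 78, 82, 95, 90, 97]   # yearly male incidence from Table 2
p_male = simulated_p_value(male_counts, 2000, random.Random(0))
```

A large p value indicates that the year-to-year variation in the counts is no greater than would be expected from a Poisson process with constant rate.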

Of the 934 subjects, 58 died on the same day as or before final diagnosis was made. These subjects were allotted a nominal 1-day survival time. Overall survival, estimated by the Kaplan–Meier method, is shown in the upper panel of Fig. 3. Weibull survival modelling showed age at diagnosis to be a highly significant predictor ($p < 10^{-10}$). Modelling the effect of age on survival using spline techniques showed a slight (but statistically significant) non-linear effect; however, the magnitude of this non-linearity was small and had little effect on survival predictions; therefore age was modelled linearly in the final procedure. Sex, on the other hand, was not associated with survival (p = 0.23). However, we retain sex in the model because, for long term survival deriving from population data, sex is a significant explanatory variable for mortality. Survival model fit was assessed by inspection of predicted survival for ages 20, 30, 40, 50, 60, 70, 80 and 90 compared with the Kaplan–Meier estimate for all ages three years either side of each reference age. An example (for 70-year-olds) is shown in the lower panel of Fig. 3. The distribution of age at diagnosis among the incident cases is shown in Fig. 4 together with a smooth of the plot. Fig. 2 shows predicted survival from the Weibull model, together with the predicted survival from the 5-year cure model (in red).

Fig. 4. Age distribution of incident cases. Black dots represent observed number of incident cases in the registry by year of age at diagnosis; the blue line is a smoothing fit to this distribution. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)

Fig. 5. Estimated nth year prevalence (upper) and cumulative n-year prevalence (lower) by year before index date. Male subjects are represented by blue dots and lines; females by red dots and lines. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)

Fig. 6. Incidence (red) and prevalence (green) age distributions for 20 year prevalence (5 year cure model). Distribution overlap shows in darker green. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)

The overall simulation model was internally validated (see Table 2) by predicting the prevalent cases arising from the known number of incident cases for the six years 2005–2011 and by comparing this prediction with the actual number of prevalent cases at the index date. Model fit gave no cause for concern: no cure model chi-squared = 7.6 on 11 d.f. (p = 0.75); 5-year cure model chi-squared = 7.3 on 11 d.f. (p = 0.78); 3-year cure model chi-squared = 7.1 on 11 d.f. (p = 0.79).
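The step from a chi-squared statistic and its degrees of freedom to a p value can be checked with a short calculation. The Python sketch below uses the standard closed form for the upper tail of the chi-squared distribution with an odd number of degrees of freedom; the function name is an assumption, and the observed and predicted counts that produced the statistics are not reproduced here.

```python
import math

def chi2_upper_tail_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variate with odd df."""
    if df < 1 or df % 2 != 1:
        raise ValueError("this closed form only covers odd degrees of freedom")
    chi = math.sqrt(x)
    term = chi  # first series term: chi**(2r-1) / (1*3*...*(2r-1)) with r = 1
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)  # next term: multiply by chi**2 / (2r+1)
    density_factor = math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0)
    return math.erfc(chi / math.sqrt(2.0)) + density_factor * total

p_values = {stat: chi2_upper_tail_odd_df(stat, 11) for stat in (7.6, 7.3, 7.1)}
```

Applied to the three statistics quoted above, the function should return values close to the reported p values; small differences are to be expected because the statistics themselves are quoted to one decimal place.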

Prevalence estimates from the model, together with 95% confidence intervals, are shown in Table 3. Prevalence for the first six years before the index date was simply counted from the observed data. Prevalence from before six years prior to the index date was estimated from the simulation model. Fig. 5 plots the nth year (upper figure) and the n-year (lower figure) prevalence by year before index date. As an illustration of the method, Fig. 6 shows the age distribution of incident cases as well as the age distribution of 20-year prevalent cases (assuming the 5-year cure model) at the index date. It is clear from Fig. 6 that the 20-year prevalent population is younger (median age 56.0) than the incident population (median age 71.7).
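The relationship between the two quantities plotted in Fig. 5 is simply that n-year prevalence is the running total of nth year prevalence. A brief Python sketch, using hypothetical counts rather than the values in Table 3:

```python
from itertools import accumulate

# Hypothetical numbers of prevalent cases at the index date, by year of
# diagnosis before the index date (year 1 = the most recent year).
nth_year_prevalence = [55, 30, 22, 18, 15, 13]

# n-year prevalence: cases alive at the index date that were diagnosed
# in any of the n years before it.
n_year_prevalence = list(accumulate(nth_year_prevalence))
```

The same accumulation applies whether the yearly figures are counted directly from the registry or taken from the simulation.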

4. Discussion

We have presented an approach to the computation of estimates of disease prevalence based on the three-step schema of Gigli et al. [2], substituting simulation-based techniques for analytical techniques in steps 2 and 3.

The increasing availability of powerful computing resources has meant that epidemiological calculations are increasingly amenable to Monte-Carlo simulation techniques. Estimation of the prevalence of disease in the population from incidence and survival characteristics is ideally suited to a simulation approach, as is the estimation of the precision of these estimates. In addition to the conceptual simplicity of the approach, simulation allows for the modelling of more complex incidence and survival characteristics. For example, seasonal variation in incidence might be modelled by a periodic non-homogeneous Poisson process. Similarly, temporal changes in the incidence process (due, for example, to changes in disease classification or to new preventative measures) or in the survival process (due, for example, to improvements in treatment) can easily be incorporated into the simulation, provided that they can be modelled from observed data.
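To indicate how a seasonal incidence process of this kind might be simulated, the following Python sketch generates diagnosis times from a periodic non-homogeneous Poisson process by thinning: candidate events are generated at the maximum rate and each is kept with probability equal to the ratio of the current rate to that maximum. The sinusoidal form of the rate and the amplitude are hypothetical choices for illustration.

```python
import math
import random

def seasonal_rate(t_years, base=84.7, amplitude=0.3):
    """Hypothetical incidence rate (cases per year) with a one-year period."""
    return base * (1.0 + amplitude * math.sin(2.0 * math.pi * t_years))

def simulate_nhpp(rate_fn, rate_max, horizon_years, rng):
    """Event times on [0, horizon_years) by thinning; rate_fn must not exceed rate_max."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_max)  # next candidate from the bounding process
        if t >= horizon_years:
            return times
        if rng.random() * rate_max < rate_fn(t):  # accept with probability rate/rate_max
            times.append(t)

rng = random.Random(1)
diagnosis_times = simulate_nhpp(seasonal_rate, 84.7 * 1.3, 20.0, rng)
```

The simulated diagnosis times could then replace the yearly Poisson draws in the main simulation, with survival sampled for each case exactly as before.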

Simulation also facilitates the calculation of other quantities that are not so easily calculated by conventional means. We have given the example of the prevalence age distribution compared to the incidence age distribution above; provided that the distribution of subject characteristics can be measured and sampled from, this example can be generalized to these other characteristics. We also note that the method allows for the calculation of n-year and nth year prevalence for any n, or indeed for the calculation of prevalence at an index date arising from incident cases from any time interval before that date; similarly, future projections of prevalence are easily calculated, provided that future incidence and survival can be modelled (using age-period-cohort techniques or discrete time models [22], for example).

As we have noted, the simulation approach very naturally allows for the estimation of the precision of prevalence estimates, taking into account all the possible sources of variation: the precision of the incidence rate estimate; the stochastic nature of the incidence process; and the precision of the survival estimate. Previous authors [2,14] have noted that the precision of the prevalence estimate will be approximately governed by the variance of the incidence process; our simulations have confirmed this approximation, but have allowed for more precise estimates of this precision.

We have illustrated our new technique with the example of acute myeloid leukaemia (AML). From the point of view of prevalence calculations, AML is a disease for which overall survival is poor (3-year survival 18%, Fig. 3 upper figure) but for which cure is possible with increasing probability with decreasing age (Fig. 2). In addition, we observe that the age distribution of incident cases is heavily skewed towards old age (Fig. 4). The consequence of these facts is that the age distribution of prevalent cases is markedly different from that of incident cases (Fig. 6). Although we have illustrated this effect for 20-year prevalence, the same remains true for n-year prevalence estimates for small n (data not shown); survival is so poor for elderly patients that the form of the long-term prevalence age distribution is reached rapidly.

In conclusion, the new technique for the calculation of disease prevalence that we have presented in this paper provides a means for prevalence estimation from incidence and survival data that is more flexible than the previous techniques that have appeared in the literature. Although we have illustrated the technique with the example of AML, an acute disease with poor overall survival, the technique is not limited to diseases with such characteristics. Provided that incidence and survival can be modelled for the population under study, the simulation technique can be applied.

Source of funding

The authors are supported by a programme grant from Lymphoma and Leukaemia Research (http://leukaemialymphomaresearch.org.uk/).

Conflicts of interest

No conflicts of interest declared.

References

[1] Greenland S, Rothman KJ. Measures of occurrence. In: Rothman KJ, Greenland S, Lash TL, eds. Modern epidemiology. 3rd ed., Philadelphia: Lippincott Williams & Wilkins, 2012: 32–50.

[2] Gigli A, Mariotti A, Clegg L, Tavilla A, Corazziari I, Capocaccia R, et al. Estimating the variance of cancer prevalence from population-based registries. Stat Methods Med Res 2006;15:235–53.

[3] Verdecchia A, Capocaccia R, Egidi V, Golini A. A method for the estimation of chronic disease morbidity and trends from mortality and survival data. Stat Med 1989;8:201–6.

[4] Capocaccia R, De Angelis R. Estimating the completeness of prevalence based on cancer registry data. Stat Med 1997;16:425–40.

[5] Gail M, Kessler L, Midthune D, Scoppa S. Two approaches for estimating disease prevalence from population-based registries of incidence and total mortality. Biometrics 1999;55:1137–44.

[6] Verdecchia A, De Angelis G, Capocaccia R. Estimation and projections of cancer prevalence from cancer registry data. Stat Med 2002;21:3511–26.

[7] Preston P. Relations among standard epidemiological measures in a population. Am J Epidemiol 1987;126:336–45.

[8] Keiding N. Age-specific incidence and prevalence: a statistical perspective. J R Stat Soc A 1991;154:371–412.

[9] Capocaccia R. Relationships between incidence and mortality in non-reversible diseases. Stat Med 1993;12:2395–415.

[10] Maddams J, Brewster D, Gavin A, Steward J, Elliott J, Utley M, et al. Cancer prevalence in the United Kingdom: estimates for 2008. Br J Cancer 2009;101:541–7.

[11] Merrill R, Capocaccia R, Feuer E, Mariotti A. Cancer prevalence estimates based on tumour registry data in the Surveillance, Epidemiology, and End Result (SEER) program. Int J Epidemiol 2000;29:197–207.

[12] Pisani P, Bray F, Parkin D. Estimates of the world-wide prevalence of cancer for 25 sites in the adult population. Int J Cancer 2002;97:72–81.

[13] Maddams J, Utley M, Moller H. Projections of cancer prevalence in the United Kingdom, 2010–2040. Br J Cancer 2012;107:1195–202.

[14] Capocaccia R, Colonna M, Corazziari I, De Angelis R, Francisci S, Micheli A, et al. Measuring cancer prevalence in Europe: the EUROPREVAL project. Ann Oncol 2002;13:831–9.

[15] Smith A, Roman E, Howell D, Jones R, Patmore R, Jack A. The Haematological Malignancy Research Network (HMRN): a new information strategy for population based epidemiology and health service research. Br J Haematol 2009;148:739–53.

[16] Cancer Research UK Cancer Survival Group. strel computer program and life tables for cancer survival analysis. UK: Department of Non-Communicable Disease Epidemiology, London School of Hygiene & Tropical Medicine, 2013, Available from www.lshtm.ac.uk/ncde/cancersurvival/tools (accessed 30.07.13).

[17] R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2013, http://www.R-project.org/.

[18] Therneau TM, Grambsch PM. Modeling survival data: extending the Cox model. New York: Springer, 2000.

[19] Harrell Jr FE. rms: Regression Modeling Strategies. R package version 3.6-3; 2013, http://CRAN.R-project.org/package=rms.

[20] Rizzo NL. Statistical computing with R. London: Chapman & Hall, 2008.

[21] Davison AC, Hinkley DV. Bootstrap methods and their application. Cambridge: Cambridge University Press, 1997.

[22] Fiorentino F, Maddams J, Moller H, Utley M. Modelling to estimate future trends in cancer prevalence. Health Care Manage Sci 2011;14:262–6.

