
Volume-based Response Evaluation with Consensual Lesion Selection: A Pilot Study by Using Cloud Solutions and Comparison to RECIST 1.1

Estanislao Oubel, PhD, Eric Bonnard, MD, Naoko Sueoka-Aragane, MD, PhD, Naomi Kobayashi, MD, Colette Charbonnier, PhD, Junta Yamamichi, MSc, MPH, Hideaki Mizobe, MSc, Shinya Kimura, MD, PhD

Rationale and Objectives: Lesion volume is considered a promising alternative to Response Evaluation Criteria in Solid Tumors (RECIST) measurements because it could make tumor measurements more accurate and consistent, which would enable an earlier detection of temporal changes. In this article, we report the results of a pilot study evaluating the effects of a consensual lesion selection on volume-based response (VBR) assessments.

Materials and Methods: Eleven patients with lung computed tomography scans acquired at three time points were selected from the Reference Image Database to Evaluate Response to therapy in lung cancer (RIDER) and proprietary databases. Images were analyzed according to RECIST 1.1 and VBR criteria by three readers working in different geographic locations. Cloud solutions were used to connect readers and carry out a consensus process on the selection of lesions used for computing response. Because there are no currently accepted thresholds for computing VBR, we applied a set of thresholds based on measurement variability (−35% and +55%). The benefit of this consensus was measured in terms of multiobserver agreement by using the Fleiss kappa (kfleiss) and corresponding standard errors (SE).

Results: VBR after consensual selection of target lesions yielded kfleiss = 0.85 (SE = 0.091), which increases up to 0.95 (SE = 0.092) if an extra consensus on new lesions is added. As a reference, the agreement when applying RECIST without consensus was kfleiss = 0.72 (SE = 0.088). These differences were found to be statistically significant according to a z-test.

Conclusions: An agreement on the selection of lesions reduces the inter-reader variability when computing VBR. Cloud solutions proved to be an interesting and feasible strategy for standardizing response evaluations, reducing variability, and increasing consistency of results in multicenter clinical trials.

Key Words: Clinical trials; RECIST; cloud computing; consensus; lesion volume; biomarkers; volume thresholds.

©AUR, 2015. http://dx.doi.org/10.1016/j.acra.2014.09.008

Acad Radiol 2015; 22:217–225

From the R&D Department, MEDIAN Technologies, Les Deux Arcs B, 1800 Route des Crêtes, Valbonne 06560, France (E.O., C.C.); Radiology Department, Nice University Hospital, Nice, France (E.B., S.K.); Division of Hematology, Respiratory Medicine and Oncology, Department of Internal Medicine, Faculty of Medicine, Saga University, Saga, Japan (N.S.A., N.K.); and Global Healthcare IT Project, Medical Equipment Group, Canon Inc, Tokyo, Japan (J.Y., H.M.). Received April 25, 2014; accepted September 20, 2014. Address correspondence to: E.O. e-mail: estanislao.oubel@mediantechnologies.com

According to recent statistics of the World Health Organization, cancer is the leading cause of death worldwide, accounting for 8.2 million deaths in 2012 (1). Lung cancer accounted for 1.6 million deaths (19.4%), which makes it the first cause of cancer death, ahead of liver (9.1%) and stomach (8.8%) cancers. Computed tomography (CT) is currently the standard imaging modality for assessing the response to treatment in patients with solid tumors. In general, the quantification of the response is performed by using Response Evaluation Criteria in Solid Tumors (RECIST) (2,3). In summary, this standard establishes the way of measuring lesions and provides a set of thresholds to classify the response into partial response, stable disease, and progressive disease. In the context of this article, the term RECIST refers to its revised version (1.1) (3).
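As a concrete illustration of how such threshold-based classification works, the sketch below applies the published RECIST 1.1 cutoffs (−30% from baseline for partial response, +20% from the nadir for progression) to the sum of lesion diameters. It deliberately omits details such as the 5-mm absolute growth rule, complete response, and new lesions, so it is an illustration rather than a full implementation. The code examples in this article are written in R, the language used for the study's statistical analyses.

    # Minimal sketch of RECIST 1.1 threshold classification (illustration only).
    # sld: current sum of lesion diameters (mm); baseline and nadir likewise.
    classify_recist <- function(sld, baseline, nadir) {
      if ((sld - nadir) / nadir >= 0.20) return("PD")        # progressive disease
      if ((sld - baseline) / baseline <= -0.30) return("PR") # partial response
      "SD"                                                   # stable disease
    }

    classify_recist(sld = 60, baseline = 80, nadir = 55)  # "SD"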

Even when RECIST criteria have been broadly adopted in clinical trials, they present some drawbacks that are a source of measurement variability. This is quite inconvenient because consistency in the production of trial results is necessary for comparison purposes (4). For example, the lesion size is measured as its longest axial diameter, which is not a robust measure in case of complex lesions and creates problems of accuracy and precision (5). To cope with this drawback, the use of volume is currently being considered as a promising direction to make tumor measurements more accurate and consistent, which would enable an earlier detection of temporal changes (6–9). The benefit of using volume as a biomarker has already been reported in the literature (5,10); however, the use of volume also presents some drawbacks, like the long segmentation time in case of big lesions (in particular for thin-slice acquisitions) and the lack of accepted thresholds for computing response. Regarding this last point, it is worth mentioning the intensive efforts of the Quantitative Imaging Biomarkers Alliance (11) to better understand volumetric biomarkers and their sources of variability. In the context of this article, the term volume-based response (VBR) refers to the response estimated exactly as established by RECIST, except for the use of lesion volumes and volume-specific thresholds instead of diameters. The name VBR was preferred to names like 3D-RECIST to avoid confusion with the application of RECIST with 3D-extended thresholds and also because there is currently no formal definition of a volumetric version of RECIST.

Another reported source of variability of RECIST is the difference in the target lesions (TLs) selected for computing response (12,13). TLs are selected on the basis of their size and their suitability for reproducible measurements (6). Even when it is relatively easy to measure the lesion size required for inclusion, it is more difficult to assess measurement repeatability by visual inspection only. For example, the edges of irregular or infiltrating lesions are often difficult to define and, in some cases, they are even impossible to measure (14). Another commonly found obstacle is the presence of peritumoral fibrosis, which is difficult to distinguish from tumor spread and adds further uncertainty to measurements. A very interesting discussion about the limitations of RECIST can be found in (15).

To cope with the problems of variability mentioned before, regulatory authorities have been recommending the use of an Independent Central Review (ICR). The original purpose of ICR as recommended by the Food and Drug Administration (FDA) was to eliminate the bias associated with local evaluation (LE). This is thought to be relevant for studies in which the investigator knows the patients enrolled in different trial arms. Another advantage of ICR is the possibility of standardizing evaluations because all patients are evaluated by a restricted number of reviewers (typically one reviewer and one adjudicator) using the same software solutions. However, the advantages of ICR over LE have been questioned in the literature (16–18), and some alternatives to ICR, like the use of audit tools, are currently being considered (19,20). The main criticism is that ICR does not remove bias completely, and in some cases, it may introduce bias by itself (eg, by informative censoring). Finally, the implementation and management of the ICR process is costly and burdensome for sites and sponsors.

Several analyses, notably the one by the Pharmaceutical Research and Manufacturers Association Working Group (21), show that no systematic bias is introduced by LE. The same conclusion has been presented recently by the FDA, which proposes the use of ICR only as an audit tool to detect evaluation bias in LE assessments (19). However, in an LE approach, there is an intrinsic variability among sites because evaluation protocols cannot be guaranteed to be the same or equivalent. Therefore, a standardization of the evaluation could make results from different sites more consistent.

Cloud computing offers an opportunity for reducing the variability in the application of RECIST in an LE approach because it can facilitate the interaction between different stakeholders and provide common tools for image analysis (which reduces the variability coming from the use of different measurement systems). The application of cloud computing in the context of clinical trials has become possible thanks to the availability of high-capacity networks, low-cost computers, and storage services. The huge amount of data generated by medical imaging acquisition systems makes cloud computing an interesting alternative for processing, storing, and sharing images. Besides, clouds free customers (eg, hospitals) from the responsibility of installing and maintaining hardware and basic computational services because these tasks are performed by cloud providers.

In this article, we have evaluated a cloud-based system to perform a consensus on the selection of lesions used for computing VBR. As suggested by previous publications (12,13), this is expected to reduce the inter-reader variability in LE approaches. Even when in this article we focus on the specific problem of lesion selection, the proposed workflow can be extended to control other sources of variability. The final objective is to make the application of RECIST more consistent among sites. Several contributions are provided in this article. First, we show the feasibility of applying cloud computing in the context of multicenter clinical trials. To do this, we propose a novel cloud-based workflow enabling a more consistent application of RECIST among participant sites. Second, we applied this workflow to investigate the potential benefits of a lesion consensus on the response. This is only one specific application among others that illustrates how the system could be used and the advantages provided. Third, we compare the proposed approach with the current standard and LE. Finally, we analyze the impact on RECIST compliance, which is expected to be higher after a consensus process.

MATERIALS AND METHODS

Data Set

Eleven patients (mean age, 62 years; six men and five women) presenting solid lesions were retrospectively selected from the RIDER (22) (six patients) and Median Technologies' proprietary (five patients) databases. The selection criteria were the type of lesion (solid), location (lung), number of acquired time points (at least baseline, 3, and 6 months), and homogeneity of image acquisition parameters (eg, filter type). Chest CT scans with contrast were performed by using General Electric LightSpeed and Siemens Sensation scanners. Image acquisition was performed at 120 kVp and 354 ± 103 mAs. The slice thickness was ≤2.5 mm (1.93 ± 0.63 mm), and the in-plane resolution was ≤0.78 mm (0.71 ± 0.06 mm). A filtered back projection reconstruction method was used in all cases, with B60f (Siemens) and Lung (GE) convolution kernels.

Lesions were selected and measured by two oncologists with 27 and 12 years of experience (the on-site readers) and a radiologist with 5 years of experience (the independent reviewer [IR]). The oncologists are specialized in clinical trials and perform RECIST evaluations in the clinical routine. Figures 1–4 show the distribution of the selected lesions according to different characteristics.

Figure 1. Distribution of lesions according to their longest axis diameter (D).

Figure 2. Distribution of lesions according to location. LLL, left lower lobe; LUL, left upper lobe; MED, mediastinum; RLL, right lower lobe; RML, right middle lobe; RUL, right upper lobe.

Figure 3. Distribution of lesions according to surroundings. JP, juxta-pleural; JV, juxta-vascular; PF, perifissural; SL, solitary lesion.

Figure 4. Distribution of lesions according to shape.

Cloud-based Setup System

Figure 5 shows the cloud solution that was set up and the different stakeholders involved in the reading process, which can be explained in a general manner as follows:

1) The Data Managers (DMs) perform a quality control of the images and, if the quality control test is successful, the images are transferred to the data center, where the processing required before analysis is performed (image identification, image reconstruction, lung mask extraction, and so forth). Once the processing is completed, the DMs make the images available to the IR and readers for analysis.

2) The reader retrieves the images of a particular patient from the data center and performs a lesion segmentation and response assessment according to RECIST and VBR. The performed analysis is saved into the data center for it to be reviewed by the IR.

3) The reader and IR start exchanging opinions about the selected TLs, and the reader is allowed to modify his selection if the feedback provided is considered to be relevant. This process, named "consensus," is coordinated by the DMs, who are in charge of keeping records of changes in evaluations.

4) Once an agreement or a maximum number of iterations (set to four in this study) is reached, the final response is calculated; a minimal sketch of this loop is given below. For clarity and readability purposes, it is sometimes necessary to emphasize under which conditions (with or without consensus) the response was calculated. Therefore, we use a superscript (asterisk) to distinguish between responses computed with and without consensus. In this way, RECIST* and VBR* refer to responses calculated after consensus; RECIST and VBR refer to the response before consensus or when talking about the criteria in a general manner.

Figure 5. Architecture of the cloud solution showing the different stakeholders involved in the reviewing process. UK, United Kingdom.
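The iteration cap referenced in step 4 can be summarized programmatically. In the sketch below, ir_review and the structure of its return value are hypothetical placeholders standing in for the human review steps; only the four-iteration maximum comes from the study.

    # Illustrative sketch of the consensus loop (steps 3 and 4).
    # ir_review is a hypothetical stand-in for the IR's feedback step.
    run_consensus <- function(reader_selection, ir_review, max_iter = 4) {
      for (i in seq_len(max_iter)) {
        feedback <- ir_review(reader_selection)         # IR comments on the TLs
        if (feedback$agreed) break                      # agreement reached
        reader_selection <- feedback$revised_selection  # reader updates selection
      }
      reader_selection  # final selection used to compute the response
    }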

The stakeholders were based in different geographic locations, which allowed evaluating the solution for a potential use in multicountry clinical trials. The DMs were based in Sophia Antipolis (France), the readers in Saga (Japan) and Glasgow (United Kingdom), and the IR in Nice (France); the data center (Canon IT Solutions Inc.) was installed in Tokyo (Japan). A more detailed workflow of the reading process is shown in Figure 6.

Volume Measurement

The lesion volume was computed from segmentations performed by using Lesion Management Solutions (Median Technologies, Sophia Antipolis, France). If necessary, the initial results provided by the semiautomatic segmentation were corrected by using interactive segmentation tools implementing shape-based interpolation methods (23). To remove normal lung tissue from the lesion, a final threshold step was automatically applied after each correction performed by the user.
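As an illustration of what such a final threshold step might look like, the sketch below masks out voxels in the aerated-lung density range after a manual correction. The −400 HU cutoff is an assumed value chosen for illustration; the actual threshold used by Lesion Management Solutions is not reported in the article.

    # Illustrative final threshold step: remove aerated lung tissue from a
    # corrected lesion mask. The -400 HU cutoff is an assumed value.
    apply_lesion_threshold <- function(hu_volume, lesion_mask, cutoff = -400) {
      lesion_mask & (hu_volume > cutoff)  # keep only soft-tissue-density voxels
    }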

Response Agreement

Target Lesions. We have evaluated the impact of the consensus on the inter-reader response agreement by comparing TL-based responses before and after consensus. As explained before, the response without consensus is always assessed before the IR's feedback and is potentially modified after it. Therefore, both responses are available, and a comparison of agreements can be carried out. We have quantified the inter-reader agreement by using the Fleiss kappa (kfleiss) statistic (24). The Fleiss kappa allows computing the agreement between multiple observers when assigning categorical ratings to a number of statistical units.
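As a pointer to how such a computation looks in practice, the sketch below computes kfleiss with the irr package used by the authors (see the section on data analysis); the ratings matrix here is fabricated for illustration.

    library(irr)  # provides kappam.fleiss()

    # Hypothetical ratings: rows are statistical units (patient/time point
    # responses), columns are the three readers; PR/SD/PD are the categories.
    ratings <- data.frame(
      reader1 = c("PR", "SD", "PD", "SD", "PR"),
      reader2 = c("PR", "SD", "PD", "PD", "PR"),
      reader3 = c("PR", "SD", "PD", "SD", "PR")
    )
    kappam.fleiss(ratings)  # prints kappa, its z statistic, and the p-value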

The inter-reader agreement was also computed for the tumor burden (sum of lesion volumes) and its change over time. The objective of this analysis was to evaluate the impact of the consensus on the measure itself, independently of thresholds. This is important because the use of thresholds could hide differences in the continuous values, which is an interesting result even if no changes are observed in the categorical data. This agreement was quantified by using reproducibility coefficient values (25), which can be computed for multiple observers. A Shapiro-Wilk test of normality (26) was performed, and a logarithmic transformation was applied to the data in case of null hypothesis rejection, for normalization purposes.
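The normality check and conditional log transformation can be expressed compactly. A minimal sketch, assuming slv holds the continuous tumor burden values and using the stats functions named in the data analysis section:

    # Check normality of the tumor burden values and log-transform if the
    # Shapiro-Wilk null hypothesis (normality) is rejected at the 5% level.
    normalize_burden <- function(slv) {
      if (shapiro.test(slv)$p.value < 0.05) log(slv) else slv
    }

    slv <- c(12.3, 8.1, 150.2, 30.7, 5.9, 72.4)  # fabricated volumes (cm^3)
    ln_slv <- normalize_burden(slv)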

New Lesions. The consensus process was performed on TLs only. However, the impact of new lesions (NLs) on the response is strong because the appearance of at least one NL automatically changes the response status to progressive disease according to RECIST, independently of changes in TLs. Therefore, the effect on the response of a potential consensus on NLs deserves special consideration. To analyze this, we simulated a consensus on NLs by assigning to all readers any NL identified by at least one reader. For example, suppose that reader #1 identifies an NL at a given time point of a given patient, but this lesion was missed by both reader #2 and the IR; then this NL is also taken into account for computing the responses of reader #2 and the IR. The Fleiss kappa was also used to measure the inter-reader agreement.
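This simulated consensus is effectively a union of the readers' NL detections. A minimal sketch, assuming a logical matrix nl_detected with one row per reader and one column per patient/time point:

    # Simulated NL consensus: an NL found by at least one reader is assigned
    # to all readers before recomputing the response.
    nl_detected <- matrix(c(TRUE,  FALSE, FALSE,   # reader #1
                            FALSE, FALSE, TRUE,    # reader #2
                            FALSE, FALSE, FALSE),  # IR
                          nrow = 3, byrow = TRUE)
    nl_after_consensus <- apply(nl_detected, 2, any)  # union over readers
    # Each TRUE column now forces progressive disease for all three readers.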

Figure 6. Unified Modeling Language diagram of the reading process. The sequence part inside the loop corresponds to the consensus process aiming to apply RECIST uniformly. IR, independent reviewer.

Volume Thresholds. Currently, there are no accepted thresholds for assessing response based on volume changes. One could simply apply the RECIST thresholds after extrapolation to three dimensions but, according to Mozley et al. (10), this works well only in situations where the tumor morphology is simple and cannot be applied in a general manner. Therefore, we preferred to apply thresholds empirically estimated from measurements of interobserver variability, set at +55% for progression and −35% for response (27). The rationale behind this set of thresholds is that the actual differences in lesion size can be hidden by the variability of the measure. Then, for a difference in size to be measured significantly, it must necessarily be larger than this variability. This way of estimating thresholds is similar to the approach followed by Mozley et al. (10); the main difference is the application of the Geary-Hinkley transformation to model response evaluation.
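A minimal sketch of VBR classification under these thresholds follows; it mirrors the RECIST sketch given in the introduction but operates on the sum of lesion volumes (SLV), and it again ignores NLs and the other RECIST criteria. For reference, a naive 3D extrapolation of the RECIST 1.1 diameter thresholds would give (1.20)^3 − 1 ≈ +72.8% and (0.70)^3 − 1 ≈ −65.7%.

    # VBR classification with the variability-based thresholds used in the
    # study: +55% (vs. nadir) for progression, -35% (vs. baseline) for response.
    classify_vbr <- function(slv, baseline, nadir) {
      if ((slv - nadir) / nadir >= 0.55) return("PD")
      if ((slv - baseline) / baseline <= -0.35) return("PR")
      "SD"
    }

    classify_vbr(slv = 45, baseline = 80, nadir = 40)  # "PR": -43.8% vs baseline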

Data Analysis. All statistical analyses were performed with the R software (28). The Fleiss kappa was computed by using the irr package (29), and the Shapiro-Wilk test of normality was computed by using the stats package. A z-test on the original and bootstrapped data was applied to assess the statistical significance of differences in kfleiss.
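The article does not spell out the z-test formula. A minimal sketch, assuming two independent kappa estimates with their standard errors and a one-sided alternative, is below; applied to the values in Table 1, it approximately reproduces the statistics reported in the results section.

    # One-sided z-test for the difference between two independent Fleiss
    # kappas, using their standard errors (assumed formulation).
    kappa_ztest <- function(k1, se1, k2, se2) {
      z <- (k1 - k2) / sqrt(se1^2 + se2^2)
      c(z = z, p = 1 - pnorm(z))  # H1: k1 > k2
    }

    kappa_ztest(0.95, 0.093, 0.76, 0.090)  # VBR* vs VBR: z ≈ 1.47, p ≈ .07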

TABLE 1. Fleiss Kappa and Standard Errors (Between Parentheses) According to Different Types of Consensus and Criteria

TL Consensus   NL Consensus   TL-Only Response   RECIST               VBR
—              —              X                  0.72 (0.088)/SuA     0.76 (0.090)/SuA
—              —              —                  0.72 (0.090)/SuA     0.76 (0.090)/SuA
X              —              X                  0.72 (0.088)/SuA     0.85 (0.091)/APA
X              —              —                  0.72 (0.088)/SuA     0.85 (0.092)/APA
X              X              —                  0.81 (0.089)/APA     0.95 (0.093)/APA

APA, almost perfect agreement; NL, new lesion; RECIST, Response Evaluation Criteria in Solid Tumors; SuA, substantial agreement; TL, target lesion; VBR, volume-based response. Consensus on TLs, consensus on NLs, and TL-only response are marked with "X" when applied and "—" when not. The interpretation of values follows Landis and Koch (30): slight/fair/moderate/substantial/almost perfect agreement. The highest value is 0.95 (VBR with consensus on both TLs and NLs).

TABLE 2. Reproducibility Coefficient for ln(SLV), ln(ΔSLV_BL), and ln(ΔSLV_NADIR) for Consensual and Nonconsensual Selection of Target Lesions

Consensus   ln(SLV)   ln(ΔSLV_BL)   ln(ΔSLV_NADIR)
No          2.081     1.167         0.915
Yes         0.593     0.678         0.677

SLV, sum of lesion volumes; ΔSLV_BL, change in SLV with respect to baseline (BL); ΔSLV_NADIR, change in SLV with respect to nadir.

TABLE 3. Contingency Table of Responses Provided by RECIST and VBR*

                 RECIST
VBR*      PR   SD   PD   Total
PR        14   15    0      29
SD         0    8    2      10
PD         2    2   23      27
Total     16   25   25      66

PD, progressive disease; PR, partial response; RECIST, Response Evaluation Criteria in Solid Tumors; SD, stable disease; VBR, volume-based response. VBR* was computed after consensus on target lesions and new lesions, whereas RECIST was calculated in the classical way (without consensus).

RESULTS

Response Agreement

Table 1 shows the inter-reader agreement measured by the Fleiss kappa (24) for different types of consensus and response criteria. The table shows that the use of volume provides higher levels of agreement than RECIST. The table also shows that, differently from one-dimensional measurements, VBR does benefit from the consensus on TLs (RECIST and RECIST* provide similar results). A paired-sample z-test for the mean of paired differences between VBR* and RECIST rejected the null hypothesis in favor of the alternative hypothesis of higher agreement for VBR* (z-value = 1.77; P value = .038). This means that the difference found between both approaches is statistically significant. A z-test was also applied to compare VBR* and VBR (ie, to measure the effects of the consensus only). In this case, the agreement with consensus was significantly higher than without consensus at a 10% level (z-value = 1.46; P value = .072). Finally, when comparing RECIST versus VBR, differences were not statistically significant in favor of volume (z-value = 0.31; P value = .378).

We have also performed z-tests on 1000 bootstrapped combinations of the original ratings. In this case, the z-test rejected the null hypotheses with P value <.01 for all three comparisons performed before.

Table 2 shows reproducibility coefficient values (25) for the sum of lesion volumes (SLV), its change with respect to baseline (ΔSLV_BL), and its change with respect to nadir (ΔSLV_NADIR). A Shapiro-Wilk test of normality showed that the distributions of these parameters were not normal. To reduce these deviations from normality, the values were transformed by using the logarithmic function. The success of this correction was verified by a new application of the normality test, which showed that the null hypothesis could not be rejected. Table 2 shows that the reproducibility coefficient values decrease after consensus (ie, the inter-reader agreement increases), which is consistent with the results presented in Table 1. These results provide further evidence of the reduction of variability when there is an agreement on the lesions to be used for response assessment.

Comparison Between VBR and RECIST

Sometimes, the response agreement may be meaningless. For example, if thresholds are too high, only one type of response and 100% agreement are obtained; however, this result is meaningless because the set of thresholds is not discriminative. As RECIST is a criterion already validated and broadly accepted by the scientific community, a new response criterion is expected to provide similar results for a large population (but not exactly the same, for the new criterion to be interesting). The kappa value between RECIST and VBR* on TLs and NLs was equal to 0.53 (SE = 0.152), which corresponds to a moderate agreement according to the classification by Landis and Koch (30).

Besides the kappa value itself, it is interesting to analyze the contingency table used for its calculation. Table 3 shows that the contingency matrix presents some asymmetries deserving further analysis. Even when the lack of ground truth precludes an analysis of accuracy, a comparison between methods can be performed. When the rows and columns of the contingency matrix are ordered by decreasing order of response, the matrix becomes close to upper triangular, which means that VBR* presents a positive bias with respect to RECIST. In other words, for each category of response provided by VBR*, the response provided by RECIST is equal or inferior. For example, when a partial response is provided by VBR*, RECIST provides either partial response or stable disease; when VBR* is stable disease, RECIST says stable disease or progressive disease. Of course, this is not valid for the last row because the matrix is not strictly upper triangular.

TABLE 4. Response Agreement (as a Percentage of Matches) According to the Number of Lesions Used for Computing Response

1L    2L    4L     DIFF(1-2)   DIFF(2-4)
93%   75%   100%   0%          83%

1L/2L/4L, one/two/four lesion(s) chosen by both raters; DIFF(1-2), different number of lesions chosen, varying between 1 and 2; DIFF(2-4), different number of lesions chosen, varying between 2 and 4.

Number of Lesions

Some articles in the literature (31,32) suggest that the number of lesions is important when computing response. More specifically, the use of a large number of lesions has been shown to reduce the interobserver variability. We have analyzed this aspect for our data by using volume measurements before consensus to assess whether this is actually a source of variability. This is interesting because the number of selected lesions is a parameter of lesion selection that could potentially be controlled during the consensus process, which would be an interesting extension of the workflow presented in this article. For example, if there is a difference in the number of selected lesions between readers, and if an influence of such differences on the variability is proven, the IR could ask one of the readers to select additional lesions to match the number of lesions selected by the second reader.

We have compared the responses between readers by taking two readers at a time. The results presented in Table 4 confirm that the number of lesions used for computing response is important for reducing variability. In the cases where four lesions were chosen, 100% response agreement was achieved. This supports the claims of Moskowitz et al. (31), who propose a minimum number of five lesions for avoiding response misclassification. These results are also in full agreement with Darkeh et al. (32), who discourage the use of fewer than four lesions because inter-rater discrepancies may be introduced.

RECIST Compliance

Before consensus, we found 12% of response evaluations noncompliant with RECIST, whereas after consensus all evaluations were RECIST-compliant. The following RECIST nonconformities were found:

- Number of lesions per organ higher than two. For example, one reader selected three pulmonary TLs, but the maximum number of lesions per organ established by RECIST is two.

- Lesion location mistakes. For example, pulmonary lesions labeled as mediastinal.

- Measurability problems. For example, one of the readers chose lesions with ill-defined contours next to the pulmonary artery, which could not be measured accurately.

Noncompliant RECIST evaluations (and sources of deviation) have already been reported in the literature (33); these results are therefore consistent with previously published findings.

DISCUSSION

Table 1 suggests that the RECIST-based response agreement is independent of the consensus on TLs. Indeed, the choice of the same TLs does not guarantee a lower variability. This situation may seem paradoxical, but the following example provides an explanation for this observation. Let us consider, for example, the lesion shown in Figure 7. This lesion is quite irregular, and the measured diameters are very different, which leads to a correspondingly high response variability even if all readers choose only this lesion as a target. On the other hand, the same Table 1 shows that a consensus on NLs improves the agreement, which is probably owing to the strong effect of this type of lesion on the response.

Table 1 shows that the consensus on TLs has an impact on the VBR agreement; however, no differences were observed for RECIST. These differences may be explained by a higher reproducibility of volume measurements with respect to diameter (5,6,10). The inter-reader agreement for VBR* was significantly higher than for RECIST and VBR. This is in agreement with results published recently by Zhao et al. (13) and Kuhl et al. (12).

We have found a moderate agreement between VBR* and RECIST. However, this agreement is not a metric of performance because RECIST is not a gold standard for response. In fact, if a new response index provided exactly the same results as RECIST (k = 1), it would be completely useless. On the other hand, k = −1 is not desirable either, because this would mean that the new index says the opposite of RECIST, which in turn would mean that the proposed response index presents major drawbacks precluding its use in clinical trials. Therefore, k = 0.53 seems to be a good compromise because it implies neither perfect agreement nor completely opposed or random results.

Without consensus, the selection of lesions was noncompliant with RECIST requirements in 12% of cases. The existence of noncompliances has already been reported by Skougaard et al. (33), who found 46% of inter-reader discrepancies associated with nonconformities in the application of RECIST. In this study, 100% of evaluations were RECIST-compliant after consensus. This is an important aspect of the workflow explained in the section on the cloud-based setup system for quality control purposes.

Figure 7. Differences in assessment of the longest axial diameter (LAD) between readers. The figure shows the slice containing the LAD (blue line) and the short axis diameter (red line). (a) Reader #1, (b) Reader #2, and (c) independent reviewer.

In this article, we have applied cloud-based solutions for achieving a consensus on the selection of lesions. However, the proposed framework can be applied to the standardization of clinical trials in a broader sense. For example, the lesion segmentation could be performed only on specific images of the whole set of images corresponding to a given time point. Another example is the use of similar visualization settings (zoom, window level, and so forth) for all readers. The number of lesions used for computing response could also be controlled because we have observed in the section on the number of lesions that this is a source of variability. These standardizations could potentially reduce the variability of measurements and can easily be included as a part of the proposed workflow.

One limitation of this study is the low number of patients in the data set. This is the result of applying multiple image selection criteria when creating the data set, as described in the section on the data set. The idea was to remove the maximum number of sources of variability related to image characteristics to focus only on the effects of lesion selection. It is important to take into account that the statistical analyses were performed on 66 responses (three time points per patient evaluated by three readers), which is an acceptable number of statistical units for such analyses. For the purposes of a pilot study like this one, the number of patients included was quite convenient because it avoided the complexity associated with the management of large data sets and allowed the workflow to be tested promptly.

The consensus process necessarily increases the reviewing time. However, in some clinical centers, this additional time is spent anyway in discussions between radiologists and oncologists to agree on lesion selection. Another aspect to take into account is that, in general, physicians follow different and more time-consuming protocols when evaluating patients in the context of a clinical trial (use of specific software, different measurements, and filling of case report forms); therefore, the time invested in performing a consensus seems to be compatible with clinical trial protocols. Finally, this additional time may reduce the intersite variability, which in turn may eventually improve the quality of the generated statistical data.

In a real clinical context, several sites must be managed. The use of cloud solutions reduces the installation complexity because, differently from an LE approach, it is not necessary to perform specific installations at each site. However, to manage the consensus, additional DMs and IRs might be required. Clouds also allow simplifying the workflow and the communication between stakeholders. For example, reports can be signed electronically by the adjudicator and transmitted immediately to the sponsor, without the need to exchange reports in paper format. Another example is the update of the sponsor's database: once the radiologist finishes an evaluation, the site can connect directly to the sponsor's database to update it with the results for the corresponding patient. In this way, data are immediately available for analysis. The same idea is applicable to images: once a scan is available for a specific patient, it can be imported into the system through a computer terminal and then transmitted to the sponsor to be included in a central database. This solves the problem of sending images by using support materials like digital versatile disks (DVDs) or hard drives.

One of the main concerns with cloud solutions is confidentiality. To keep data confidential, the cloud infrastructure and data were centrally hosted in a data center managed by a vendor guaranteeing security and confidentiality. The data center provides early warning systems to identify potential attacks. These types of infrastructures are generally more secure than hospital infrastructures. Besides, our cloud service requires strong authentication to identify users and provides an audit trail to record the actions performed by authorized users. Our system is compliant with regulations such as 21 CFR Part 11 (FDA, electronic records and electronic signatures) and the International Conference on Harmonisation Good Clinical Practice guideline, to guarantee the application of security, confidentiality, and safety best practices. Finally, the communication between the client and the server hosted by the data center is encrypted over the Secure Sockets Layer/Transport Layer Security protocol. We are aware that cloud security is a major concern because data are transferred across the network, and we therefore use state-of-the-art technology to make data transfer as safe as possible. We think that, with the use of such technology, data are probably as well protected in a cloud infrastructure as in a local one.

CONCLUSIONS

In this article, we have set up a cloud-based system aiming at standardizing response evaluations in multicenter clinical trials. The use of cloud solutions proved to be an interesting and feasible strategy for standardizing response evaluations and reducing variability. The use of VBR increased the inter-reader agreement compared to RECIST, and it therefore seems to be a promising alternative for evaluating the response to a treatment. For VBR, the consensus improved both the inter-reader response agreement and the RECIST compliance, and it is therefore an interesting phase to be considered in a clinical trial protocol. Finally, the consensus on NLs was shown to play an important role because of the strong impact of NLs on the global response, and it therefore deserves special attention as a topic of future research.

REFERENCES

1. Globocan 2012: estimated incidence, mortality, and prevalence worldwide in 2012. International Agency for Research on Cancer; 2012. Available from: http://globocan.iarc.fr.

2. Therasse P, Arbuck SG, Eisenhauer EA, et al. New guidelines to evaluate the response to treatment in solid tumors. J Natl Cancer Inst 2000 Feb;92(3):205–216.

3. Eisenhauer EA, Therasse P, Bogaerts J, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer 2009 Jan;45(2):228–247.

4. Erasmus JJ, Gladish GW, Broemeling L, et al. Interobserver and intraobserver variability in measurement of non–small-cell carcinoma lung lesions: implications for assessment of tumor response. J Clin Oncol 2003;21(13):2574–2582.

5. Suzuki C, Jacobsson H, Hatschek T, et al. Radiologic measurements of tumor response to treatment: practical approaches and limitations. Radiographics 2008;28(2):329–344.

6. Nishino M, Jagannathan JP, Ramaiya NH, et al. Revised RECIST guideline version 1.1: what oncologists want to know and what radiologists need to know. AJR Am J Roentgenol 2010;195(2):281–289.

7. Buckler AJ, Mozley PD, Schwartz L, et al. Volumetric CT in lung cancer: an example for the qualification of imaging as a biomarker. Acad Radiol 2010 Jan;17(1):107–115.

8. Buckler AJ, Mulshine JL, Gottlieb R, et al. The use of volumetric CT as an imaging biomarker in lung cancer. Acad Radiol 2010 Jan;17(1):100–106.

9. Seyal AR, Parekh K, Velichko YS, et al. Tumor growth kinetics versus RECIST to assess response to locoregional therapy in breast cancer liver metastases. Acad Radiol 2014 May 12. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24833565.

10. Mozley PD, Bendtsen C, Zhao B, et al. Measurement of tumor volumes improves RECIST-based response assessments in advanced lung cancer. Transl Oncol 2012 Feb;5(1):19–25.

11. Quantitative Imaging Biomarkers Alliance. Available from: https://www.rsna.org/QIBA.aspx.

12. Kuhl CK, Barabasch A, Dirrichs T, et al. Target lesion selection as a source of variability of response classification by RECIST 1.1. 2013 ASCO Annual Meeting. J Clin Oncol 2013;31(suppl):abstr 11077.

13. Zhao B, Lee SM, Lee H-J, et al. Variability in assessing treatment response: metastatic colorectal cancer as a paradigm. Clin Cancer Res 2014 Jul 1;20(13):3560–3568.

14. Padhani AR, Ollivier L. The RECIST criteria: implications for diagnostic radiologists. Br J Radiol 2001 Nov;74(887):983–986.

15. Schuetze SM, Baker LH, Benjamin RS, et al. Selection of response criteria for clinical trials of sarcoma treatment. Oncologist 2008;13(Suppl 2):32–40.

16. Dodd LE, Korn EL, Freidlin B, et al. Blinded independent central review of progression-free survival in phase III clinical trials: important design element or unnecessary expense? J Clin Oncol 2008 Aug 1;26(22):3791–3796.

17. Shamsi K, Patt RH. Onsite image evaluations and independent image blinded reads: close cousins or distant relatives? J Clin Oncol 2009 Apr 20;27(12):2103–2104; author reply 2104–2105.

18. Ford R, Schwartz L, Dancey J, et al. Lessons learned from independent central review. Eur J Cancer 2009 Jan;45(2):268–274.

19. Zhang JJ, Chen H, He K, et al. Evaluation of blinded independent central review of tumor progression in oncology clinical trials: a meta-analysis. Ther Innov Regul Sci 2012;47(2):167–174.

20. Dodd LE, Korn EL, Freidlin B, et al. An audit strategy for progression-free survival. Biometrics 2011 Sep;67(3):1092–1099.

21. Amit O, Mannino F, Stone AM, et al. Blinded independent central review of progression in cancer clinical trials: results from a meta-analysis. Eur J Cancer 2011 Aug;47(12):1772–1778.

22. Armato SG, Meyer CR, McNitt-Gray MF, et al. The Reference Image Database to Evaluate Response to therapy in lung cancer (RIDER) project: a resource for the development of change-analysis software. Clin Pharmacol Ther 2008 Oct;84(4):448–456.

23. Herman GT, Zheng J, Bucholtz CA. Shape-based interpolation. IEEE Comput Graph Appl 1992;12.

24. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull 1971;76(5):378–382.

25. Barnhart HX, Haber MJ, Lin LI. An overview on assessing agreement with continuous measurements. J Biopharm Stat 2007;17(4):529–569.

26. Royston JP. An extension of Shapiro and Wilk's W test for normality to large samples. Appl Stat 1982;31:115–124.

27. Beaumont H, Labatte J-M, Souchet S, et al. Determination of thresholds for the assessment of response to therapy relying on volume-based tumor burden in pulmonary CT images. ASCO Meet Abstr 2013;31(15_suppl):e18508.

28. R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing; 2013. Available from: http://www.r-project.org.

29. Gamer M. Package irr; 2012. Available from: http://cran.r-project.org/web/packages/irr/.

30. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977 Mar;33(1):159–174.

31. Moskowitz CS, Jia X, Schwartz LH, et al. A simulation study to evaluate the impact of the number of lesions measured on response assessment. Eur J Cancer 2009 Jan;45(2):300–310.

32. Darkeh MHSE, Suzuki C, Torkzad MR. The minimum number of target lesions that need to be measured to be representative of the total number of target lesions (according to RECIST). Br J Radiol 2009 Aug;82(980):681–686.

33. Skougaard K, McCullagh MJD, Nielsen D, et al. Observer variability in a phase II trial—assessing consistency in RECIST application. Acta Oncol 2012 Jul;51(6):774–780.

