Evaluation of Diagnostic Tests
July 18, 2011
Introduction to Clinical Research: A Two-week Intensive Course
Milo A. Puhan, MD, PhD, Department of Epidemiology
Johns Hopkins Bloomberg School of Public Health
Today’s learning objectives
To describe the use of diagnostic tests in practice
To know about the phases of diagnostic test evaluation
To know an approach for designing randomized trials for diagnostic test evaluation
To know about biases that affect diagnostic test accuracy studies
Today’s key messages
Diagnostic tests are used for screening, as add-on tests, for triage, as replacement tests or for monitoring.
Diagnostic test evaluation includes (at least) test accuracy studies, health outcomes studies and cost-effectiveness studies.
Biases related to the spectrum of patients and to the reference standard affect estimates of diagnostic test accuracy most.
To ensure informative RCTs, identify the critical comparisons between the old and new test-treatment strategies.
Recommended books
Evidence Base of Clinical Diagnosis: Theory and Methods of Diagnostic Research, by Andre Knottnerus and Frank Buntinx (Editors). Publisher: John Wiley & Sons. Pub. date: November 2008. ISBN-13: 9781405157872
Evidence-Based Diagnosis, by Thomas B. Newman and Michael A. Kohn. Publisher: Cambridge University Press. Pub. date: 2009. ISBN: 978-0-521-71402-0
Part I: Stages of diagnostic test evaluation
Brain natriuretic peptide (BNP) for diagnosing heart failure
Toma et al. Cardiovascular Medicine 2007;10:27–33
Cut-offs between 15 and 100 pg/ml used
Diagnostic tests should reduce uncertainty
Example 1: BNP for diagnosing heart failure in patients with dyspnea in ER
Pre-test probability: 20% -> Diagnostic test: BNP negative (-) -> Post-test probability: 2%
Pretest probability: probability of target condition before testing
Posttest probability: probability of target condition after testing
[Figure: probability scales from 0% to 100% before and after the test]
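The step from pre-test to post-test probability can be sketched with the odds form of Bayes' theorem. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and the likelihood ratio is back-calculated from the 20% and 2% figures in the example rather than taken from a published source.

```python
def probability_to_odds(p: float) -> float:
    """Convert a probability (0 < p < 1) to odds."""
    return p / (1.0 - p)


def odds_to_probability(odds: float) -> float:
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def post_test_probability(pretest: float, likelihood_ratio: float) -> float:
    """Post-test odds = pre-test odds x likelihood ratio of the observed result."""
    return odds_to_probability(probability_to_odds(pretest) * likelihood_ratio)


def implied_likelihood_ratio(pretest: float, posttest: float) -> float:
    """Likelihood ratio that would move pretest to posttest (illustrative helper)."""
    return probability_to_odds(posttest) / probability_to_odds(pretest)


# Example 1: a negative BNP moves the probability of heart failure from 20% to 2%.
lr_negative = implied_likelihood_ratio(0.20, 0.02)
print(f"Implied negative likelihood ratio: {lr_negative:.3f}")
print(f"Post-test probability: {post_test_probability(0.20, lr_negative):.3f}")
```

The same `post_test_probability` function can be applied to any pre-test probability, which is why the usefulness of a test result depends on where the patient starts on the probability scale.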
Diagnostic tests (may) have an indirect and direct impact on health outcomes
Use of a test (BNP) in practice
Stage of clinical management | Setting | Available information | Purpose of test
Early diagnostic work-up | Primary care, ER | Patient history, physical exam | Screening, triage
Diagnosis not yet established | ER | + Chest X-ray, ECG | Add-on, replacement
Diagnosis established | ER, specialized care | + echocardiography | Prognostic information
Under treatment | Primary or specialized care | Diagnostic work-up + treatments | Monitor disease process
BNP used for triage to avoid unnecessary work-up
Heart failure among differential diagnoses after taking patient history and physical exam
Positive test result: heart failure work-up ± treatment
Negative test result: consider other diagnoses
Health outcomes
BNP used as add-on test
Heart failure suggested after patient history, physical exam, ECG and chest X-ray
Positive test result: heart failure work-up ± treatment
Negative test result: consider other diagnoses
Health outcomes
BNP used as replacement test for echocardiography
Heart failure suggested after patient history, physical exam, ECG and chest X-ray
Positive test result: treatment
Negative test result: consider other diagnoses
Health outcomes
BNP used as a prognostic marker
Predictors: age, gender, exacerbations of heart failure, dyspnea
Risk of 5-year mortality: 0-10%, >10-20%, >20-30%, >30%
+ BNP: Improved prediction by adding BNP?
BNP used for monitoring
Heart failure diagnosis and treatment established
Not in therapeutic range: adapt treatment
In therapeutic range: treatment unchanged
Test phases for diagnostic tests
Phase I: Investigates whether test results are different for patients ± disease
Phase II: Investigates whether patients with disease are more likely to have positive test results compared to patients without disease
Phase III: Investigates how well the test distinguishes between patients ± disease in patients suspected of having the disease
Phase IV: Investigates how informative a test is considering additional information available at the moment of testing
Phase V: Investigates whether using the test leads to better health outcomes
Phase VI: Investigates whether using the test leads to better health outcomes at acceptable costs
Phases of diagnostic test evaluation
Phase I: patients with heart failure vs healthy subjects
Phase II: patients with heart failure vs healthy subjects
                Heart failure   Healthy subjects
Test positive   90              20
Test negative   10              80
Sens: 90%  Spec: 80%
PPV: 82%  NPV: 89%
+LR: 4.5  -LR: 0.13
DOR: 36
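The accuracy measures reported for the Phase II table all follow from the four cell counts. The sketch below shows the arithmetic; the class and method names are illustrative choices and not part of the original slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of a 2x2 table of index test result versus disease status."""
    tp: int  # test positive, disease present
    fp: int  # test positive, disease absent
    fn: int  # test negative, disease present
    tn: int  # test negative, disease absent

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    def lr_positive(self) -> float:
        return self.sensitivity() / (1 - self.specificity())

    def lr_negative(self) -> float:
        return (1 - self.sensitivity()) / self.specificity()

    def dor(self) -> float:
        # Diagnostic odds ratio; undefined if fp or fn is zero.
        return (self.tp * self.tn) / (self.fp * self.fn)


# Phase II example: 100 patients with heart failure, 100 healthy subjects.
phase2 = TwoByTwo(tp=90, fp=20, fn=10, tn=80)
for name in ("sensitivity", "specificity", "ppv", "npv",
             "lr_positive", "lr_negative", "dor"):
    print(f"{name}: {getattr(phase2, name)():.3f}")
```

Note that the diagnostic odds ratio equals the positive likelihood ratio divided by the negative likelihood ratio, so the three summary measures are not independent.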
Phases of diagnostic test evaluation
Phase III: patients suspected of having disease
                Heart failure   No heart failure
Test positive   85              250
Test negative   15              650
Sens: 85%  Spec: 72%
PPV: 25%  NPV: 98%
+LR: 3.1  -LR: 0.21
DOR: 15
Phase IV: patients suspected of having disease
Age, gender, smoking and coronary heart disease status known
Probability of heart failure?
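One reason the predictive values differ so much between the Phase II and Phase III tables is prevalence: half of the Phase II participants have heart failure, against 100 of 1000 in Phase III. The following sketch, with a hypothetical function name, recomputes the predictive values from sensitivity, specificity and prevalence via Bayes' theorem.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a given sensitivity, specificity and prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Phase II table: sensitivity 90%, specificity 80%, prevalence 100/200.
ppv_phase2, npv_phase2 = predictive_values(0.90, 0.80, 0.50)

# Phase III table: sensitivity 85/100, specificity 650/900, prevalence 100/1000.
ppv_phase3, npv_phase3 = predictive_values(85 / 100, 650 / 900, 100 / 1000)

# Phase II accuracy applied at the Phase III prevalence (illustration only).
ppv_mixed, npv_mixed = predictive_values(0.90, 0.80, 0.10)

print(f"Phase II : PPV {ppv_phase2:.3f}, NPV {npv_phase2:.3f}")
print(f"Phase III: PPV {ppv_phase3:.3f}, NPV {npv_phase3:.3f}")
print(f"Phase II accuracy at 10% prevalence: PPV {ppv_mixed:.3f}, NPV {npv_mixed:.3f}")
```

Even with the Phase II sensitivity and specificity left unchanged, the PPV drops sharply at 10% prevalence, so predictive values from case-control designs should not be carried over to clinical populations.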
Phases of diagnostic test evaluation
Phase V
Randomized trial: patients suspected of having heart failure are randomized (R) to
- test (+ / -), then treat and follow-up
- no test, ± treat and follow-up
Outcome: health outcome
or
Before-after study: patients suspected of having heart failure
- until 2000: ± treat and follow-up
- from 2001: test (+ / -), then treat and follow-up
Outcome: health outcome
Phases of diagnostic test evaluation
Phase VI
Randomized trial: patients suspected of having heart failure are randomized (R) to
- test (+ / -), then treat and follow-up
- no test, ± treat and follow-up
Outcome: health outcome + costs
Outcomes for phase V and VI studies
Study designs for the evaluation of diagnostic tests
Phase I: Cross-sectional case-control study (pro- or retrospective)
Phase II: Cross-sectional case-control study (pro- or retrospective)
Phase III: Cross-sectional study of patients suspected of having disease (pro- or retrospective)
Phase IV: Cross-sectional study of patients suspected of having disease
Phase V: Randomized trial or before-after study
Phase VI: Cost-effectiveness study
Systematic review
Test phases are not well established for diagnostic studies
Clinical trials
Phase 1: Safety (maximum tolerated dose), pharmacokinetics
Phase 2: Preliminary efficacy, dose response
Phase 3: Efficacy, clinically relevant effects
Phase 4: Safety surveillance
Diagnostic studies
19 models have been proposed…
Lijmer et al. Med Decis Making 2009; 29; E13
Synthesis of models of diagnostic test evaluation phases
Lijmer et al. Med Decis Making 2009; 29; E13
Technical requirements
Test accuracy
Effects on decisions
Effects on patient outcomes
Effects on health care system
Part II: Biases in diagnostic test accuracy studies
The diagnostic test accuracy study
Aim: To obtain (unbiased) estimates of diagnostic test accuracy such as sensitivity, specificity, likelihood ratios, etc.
Study design: Cross-sectional study with patients suspected of having disease (phase III)
Patients suspected of having disease receive both the index test and the reference test
                Patients with disease   Patients without disease
Test positive   85                      250
Test negative   15                      650
The “perfect” diagnostic test accuracy study
Patients suspected of having disease
- Spectrum of patients adequate for setting and intention-to-diagnose
- Prospective data collection
- Consecutive recruitment
- Referral filter described (setting)
- Prior information available (comprehensive ascertainment of patient history, exam, tests)
Index test
- Aim of test (triage, replacement, add-on)
- Well defined protocol
- Well defined threshold
- Performed for all patients
- Maximized reliability (intra- and inter-rater)
- Blinded towards reference test
Reference test
- Good measure of target condition
- Well defined protocol
- Performed for all patients
- Performed at same time as index test
- Maximized reliability (intra- and inter-rater)
- Blinded towards index test
Sources of bias in diagnostic test accuracy studies
Sources of variability in diagnostic test accuracy studies
The same three elements apply to both: patients suspected of having disease, index test and reference test, with the criteria listed for the “perfect” study.
Bias from definition or recruitment of population - Spectrum bias
Patients suspected of having disease
- Spectrum of patients adequate for setting and intention-to-diagnose
- Prospective data collection
- Consecutive recruitment
- Referral filter described
- Prior information available (comprehensive ascertainment of patient history, exam, tests)
Setting: Outpatient cardiology clinic
Patients: referred from primary care with suspected new heart failure
[Figure: number of patients by clinical manifestations, from none to very severe]
[Figure: BNP levels by clinical manifestations, from none to very severe]
Bias from definition or recruitment of population - Spectrum bias
          Heart failure   No heart failure
Test +    9               8
Test -    2               23
Sens: 82%  Spec: 74%
Bias from definition or recruitment of population - Spectrum bias
          Heart failure   No heart failure
Test +    5               2
Test -    1               10
Sens: 83%  Spec: 83%
Healthy controls frequently included in DTA studies
Eissa et al. J Urol 2010;183:493-498
Often strong conclusions despite being phase II studies
Mao et al. Gut 2010;59:1687-1693
Empirical Evidence of Design-Related Bias in Studies of Diagnostic Tests
Lijmer, J. G. et al. JAMA 1999;282:1061-1066.
Meta-analyses of at least 5 test accuracy studies
PubMed, EMBASE, DARE, Cochrane
18 meta-analyses found including 193 studies
Quality assessment + diagnostic odds ratio (DOR)
2x2 table (test positive row: 80, 50; test negative row: 20, 450)
$$\mathrm{DOR} = \frac{80 \times 450}{20 \times 50} = 36$$
Association of quality of studies with diagnostic odds ratio
$$\text{relative DOR} = \frac{\mathrm{DOR}_{\text{low quality}}}{\mathrm{DOR}_{\text{high quality}}}$$
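The diagnostic odds ratio and the relative DOR can be computed as in the sketch below. The 2x2 counts are the ones shown above; the DOR of 18 for a high-quality study is a made-up number used only to illustrate the ratio, and the function names are hypothetical.

```python
def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """DOR = (TP x TN) / (FP x FN); raises ZeroDivisionError if FP or FN is zero."""
    return (tp * tn) / (fp * fn)


def relative_dor(dor_low_quality: float, dor_high_quality: float) -> float:
    """Values above 1 mean the lower-quality studies report higher accuracy."""
    return dor_low_quality / dor_high_quality


dor_example = diagnostic_odds_ratio(tp=80, fp=50, fn=20, tn=450)
print(f"DOR: {dor_example:.1f}")

# Hypothetical comparison: a study lacking a design feature reports a DOR of 36,
# a study with that feature reports a DOR of 18 (assumed value).
print(f"Relative DOR: {relative_dor(dor_example, 18.0):.1f}")
```

A relative DOR of 2 in this toy comparison would mean that the lower-quality design overestimates the diagnostic odds ratio twofold.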
Lijmer, J. G. et al. JAMA 1999;282:1061-1066.
Relative Diagnostic Odds Ratios of the 9 Study Characteristics
Bias from definition or recruitment of population - Prospective vs retrospective
- No verification of disease status
- Validated BNP measurements?
- Same reference standard for all?
- Same assessors for reference standard for all?
Bias from index test - Test review bias (blinding)
Index test
- Aim of test (triage, replacement, add-on)
- Well defined protocol
- Well defined threshold
- Performed for all patients
- Maximized reliability (intra- and inter-rater)
- Blinded towards reference test
[Figure: heart sound level, blinding vs no blinding]
Biases from reference test
Reference test
- Good measure of target condition
- Well defined protocol
- Performed for all patients
- Performed at same time as index test
- Maximized reliability (intra- and inter-rater)
- Blinded towards index test
Challenges for verification of disease status by reference test
- Partial verification bias
- Differential verification bias
- Disease progression bias
- Incorporation bias
- Blinding - Diagnosis review bias
Biases from reference test
How to define heart failure?
- Clinical criteria?
- Echocardiography?
- Response to treatment?
Disease progression bias
Patients suspected of having disease: index test, then delayed reference test
Delay of reference test
- Disease status may have changed
- Particularly problematic for acute diseases (infections)
- Problematic for chronic diseases if prognostic criteria used as reference standard
Partial verification bias
- Reference test not performed in some patients at random
- No verification of disease status in patients with lower disease probability
Partial verification bias
Complete verification
          Heart failure   No heart failure
Test +    9               8
Test -    2               23
Sens: 82%  Spec: 74%
Partial verification (disease status unclear in 8 patients)
          Heart failure   Unclear   No heart failure
Test +    7               4         6
Test -    2               4         19
Complete case analysis: Sens: 78%  Spec: 76%
If uncorrelated to Test +/-: tends to underestimate accuracy
- prevalence low: sensitivity more affected
- prevalence high: specificity more affected
Partial verification bias
          Heart failure   Unclear   No heart failure
Test +    7               4         6
Test -    2               4         19
Unclear cases reassigned (Test +: all 4 to heart failure; Test -: all 4 to no heart failure)
          Heart failure   No heart failure
Test +    11              6
Test -    2               23
Sens: 85%  Spec: 79%
Add prognostic criterion: Was heart failure diagnosed at later stage?
Unclear cases reassigned (Test +: 3 to heart failure, 1 to no heart failure; Test -: 1 to heart failure, 3 to no heart failure)
          Heart failure   No heart failure
Test +    10              7
Test -    3               22
Sens: 77%  Spec: 76%
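The effect of the different ways of handling unverified patients can be reproduced from the counts above. The sketch below uses hypothetical helper names; it compares a complete case analysis with two reassignments of the 8 unclear patients, and it is not one of the formal statistical correction methods.

```python
def sens_spec(tp: int, fp: int, fn: int, tn: int):
    """Return (sensitivity, specificity) from 2x2 counts."""
    return tp / (tp + fn), tn / (tn + fp)


# Verified patients only (complete case analysis).
VERIFIED = {"tp": 7, "fp": 6, "fn": 2, "tn": 19}
UNCLEAR_POS = 4  # test positive, disease status unclear
UNCLEAR_NEG = 4  # test negative, disease status unclear


def reassign(pos_to_disease: int, neg_to_disease: int) -> dict:
    """Add unclear patients to the verified table.

    pos_to_disease: unclear test-positive patients classified as diseased
    neg_to_disease: unclear test-negative patients classified as diseased
    """
    return {
        "tp": VERIFIED["tp"] + pos_to_disease,
        "fp": VERIFIED["fp"] + (UNCLEAR_POS - pos_to_disease),
        "fn": VERIFIED["fn"] + neg_to_disease,
        "tn": VERIFIED["tn"] + (UNCLEAR_NEG - neg_to_disease),
    }


complete_case = sens_spec(**VERIFIED)
by_index_test = sens_spec(**reassign(pos_to_disease=4, neg_to_disease=0))
by_prognosis = sens_spec(**reassign(pos_to_disease=3, neg_to_disease=1))

for label, (sens, spec) in [("Complete case", complete_case),
                            ("Reassigned by index test result", by_index_test),
                            ("Reassigned by prognostic criterion", by_prognosis)]:
    print(f"{label}: sensitivity {sens:.3f}, specificity {spec:.3f}")
```

Reassigning unclear patients according to their index test result can only make the test look better, which is why an independent criterion such as later diagnosis is preferable.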
Methods to correct for partial and differential verification bias
Rutjes et al. Health Tech Ass 2007; Vol. 175, 1605-1612
Incorporation bias
Diagnosis of multiple sclerosis
Index test: MRI
Reference test: Clinical follow-up, cerebrospinal fluid + MRI
Solutions to minimize bias from reference standard
Source of bias: Case definition; intra- and inter-rater reliability
Solutions: Double/triple reading; adjudication committee (expert panels)
Source of bias: Partial verification bias; differential verification bias
Solutions: Ensure quality control for data collection; use “realistic” reference tests; foresee missing data and consider prognostic criteria; statistical methods
Source of bias: Incorporation bias and diagnosis review bias
Solutions: Appropriate case definition; ensure blinding
Source of bias: Disease progression bias
Solutions: No relevant time delay between index and reference test
Applies to all: Learn from previous studies; do systematic review; write protocol; take time and find consensus
Part III: Randomized trials for diagnostic test evaluation
Phases of diagnostic test evaluation
Phase V
Randomized trial: patients suspected of having heart failure are randomized (R) to
- test (+ / -), then treat and follow-up
- no test, ± treat and follow-up
Outcome: health outcome
The critical questions when assessing patient outcomes
What is the intended incremental value of the test on outcomes (short- and long-term patient outcomes and costs)?
What type of evidence is needed to assess this incremental value?
Recommended approach
Define the purpose of the test
Display the existing test-treatment strategy
Display the new test-treatment strategy
Identify the critical comparison to assess the incremental value
Assess whether existing evidence suffices or if RCTs are required
Lord et al. Med Dec Making 2009;29:E1
Test-treatment strategy for replacement tests
Target population, prior tests
Existing strategy: existing test -> test result -> management
- Test pos pathway: TP, FP
- Test neg pathway: TN, FN
New strategy: new test -> test result -> management
- Test pos pathway: TP, FP
- Test neg pathway: TN, FN
Both strategies lead to patient outcomes
Critical comparisons
- Test safety & other attributes?
- Sensitivity and specificity?
- Change in management?
- Treatment effects?
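How many patients end up in each pathway under the existing and the new test can be sketched as follows. All numbers in this example (cohort size, prevalence, sensitivities and specificities) are hypothetical values chosen for illustration; none come from the slides.

```python
def pathway_counts(n: int, prevalence: float,
                   sensitivity: float, specificity: float) -> dict:
    """Expected numbers of TP, FP, TN and FN in a cohort of n tested patients."""
    diseased = n * prevalence
    healthy = n - diseased
    tp = diseased * sensitivity
    tn = healthy * specificity
    return {"TP": tp, "FP": healthy - tn, "TN": tn, "FN": diseased - tp}


# Hypothetical cohort of 1000 patients with 10% prevalence.
existing = pathway_counts(1000, 0.10, sensitivity=0.80, specificity=0.90)
new = pathway_counts(1000, 0.10, sensitivity=0.90, specificity=0.90)

# Patients whose pathway changes when the new test replaces the existing one.
shift = {cell: new[cell] - existing[cell] for cell in existing}

print("Existing test:", existing)
print("New test:     ", new)
print("Difference:   ", shift)
```

Under these assumed values only the diseased patients who move from the test-negative to the test-positive pathway are managed differently, so the critical comparison concerns the treatment effect in that group.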
Example: Liquid-based cytology (LBC) to replace Pap smear for cervical cancer screening in order to reduce repeated testing (poor Pap smear quality)
Target population: Pap smear vs LBC
- Test procedure identical for women
- Reference standard for both: biopsy
- Systematic reviews (SR) show: sensitivity and specificity very similar
- No change in management (test pos pathway, test neg pathway, test again)
- Treatment effects not different
Patient outcomes
RCT to compare short-term effects from testing
No long-term RCT needed
Test-treatment strategy for add-on tests
Target population, prior tests
Existing strategy: existing test -> test result -> management
- Test pos: pathway A
- Test neg: pathway B
New strategy: existing test -> test result
- Test pos: pathway A
- Test neg: add-on test -> test result -> management
  - Test pos: pathway A*
  - Test neg: pathway B*
Both strategies lead to patient outcomes
Critical comparisons
- Test safety & other attributes?
- Sensitivity & specificity
- Treated populations?
- Treatment effects?
Example: MRI as add-on test to mammography and ultrasound in breast cancer screening to detect extra cases and inform decision on type of surgery
Target population, prior tests
Existing strategy: mammography + ultrasound -> test result -> management
- Test pos: BCS (breast-conserving surgery) or mastectomy
- Test neg: continue screening
New strategy: mammography + ultrasound -> test result
- Test pos: BCS or mastectomy
- Test neg: MRI -> test result -> management
  - Test pos: BCS or mastectomy*
  - Test neg: continue screening
Critical comparisons
- MRI more sensitive: detects more cases and multifocal disease
- Treated populations different
- Treatment effects unclear
Patient outcomes
RCT to compare short-term effects from testing
RCT to compare long-term effects of different treatments (different surgery and populations)
Test-treatment strategy for triage tests
Target population, prior tests
Existing strategy: existing test -> test result -> management
- Test pos: pathway A
- Test neg: pathway B
New strategy: triage test -> test result
- Test neg: pathway B* (TN, FN)
- Test pos: add existing test -> test result -> management
  - Test pos: pathway A
  - Test neg: pathway B
Both strategies lead to patient outcomes
Critical comparisons
- Test safety & other attributes?
- Sensitivity & specificity
- Treated populations?
- Treatment effects?
Example: Triage D-Dimer test to reduce the number of ultrasounds in patients at low risk for DVT
Target population, prior tests
Existing strategy: ultrasound -> test result -> management
- Test pos: pathway A
- Test neg: pathway B
New strategy: D-Dimer -> test result
- Test neg: pathway B*
- Test pos: add ultrasound -> test result -> management
  - Test pos: pathway A
  - Test neg: pathway B
Critical comparisons
- Patient convenience
- D-Dimer sensitivity >98%
- Same treated populations
- Same treatment effects
Patient outcomes
RCT to compare short-term effects from testing
RCT to compare long-term effects may not be necessary
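The trade-off behind a triage strategy, fewer ultrasounds against a small number of missed cases, can be sketched numerically. In the example below the sensitivity of 0.98 is the lower bound stated in the example; the cohort size, the 5% prevalence and the 50% specificity are assumptions made only for illustration, and the function name is hypothetical.

```python
def triage_summary(n: int, prevalence: float,
                   triage_sens: float, triage_spec: float) -> dict:
    """Expected workload and missed cases when a triage test precedes the existing test."""
    diseased = n * prevalence
    healthy = n - diseased
    # Only triage-positive patients go on to the existing test (ultrasound).
    triage_positive = diseased * triage_sens + healthy * (1 - triage_spec)
    return {
        "ultrasounds_needed": triage_positive,
        "ultrasounds_avoided": n - triage_positive,
        "dvt_missed_by_triage": diseased * (1 - triage_sens),
    }


# Assumed low-risk cohort: 1000 patients, 5% with DVT, D-Dimer specificity 50%.
summary = triage_summary(1000, prevalence=0.05, triage_sens=0.98, triage_spec=0.50)
for key, value in summary.items():
    print(f"{key}: {value:.0f}")
```

With these assumed inputs almost half of the ultrasounds are avoided at the cost of about one missed DVT per 1000 patients; a lower sensitivity or a higher prevalence would shift that balance.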
Use of randomized trials for test evaluation
Define the purpose of the test
Display the existing test-treatment strategy
Display the new test-treatment strategy
Identify the critical comparison to assess the incremental value
Assess whether existing evidence suffices or if RCTs are required
Today’s key messages
Diagnostic tests are used for screening, as add-on tests, for triage, as replacement tests or for monitoring.
Diagnostic test evaluation includes (at least) test accuracy studies, health outcomes studies and cost-effectiveness studies.
Biases related to the spectrum of patients and to the reference standard affect estimates of diagnostic test accuracy most.
To ensure informative RCTs, identify the critical comparisons between the old and new test-treatment strategies.