1
Study Design and Hypothesis Testing in Clinical Research
Jonathan J. Shuster, Ph.D.
Research Professor of Biostatistics
Univ. of Florida, College of Medicine
2
Take-home Messages
• Rely on Evidence-Based Medicine. Conventional wisdom can easily lead us astray.
• The objective of Statistics is to make informed inferences about a population, based on a sample. It is imperative to quantify the uncertainty.
• The P-value is a quantity that allows us to infer something about whether a scientific hypothesis is false.
• Non-significant results are inconclusive.
• Randomization and intent-to-treat are vital components in sound clinical research.
3
4
Topics
1. Motivating Evidence-Based Clinical Studies
2. Objective of Statistics
3. Hypothesis testing and P-values
4. Real Examples and their lessons
5
6
1. Motivating Evidence-Based Medicine
• A coin is “loaded”, with a 70% chance of landing heads. One player picks a three outcome sequence (e.g. HTH), then the other picks a different sequence. Whoever’s sequence comes up first is the winner.
• Do you want to choose first, and if so, what sequence do you select?
7
Evidence-Based Medicine
• So you decided to go first and pick HHH, right?
• OK, I pick THH.
• HHH can only occur before THH if it is on the first three flips. (If the first time HHH occurs is flips 6, 7, 8, then flip 5 is T, so flips 5, 6, 7 are THH and I win.) I make your first 2 my last 2, so I tend to stay ahead.
• Your chance of winning = 0.7^3 = 0.343 (34.3%)
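A skeptic can check the 34.3% figure by simulation. The sketch below is illustrative only: the function names, the number of games, and the random seed are assumptions, not part of the original talk.

```python
import random

def first_to_appear(seq_a, seq_b, p_heads, rng):
    """Flip a biased coin until seq_a or seq_b shows up; return the winner."""
    window = ""
    while True:
        window += "H" if rng.random() < p_heads else "T"
        window = window[-3:]  # only the last three flips matter
        if window == seq_a:
            return seq_a
        if window == seq_b:
            return seq_b

def estimate_win_rate(seq_a, seq_b, p_heads=0.7, n_games=100_000, seed=2006):
    """Fraction of simulated games in which seq_a appears before seq_b."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so the run is repeatable
    wins = sum(
        first_to_appear(seq_a, seq_b, p_heads, rng) == seq_a
        for _ in range(n_games)
    )
    return wins / n_games

hhh_rate = estimate_win_rate("HHH", "THH")
print(f"HHH beats THH in about {hhh_rate:.1%} of games (theory: {0.7 ** 3:.1%})")
```

The simulated rate should sit close to 0.7^3, since HHH can only win when the first three flips are all heads.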
8
Evidence-Based Medicine
• Lesson from this example.
• Things are not always what they seem. You need to be a healthy skeptic.
• Reference: Shuster, J. A two-player coin game paradox in the classroom. American Statistician, 2006(Feb), vol 60, pp 68-70.
9
10
2. Objective of Statistics
• To make an inference about a defined target population from a representative sample.
• That is, for us, to start from a medical hypothesis about a medical condition, help design a study that can collect data to test the question, and draw conclusions. Quantifying the uncertainty about the inference is a key part.
11
2. Comment on This
• Should we compare treatment groups statistically in a randomized study with respect to baseline parameters (e.g. age, gender, ethnicity, blood pressure)?
12
2. Provenzano: Clin J Am Soc Nephrol 4, 386-93, 2009
• “Baseline characteristics were similar except for more men in the oral iron group compared with the ferumoxytol group (62.9% versus 50.0%, P = 0.04). Mean baseline laboratory measures were similar between the two treatment groups.”
13
2. Comment on This
• For hypothesis-driven research, should we test for normality before using a t-test, and if we reject normality, try to transform the data?
14
Nissen Article
• JAMA. 2008;299(13):1561-1573. Comparison of Pioglitazone vs Glimepiride on Progression of Coronary Atherosclerosis in Patients With Type 2 Diabetes
• ‘For continuous variables with a normal distribution, the mean and 95% confidence intervals (CIs) are reported. For variables not normally distributed, median and interquartile ranges are reported and 95% CIs around median changes were computed using bootstrap resampling.’ (N=273 vs 270 in groups)
15
2. Testing Assumptions
Diagnostic Test
Passes Fails
16
17
3. Testing a Hypothesis (P-Value)
• Put a statement on Trial: “Null Hypothesis”
• ISIS #2 (Second International Study of Infarct Survival): The five-week mortality rates for Streptokinase and Placebo are equivalent in patients with recent MIs.
• Results: Strep (791/8592 = 9.2%) vs. Plac (1029/8595 = 12.0%)
18
3. P-Value
• P = 3.8 × 10^-9
• If you replicated the experiment in a population where the null hypothesis was true, there is a 3.8 in a billion chance of seeing a difference at least as extreme in either direction (2-sided)
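The slides do not say which test produced this P-value. As a rough check, a pooled two-proportion z-test (an assumption about the method, chosen here because it needs only the standard library) gives a two-sided P-value of the same order of magnitude from the ISIS #2 counts.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided P-value)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Under the null hypothesis both groups share one event rate.
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # erfc(|z| / sqrt(2)) equals the two-sided standard normal tail area.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Deaths / patients: Streptokinase 791/8592, Placebo 1029/8595.
z_isis2, p_isis2 = two_proportion_z_test(791, 8592, 1029, 8595)
print(f"z = {z_isis2:.2f}, two-sided P = {p_isis2:.1e}")
```

A different method (for example an exact test) would give a slightly different number, but the conclusion does not depend on that choice here.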
19
3. ISIS #2 Reference
• ISIS #2 Collaborative Group. (1988) Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of acute myocardial infarction: ISIS 2, Lancet 2: 349-360.
20
3. P-Value and Proof by Contradiction
• What is the probability that, if you replicated your experiment in a target population where your null hypothesis is true, you would see differences at least as extreme as what you actually observed? If this value (the P-value) is small, it is evidence against the null hypothesis.
• The analogy is “beyond a reasonable doubt”. Science arbitrarily uses 5% as “reasonable” doubt in most cases.
21
3. Was this overkill in terms of sample size?
• Suppose the results were 79/859 vs. 103/860 (same percentages of 9.2% vs. 12.0% but with one tenth the sample size).
• Now P=0.071 (7.1%), which would not be statistically significant. Would we be using this clot buster today? It was the biostatistician Sir Richard Peto who determined this sample size.
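To see how the same percentages lose significance at one tenth the sample size, the sketch below runs a two-sided Fisher exact test on both sets of counts. Fisher's test is an assumption here (the slides do not name the method), and the helper names are illustrative.

```python
import math

def log_hypergeom_pmf(k, n_a, n_b, total_events):
    """Log-probability of k events in group A given fixed table margins."""
    def log_comb(n, r):
        return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
    return (log_comb(n_a, k) + log_comb(n_b, total_events - k)
            - log_comb(n_a + n_b, total_events))

def fisher_exact_two_sided(events_a, n_a, events_b, n_b):
    """Sum the probabilities of all tables no more likely than the observed one."""
    total_events = events_a + events_b
    lowest = max(0, total_events - n_b)
    highest = min(n_a, total_events)
    log_observed = log_hypergeom_pmf(events_a, n_a, n_b, total_events)
    p_value = 0.0
    for k in range(lowest, highest + 1):
        log_p = log_hypergeom_pmf(k, n_a, n_b, total_events)
        if log_p <= log_observed + 1e-9:  # small tolerance for rounding
            p_value += math.exp(log_p)
    return min(p_value, 1.0)

p_small = fisher_exact_two_sided(79, 859, 103, 860)     # one tenth the size
p_full = fisher_exact_two_sided(791, 8592, 1029, 8595)  # actual ISIS #2 counts
print(f"One-tenth sample: P = {p_small:.3f}; full sample: P = {p_full:.1e}")
```

The effect size is identical in both calls; only the amount of evidence changes, which is the point of the sample size question.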
22
3. ISIS #2:
• Any other questions about the study?
23
3. ISIS #2 Issues
• Who was watching the store? Accrual took 3.5 years, and the outcome was known for each patient within five weeks.
• Always report a sample size justification in your papers (Provenzano, slide 12, did not).
24
4. Real Example
• Coronary Drug Project
25
The Coronary Drug Project Research Group (1980)
• Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. NEJM 303: 1038-1041.
• Double blind randomized study of Clofibrate vs. Placebo in men who had prior MI.
26
Compliers vs. Not on Drug
Coronary Drug Project
[Bar chart: 5-year mortality (%), C_Drug vs. NC_Drug]
27
Compliers vs. Not
28
Drug vs. Placebo
29
Coronary Drug Project Take home Message
What can this study teach us about Clinical Studies?
30
Intent-to-Treat
• The gold standard for analyzing randomized clinical trials is Intent-to-treat. Patients are analyzed in the groups they were assigned to, irrespective of what they actually received.
31
32
4. Real UF Example:
• Effectiveness of Nesiritide on Dialysis or All-Cause Mortality in Patients Undergoing Cardiothoracic Surgery. Clinical Cardiology. 2006 Jan;29(1):18-24 (with T. Beaver et al.).
• Motivation: The impression at Shands was that it was harmful and costly.
33
4. Nesiritide Example
• Study Null Hypothesis: The 20-day death/dialysis rate in patients getting nesiritide within two days of surgery is the same as in “similar” patients not getting it.
• Design Suggestions?
34
4. Possible Designs (+/-)
• Observational: Historical control. Compare the period before the drug with the period after it started to be given to a sizable fraction of patients (leaving a gap during the ramp-up of use). Must include all comers and use electronic chart review.
• Observational: Compare those getting to those not getting the drug.
• Randomized controlled prospective trial
35
4. Sources of Variation
• Within treatments, why might we not get the same result for every patient?
• Historical Control?
• Comparing concurrent nesiritide vs. not?
• Randomized prospective trial?
36
4. Sources of Bias (Confounders)
• Why might we see differences that might be totally unrelated to the treatment (nesiritide vs. not)?
• Historical Control?
• Comparing concurrent nesiritide vs. not?
• Randomized prospective trial?
37
4. Nesiritide: Propensity Scoring
• Actual Design: Compared Nesiritide vs. Not by Propensity Score Matching.
• Using 12 key covariates, we estimated the probability that a patient would get Nesiritide given these covariates. Then we matched the nesiritide patients to non-nesiritide patients for the propensity, and did a matched analysis.
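The matching step can be sketched as follows. This is a minimal illustration, not the study's actual procedure: it assumes the propensity scores have already been estimated, uses greedy 1:1 nearest-neighbor matching with a caliper (the slides do not say which matching rule was used), and all patient ids and scores are hypothetical.

```python
def greedy_match(treated, controls, caliper=0.05):
    """Pair each treated patient with the nearest unused control within the caliper.

    treated, controls: dicts mapping a patient id to an estimated propensity score.
    Returns a list of (treated_id, control_id) pairs; unmatched patients are dropped.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

# Hypothetical propensity scores, for illustration only.
treated_scores = {"T1": 0.62, "T2": 0.35, "T3": 0.90}
control_scores = {"C1": 0.33, "C2": 0.60, "C3": 0.10, "C4": 0.66}
matched_pairs = greedy_match(treated_scores, control_scores)
print(matched_pairs)
```

In this toy data T3 has no control within the caliper and is left unmatched, which shows the usual trade-off: a tighter caliper gives closer pairs but fewer of them. The outcome comparison is then made within the matched pairs.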
38
4. Conclusions
• Nesiritide showed no significant difference (inconclusive) within CABG patients.
• Nesiritide showed promise in aneurysm subjects with baseline elevated SCR, but was inconclusive in other such patients.
• Run a future randomized double-blind trial in aneurysms with elevated SCR. (Just completed and close to being in press, with an inconclusive result.)
39
4. Conclusion (continued)
• Note that the Shands study data were very important in designing the randomized follow-up study, in terms of the number of subjects needed (power analysis).
40
Take-home Messages
• Rely on Evidence-Based Medicine. Conventional wisdom can easily lead us astray.
• The objective of Statistics is to make informed inferences about a population, based on a sample. It is imperative to quantify the uncertainty.
• The P-value is a quantity that allows us to infer something about whether a scientific hypothesis is false.
• Non-significant results are inconclusive.
• Randomization and intent-to-treat are vital components in sound clinical research.
41
Design One Together
• Medical Question: Does Caffeine Withdrawal cause Headaches?
42
Eligibility
43
Design
• What are the sources of variation besides caffeine consumption?
• How do we control caffeine consumption?
• Should we use deception—hide purpose of study? Is this ethical?
44
Design
• Pre-Post?
• Double Blind Parallel Study?
• Double Blind Crossover Study?
45
Forensics for Irregularity
Phenylephrine
46
Phenylephrine Crossover Studies
47
Phenylephrine (Baseline NAR)
Study (10 mg vs Placebo)   Std Dev   CV = 100 × SD/Mean
1 (N=16) (EB)              2.0       15.3%
2 (N=10) (EB)              0.9       6.7%
3 (N=16)                   7.8       36.3%
4 (N=15)                   9.5       35.6%
5 (N=16)                   6.2       29.3%
6 (N=16)                   9.8       40.4%
7 (N=14)                   9.4       35.3%
48
How do we test for Data Irregularities?
• Background: Baseline NAR (Nasal Airway resistance) measures are typically xx.x (e.g. 20.2), and are always based on the mean of 10 observations (5 from each nostril).
• What null hypothesis can we test to find potential irregularities? What P-value might we use to declare significance?
49
Baseline Last Digit (3rd sign)
Digit   Study 1   Study 2
0       2         5
1       4         2
2       2         1
3       6         9
4       2         4
5       23        7
6       8         5
7       9         10
8       3         3
9       5         4
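One candidate null hypothesis is that every last digit (0 to 9) is equally likely. The sketch below applies a chi-square goodness-of-fit test to the counts in the table. The choice of test and the function names are assumptions, and with expected counts this small the chi-square approximation is only rough.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed-form series)."""
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    chi = math.sqrt(x)
    term, total = chi, 0.0  # term holds chi^(2r-1) / (1*3*...*(2r-1))
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * density * total

def last_digit_test(counts):
    """Chi-square test that all last digits are equally likely."""
    expected = sum(counts) / len(counts)
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, chi_square_sf(stat, len(counts) - 1)

study1 = [2, 4, 2, 6, 2, 23, 8, 9, 3, 5]   # counts for last digits 0..9
study2 = [5, 2, 1, 9, 4, 7, 5, 10, 3, 4]
stat1, p1 = last_digit_test(study1)
stat2, p2 = last_digit_test(study2)
print(f"Study 1: chi-square = {stat1:.1f}, P = {p1:.1e}")
print(f"Study 2: chi-square = {stat2:.1f}, P = {p2:.3f}")
```

Because a forensic screen like this may be run on many studies and many digits, a threshold stricter than 5% is worth considering before calling a result an irregularity.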
50
• Thank You!!
51
Coronary Drug Project Data
Five Year Mortality (Clofibrate)
• Compliers: 15.0% (15.7%) (N=708)
• Non-Compliers: 24.6% (22.5%) (N=357)
• Compliers took >80% of their meds to death or to 5 years whichever was first.
• In () is 5 year mortality, adjusted for prognostic factors.
52
Coronary Drug Project
Five Year Mortality (Placebo)
• Compliers: 15.1% (16.4%) (N=1813)
• Non-Compliers: 28.2% (25.8%) (N=882)
• Compliers took >80% of their meds to death or to 5 years whichever was first.
• In () is 5 year mortality, adjusted for prognostic factors.
53
Coronary Drug Project
Five-year mortality (As randomized)
• Clofibrate: 20.0% (N=1103)
• Placebo: 20.9% (N=2789)
• NB: Compliance could not be assessed in a small number of patients.
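Putting the three data slides side by side makes the intent-to-treat lesson concrete. The sketch below only rearranges the unadjusted percentages reported above; the variable names are illustrative.

```python
# Unadjusted five-year mortality (%) as reported on the data slides.
mortality = {
    ("clofibrate", "complier"): 15.0,
    ("clofibrate", "non-complier"): 24.6,
    ("placebo", "complier"): 15.1,
    ("placebo", "non-complier"): 28.2,
}
as_randomized = {"clofibrate": 20.0, "placebo": 20.9}

# Difference between non-compliers and compliers within each arm.
clofibrate_gap = (mortality[("clofibrate", "non-complier")]
                  - mortality[("clofibrate", "complier")])
placebo_gap = (mortality[("placebo", "non-complier")]
               - mortality[("placebo", "complier")])

# Intent-to-treat comparison: drug vs. placebo, as randomized.
itt_difference = as_randomized["clofibrate"] - as_randomized["placebo"]

print(f"Non-compliers minus compliers, clofibrate: {clofibrate_gap:.1f} points")
print(f"Non-compliers minus compliers, placebo:    {placebo_gap:.1f} points")
print(f"Clofibrate minus placebo, as randomized:   {itt_difference:.1f} points")
```

The compliance gap is at least as large on placebo as on the drug, so it cannot be a drug effect; compliers differ from non-compliers in ways that affect survival. The as-randomized comparison shows a difference of less than one percentage point.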