Department of Biostatistics
Volkert Siersma [email protected]
The Research Unit for General Practice in Copenhagen
Basic Statistics and Data Analysis for Health Researchers from Foreign Countries
Content
Comparison of two groups.
In particular:
• The simple t-test
• Unequal variance
• Non-normal
• Need for transformation
• Paired data
• Multiple testing
Example
In a genetic inheritance study¹, blood samples of
individuals from several ethnic groups were taken. We shall
compare the groups labeled “Native American” and “Caucasian”
with respect to the variable MSCE (mean sister chromatid
exchange).
_____________ ¹ Margolin (1988) Statistical Science, 3, 351–357
Example – Data
            Caucasian   Native American
            8.27        8.50
            8.20        9.48
            8.25        8.65
            8.14        8.16
            9.00        8.83
            8.10        7.76
            7.20        8.63
            8.32        -
            7.70        -
Average:    8.13        8.57
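The data table can be entered in R as two numeric vectors. This is a minimal sketch; the names `caucas` and `nativeAmer` are the ones used in the R calls later in these slides, and the averages are recomputed as a check on the table.

```r
# MSCE measurements from the data table (9 Caucasians, 7 Native Americans)
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# Group sizes and group averages
n_caucas   <- length(caucas)
n_native   <- length(nativeAmer)
avg_caucas <- mean(caucas)
avg_native <- mean(nativeAmer)

print(round(c(Caucasian = avg_caucas, NativeAmerican = avg_native), 2))
```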
Example – Visualization
Figure: measurements of MSCE, with Caucasians in red and Native Americans in blue. Group averages are indicated by arrows.
But what is the question…?
Statistical hypothesis testing
Hypothesis
• We figure out what typical data would look like if this hypothesis were true.
• We observe the data that we have collected.
• We reject the hypothesis if these differ a lot.
Statistical hypothesis testing - practice
Hypothesis
• A statistic, a number we can calculate on the basis of the data, is known to have a specific distribution if this hypothesis is true.
• We calculate this statistic for the data that we have collected.
• We reject the hypothesis if the calculated statistic is extreme with respect to this distribution.
The hypothesis for the t-test
We aim to reject the null hypothesis:
H0 (null hypothesis): The mean MSCE is the same for Native Americans and Caucasians.
If this is possible, we then accept the alternative hypothesis:
H1 (alternative hypothesis): The mean MSCE is different for Native Americans and Caucasians.
The simple t-test
Student’s t-test statistic² compares the means of the two groups:
$$T = \frac{\text{mean(group 1)} - \text{mean(group 2)}}{\text{standard error}}$$
If T is large in absolute value (far from zero) we reject the hypothesis H0 of identical means, and we accept the alternative hypothesis H1 that the means are different.
If T is small (close to zero) we cannot reject the null hypothesis H0.
______________ ² The statistic was introduced by William Sealy Gosset for cheaply monitoring the quality of beer brews. “Student” was his pen name.
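The statistic can be computed by hand for the MSCE data. The sketch below assumes the equal-variance version of the test, so the standard error is built from the pooled variance; the slide itself does not spell out the standard error, and the variable names are illustrative.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

n1 <- length(caucas)
n2 <- length(nativeAmer)

# Pooled variance: weighted average of the two sample variances
pooled_var <- ((n1 - 1) * var(caucas) + (n2 - 1) * var(nativeAmer)) / (n1 + n2 - 2)

# Standard error of the difference between the two group means
std_error <- sqrt(pooled_var * (1 / n1 + 1 / n2))

# T = (mean(group 1) - mean(group 2)) / standard error
t_stat <- (mean(caucas) - mean(nativeAmer)) / std_error
print(t_stat)
```

The sign of T only reflects which group is entered first; it is the distance from zero that matters for the test.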
Statistical hypothesis testing – t-test
Null hypothesis H0 of identical means
• Student’s t-test statistic has a t distribution, a distribution that is very close to a Normal distribution unless there are very few observations, if the null hypothesis H0 is true.
• We calculate Student’s t-test statistic for the data that we have collected.
• We reject the null hypothesis H0 if the calculated Student’s t-test statistic is extreme with respect to the t distribution.
What is extreme?
• We reject H0 when Student’s t-test statistic is extreme, either positive or negative.
• By convention, the statistic is considered extreme when there is a chance of 5% or less of observing a statistic at least as extreme if the null hypothesis is true.
• P-value: the probability, when the null hypothesis is true, of observing by chance a statistic that is at least as extreme as the one we have calculated in the data.
Example: for an observed statistic of T = 15.9,
P-value = Pr(T ≥ 15.9 or T ≤ -15.9) = 0.0000001 < 0.05
Statistical hypothesis testing – t-test
Null hypothesis H0 of identical means
• Student’s t-test statistic has a t distribution if the null hypothesis H0 is true.
• We calculate Student’s t-test statistic for the data that we have collected.
• We reject the null hypothesis H0 if the p-value is lower than 0.05.
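A two-sided p-value can be read off the t distribution with `pt()`. The sketch below plugs in the statistic and degrees of freedom reported for the MSCE example in these slides (t = -1.7244 with 14 degrees of freedom); the variable names are illustrative.

```r
t_obs <- -1.7244   # observed t statistic for the MSCE example
df    <- 14        # degrees of freedom: 9 + 7 - 2

# Probability of a statistic at least as extreme in either tail:
# Pr(T <= -|t_obs|) + Pr(T >= |t_obs|) = 2 * Pr(T <= -|t_obs|)
p_value <- 2 * pt(-abs(t_obs), df = df)

# Decision at the conventional 5% significance level
reject_h0 <- p_value < 0.05
print(c(p_value = p_value, reject_h0 = reject_h0))
```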
Student’s t-test in R
> t.test(caucas, nativeAmer, var.equal=TRUE)
Two Sample t-test
data: caucas and nativeAmer
t = -1.7244, df = 14, p-value = 0.1066
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.9911730 0.1076809
sample estimates:
mean of x mean of y
8.131111 8.572857
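The printed output can also be accessed programmatically, since `t.test()` returns a list. A minimal sketch, with `res` as an illustrative name:

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

res <- t.test(caucas, nativeAmer, var.equal = TRUE)

t_value  <- unname(res$statistic)   # the t statistic
deg_free <- unname(res$parameter)   # degrees of freedom
p_value  <- res$p.value             # two-sided p-value
ci       <- as.numeric(res$conf.int) # 95% CI for the difference in means
means    <- unname(res$estimate)    # the two group means

print(round(c(t = t_value, df = deg_free, p = p_value), 4))
print(round(ci, 4))
```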
The idea behind Student’s t-test
The variability of the data has two components:
• Random part: Normal measurement errors mixed with Normal random variation.
• Systematic part: Different means for the two groups.
• Under the null hypothesis H0 it is assumed that the systematic part is zero. The Student’s t-test therefore examines if there is evidence of a difference between the two groups beyond what can be expected from the random part alone.
The idea behind Student’s t-test
• With the t-test we test whether the means are different.
• However, we assume that the variances are equal in the two groups; i.e. the Random part does not differ between the groups.
But are the variances really equal?
• If the assumption of equal variance seems questionable, one can use the unequal variance t-test, also known as the Welch (two sample) t-test.
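One informal way to look at this question is to compare the two sample variances directly. The sketch below also calls `var.test()`, the F test for equal variances in base R; this is an illustration that the slides do not prescribe, and with groups this small such a test has little power and is itself sensitive to non-Normality.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# Sample variances of the two groups
var_caucas <- var(caucas)
var_native <- var(nativeAmer)
var_ratio  <- var_caucas / var_native

# F test of the null hypothesis that the two variances are equal
f_res <- var.test(caucas, nativeAmer)

print(round(c(var_caucas = var_caucas, var_native = var_native,
              ratio = var_ratio, p = f_res$p.value), 4))
```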
The unequal variance t-test
> t.test(caucas, nativeAmer)
Welch Two Sample t-test
data: caucas and nativeAmer
t = -1.7009, df = 12.303, p-value = 0.1141
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0060725 0.1225804
sample estimates:
mean of x mean of y
8.131111 8.572857
Unequal variances is the default!
The confidence interval is a little broader: we lose some efficiency.
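The degrees of freedom in the Welch output are not a whole number because they come from the Welch-Satterthwaite approximation. The sketch below recomputes the statistic and the degrees of freedom by hand; the variable names are illustrative.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# Squared standard error of each group mean, estimated separately
se2_1 <- var(caucas) / length(caucas)
se2_2 <- var(nativeAmer) / length(nativeAmer)

# Welch t statistic: no pooling of the variances
t_welch <- (mean(caucas) - mean(nativeAmer)) / sqrt(se2_1 + se2_2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_1 + se2_2)^2 /
  (se2_1^2 / (length(caucas) - 1) + se2_2^2 / (length(nativeAmer) - 1))

p_welch <- 2 * pt(-abs(t_welch), df = df_welch)
print(round(c(t = t_welch, df = df_welch, p = p_welch), 4))
```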
But maybe the random part is not Normal?
A consequence of the Central Limit Theorem
• If we have a lot of observations, Student’s t-test statistic is approximately Normally distributed when the observations are independent and identically distributed.
• Note that the observations do not have to be Normal, or even close to Normal, for the t-test to be a valid test. We just require a lot of observations; in practice not even that many…
• In other words: the t-test is a fine test in many situations.
• However, it may be that a mean is not meaningful for the distributions under consideration (e.g. because of outliers), or that the distributions are different in the two groups (or that we don’t want to bother about all these assumptions on distributions).
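A small simulation can illustrate the point. The sketch below draws both groups from the same clearly non-Normal (exponential) distribution, so the null hypothesis is true, and checks how often the t-test rejects at the 5% level. The distribution, the group size of 50 and the number of repetitions are arbitrary choices made for illustration.

```r
set.seed(1)        # for reproducibility
n_sim <- 2000      # number of simulated studies
n_obs <- 50        # observations per group

# Each repetition: two groups from the same right-skewed distribution
p_values <- replicate(n_sim, {
  x <- rexp(n_obs, rate = 1)
  y <- rexp(n_obs, rate = 1)
  t.test(x, y)$p.value
})

# Proportion of (false) rejections; should be close to the nominal 5%
rejection_rate <- mean(p_values < 0.05)
print(rejection_rate)
```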
Wilcoxon-Mann-Whitney test
• The t-tests only have a clear interpretation if the data are well represented by the mean values and the standard deviations, e.g. when the data within the groups that are compared is Normal. If this is questionable one can instead use Wilcoxon’s rank-sum test.
• Wilcoxon’s rank-sum test is a non-parametric test, which is based solely on the order in which the observations from the two samples fall.
Example – Data
            Caucasian   Rank    Native American   Rank
            8.27        9       8.50              11
            8.20        7       9.48              16
            8.25        8       8.65              13
            8.14        5       8.16              6
            9.00        15      8.83              14
            8.10        4       7.76              3
            7.20        1       8.63              12
            8.32        10      -                 -
            7.70        2       -                 -
Average     8.13        6.78    8.57              10.71
Figure: each block corresponds to an observation, ordered by rank. Caucasians are red and Native Americans are blue.
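The ranks in the table can be reproduced by ranking all 16 observations together. The sketch below also shows how the rank sum of the first group leads to the statistic W that R reports for the rank-sum test (the rank sum minus its smallest possible value); the variable names are illustrative.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# Rank all observations jointly (1 = smallest value)
all_ranks    <- rank(c(caucas, nativeAmer))
ranks_caucas <- all_ranks[seq_along(caucas)]
ranks_native <- all_ranks[-seq_along(caucas)]

# Average rank per group, as in the last row of the table
avg_rank_caucas <- mean(ranks_caucas)
avg_rank_native <- mean(ranks_native)

# W as reported by wilcox.test(): rank sum of the first group
# minus its minimum possible value n1 * (n1 + 1) / 2
n1 <- length(caucas)
w_stat <- sum(ranks_caucas) - n1 * (n1 + 1) / 2

print(round(c(avg_rank_caucas, avg_rank_native, W = w_stat), 2))
```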
However, the question is similar
We aim to reject the null hypothesis:
H0 (null hypothesis): For randomly chosen MSCE values A and B from the Native Americans and Caucasians respectively, the probability that A>B is 0.5.
H1 (alternative hypothesis): For randomly chosen MSCE values A and B from the Native Americans and Caucasians respectively, the probability that A>B is different from 0.5.
• Under the null hypothesis, the sums of the ranks in the two groups, relative to the group sizes (i.e. the average ranks), are expected to be equal.
• The departures from the null hypothesis that the Wilcoxon test tries to detect are, as for the t-test, location shifts, but it does so without further assumptions about the shape of the distribution.
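The probability in these hypotheses can be estimated from the data by comparing every Native American value with every Caucasian value. This is a sketch with illustrative names; it relies on there being no tied values in these data.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# 7 x 9 matrix: TRUE where a Native American value A (row)
# exceeds a Caucasian value B (column)
a_gt_b <- outer(nativeAmer, caucas, ">")

# Estimated probability that A > B over all 63 pairs
prob_a_gt_b <- mean(a_gt_b)

# Number of pairs in which the Caucasian value is the larger one;
# without ties this equals the W reported by wilcox.test(caucas, nativeAmer)
w_stat <- sum(outer(caucas, nativeAmer, ">"))

print(c(prob_a_gt_b = prob_a_gt_b, W = w_stat))
```

An estimate far from 0.5 in either direction is what the test regards as evidence against H0.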
Wilcoxon’s rank sum test
> wilcox.test(caucas, nativeAmer)
Wilcoxon rank sum test
data: caucas and nativeAmer
W = 16, p-value = 0.1142
alternative hypothesis: true location shift is not equal to 0
Wilcoxon vs. t-test – Discussion
• If the data are actually normal the t-test is more efficient than the Wilcoxon test.
• The Wilcoxon test remains applicable regardless of whether the data are normal, notably when there are outliers (Wilcoxon is more robust).
• Note however that when any of these tests detects a difference between two groups, this does not mean that the difference is caused by the grouping; other factors that were not considered may be to blame… (more on this Nov 10th)
Transformations
• Some types of data are inherently right skewed. For instance, measurements of chemical concentrations are almost always right skewed.
• In addition the variance of the associated measurement error often depends on the size of the observation (e.g. the precision of the measurement is given as +/- 5% instead of +/- 5 ppm).
• A logarithmic transformation of the data can remove or reduce the right skewness and stabilize the variance.
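• The statistical analysis can then be conducted using the t-test on the transformed data; means and confidence intervals should be transformed back to the original scale before presentation. After a logarithmic transformation, a back-transformed mean is a geometric mean and a back-transformed difference in means is a ratio of geometric means.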
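The workflow can be sketched as follows. The data here are simulated from a log-normal distribution purely for illustration (they are not from the slides), and all names and parameter values are hypothetical.

```r
set.seed(42)
# Hypothetical right-skewed concentrations in two groups
group_a <- rlnorm(20, meanlog = 1.0, sdlog = 0.5)
group_b <- rlnorm(20, meanlog = 1.4, sdlog = 0.5)

# t-test on the log scale
res <- t.test(log(group_a), log(group_b))

# Back-transform to the original scale:
# exp(mean of logs) is the geometric mean of each group
geo_means <- exp(unname(res$estimate))

# exp(difference of log means) is the ratio of geometric means (a / b),
# and the confidence limits are back-transformed in the same way
ratio    <- geo_means[1] / geo_means[2]
ratio_ci <- exp(as.numeric(res$conf.int))

print(round(c(ratio = ratio, lower = ratio_ci[1], upper = ratio_ci[2]), 3))
```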
Paired data
• Sometimes the data in the two groups result from measuring the same subjects twice (e.g. before and after some treatment), or each subject in one group has a matching subject in the other group (e.g. twin data).
• These data are no longer independent: the two values within a matched pair are more closely related to each other than to values from other pairs.
• A different version of the t-test (or the Wilcoxon test) has to be used (in R add the option: paired=TRUE).
• This test exploits the fact that the variability between the two values within a matched pair is typically much smaller than the variability between values from different pairs.
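A minimal sketch of a paired analysis, using made-up before/after measurements on eight subjects (hypothetical values, not from the slides). It also shows that the paired t-test is the same as a one-sample t-test on the within-pair differences.

```r
# Hypothetical measurements on the same 8 subjects
before <- c(140, 152, 138, 160, 147, 155, 149, 143)
after  <- c(136, 146, 135, 153, 142, 151, 143, 138)

# Paired t-test: uses the within-subject differences
paired_res <- t.test(before, after, paired = TRUE)

# Equivalent: one-sample t-test of the differences against zero
diff_res <- t.test(before - after)

# Ignoring the pairing (incorrect here) mixes in between-subject variation
unpaired_res <- t.test(before, after)

print(c(paired = paired_res$p.value, unpaired = unpaired_res$p.value))
```

In this constructed example the subjects differ a lot from each other but change consistently, so only the paired analysis picks up the change.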
Reporting the result of a statistical test
• The p-value is the probability of having observed the data (or more extreme data) when the null hypothesis is true. The result of the test is called statistically significant at the significance level 5% if the p-value is smaller than 0.05.
• A 95% confidence interval for a parameter estimate is computed on the basis of the data. It is constructed so that, if the study were repeated many times, 95% of the intervals obtained in this way would contain the true parameter.
• Confidence intervals are in many cases – especially when the statistic has a (clinical) interpretation – more informative and should also be given where possible.
Chance findings
Even if there is no real difference between two groups, it is possible that significant results occur…
…we set the probability of this to 5% (significance level).
But also when there is a real difference, it is possible that our test is not significant...
…the probability of this depends on the number of observations and the size of the difference.
Trusting the test result?
When conducting a statistical test one can fall victim to either of the following two errors:
1. A true null hypothesis is rejected (false positive or type I error).
2. A false null hypothesis is not rejected (false negative or type II error).
                        True state of the Null Hypothesis
Statistical decision    H0 true         H0 false
Reject H0               Type I error    Correct
Do not reject H0        Correct         Type II error
Multiple testing
• Often we perform a lot of hypothesis tests simultaneously.
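• This increases the chance of having at least one false positive finding in the statistical analysis: each single test of a true null hypothesis already has a 1 in 20 chance of being significant at the 5% level.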
• Suppose you have 10,000 genes on a chip and not a single one is actually differentially expressed. You would expect 10,000*0.05 = 500 of them to have a p-value < 0.05.
• An individual p-value of e.g. 0.04 no longer corresponds to a significant finding once the number of tests performed is taken into account.
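The gene chip example can be mimicked by simulation. The sketch below uses the fact that p-values are uniformly distributed between 0 and 1 when the null hypothesis is true, and also computes the chance of at least one false positive among M independent tests, 1 - 0.95^M. The seed and the values of M are arbitrary.

```r
set.seed(2024)
n_genes <- 10000

# Under the null hypothesis, p-values are uniform on (0, 1)
p_null <- runif(n_genes)

# Number of genes that look "significant" purely by chance
n_false_pos <- sum(p_null < 0.05)

# Chance of at least one false positive among M independent tests
m    <- c(1, 5, 10, 20, 100)
fwer <- 1 - (1 - 0.05)^m

print(n_false_pos)
print(round(setNames(fwer, paste0("M=", m)), 3))
```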
The Bonferroni correction
• To control the risk of having at least one chance finding in a statistical analysis with M simultaneous tests, one divides the significance level by M.
• For example, to test two independent hypotheses on the same data at an overall 0.05 significance level, instead of using a p-value threshold of 0.05, one would use a stricter threshold of 0.025.
• Alternatives to the Bonferroni correction exist (e.g. the false discovery rate by Benjamini & Hochberg), but none can resolve the problem without reducing the p-value threshold employed.
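In R, both corrections are available through `p.adjust()`, which adjusts the p-values upward instead of lowering the threshold; the two views give the same decisions. The four p-values below are hypothetical, chosen only for illustration.

```r
p_raw <- c(0.001, 0.012, 0.04, 0.2)   # hypothetical raw p-values
alpha <- 0.05
m     <- length(p_raw)

# Bonferroni, threshold view: compare each p-value to alpha / M
reject_threshold <- p_raw < alpha / m

# Bonferroni, adjusted p-value view: multiply by M (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Benjamini & Hochberg false discovery rate adjustment
p_bh <- p.adjust(p_raw, method = "BH")

print(rbind(raw = p_raw, bonferroni = p_bonf, BH = round(p_bh, 4)))
```

Note that the raw p-value of 0.04 is below 0.05 but does not survive either correction in this example.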
Example: Weiler et al. 2008, Growth Dev Aging,71(1):35-43.
Objective: To compare bone mass in newborn infants of First Nations, white and Asian mothers while accounting for vitamin D status. Fifty infants born healthy at term age were measured for bone mass using dual energy x-ray absorptiometry (DXA) within 15 days of life. Vitamin D status was measured as 25(OH)D in cord plasma. White infants were separated based on 25(OH)D concentrations into sufficient and insufficient (< 32.5 nmol/L) to match for vitamin D status of the Asian infants and the First Nations group.
Differences among groups were tested using ANOVA and post hoc testing with Bonferroni multiple comparisons test. There were no differences in whole body, spine or femur BMC between the white sufficient and insufficient infants. However, the Asian infants had lower (P < 0.01) spine BMC compared to the white infants and the First Nations infants were intermediate. No differences among the ethnic groups were observed for whole body or femur BMC. These data suggest that white and First Nations newborn infants have comparable bone mass. Asian infants have lower spine bone mass which is more than a factor of body size and independent of vitamin D status at birth.
Power calculations
• Power: the probability of detecting an effect of a certain magnitude when such an effect really exists.
• Often, before any data are collected, we can already state what the variance in the data is likely to be, what magnitude of effect we would at least want to detect, and with what probability.
• The probability of detection depends on the number of observations N, and we can calculate the N that achieves the desired power; this is a power calculation.
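Base R offers `power.t.test()` for this calculation in the two-group setting. The sketch below assumes, purely for illustration, that we want to detect a difference of 0.5 between two groups with standard deviation 0.5 (roughly the spread seen in the MSCE example) with 80% power at the 5% significance level.

```r
# Leave n unspecified so that power.t.test() solves for it
design <- power.t.test(delta = 0.5,      # smallest difference worth detecting (assumed)
                       sd = 0.5,         # anticipated standard deviation (assumed)
                       sig.level = 0.05, # significance level
                       power = 0.80)     # desired power

# Required number of observations per group, rounded up
n_per_group <- ceiling(design$n)
print(n_per_group)

# Conversely: the power achieved with a given group size, here 8 per group
achieved <- power.t.test(n = 8, delta = 0.5, sd = 0.5, sig.level = 0.05)$power
print(round(achieved, 2))
```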
To test more than 2 groups (ANOVA)
• F-test (corresponds to t-test)
• Kruskal-Wallis test (corresponds to Wilcoxon test)
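The correspondence can be checked on the two-group MSCE data: with only two groups, the ANOVA F statistic equals the square of the equal-variance t statistic and gives the same p-value. The sketch below uses `oneway.test()` and `kruskal.test()` from base R; the same calls accept a grouping factor with more than two levels.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# Long format: one vector of values and one grouping factor
msce  <- c(caucas, nativeAmer)
group <- factor(rep(c("Caucasian", "Native American"),
                    times = c(length(caucas), length(nativeAmer))))

# One-way ANOVA F-test assuming equal variances
f_res <- oneway.test(msce ~ group, var.equal = TRUE)

# Equal-variance t-test for comparison
t_res <- t.test(caucas, nativeAmer, var.equal = TRUE)

# Kruskal-Wallis test, the rank-based analogue
kw_res <- kruskal.test(msce ~ group)

print(c(F = unname(f_res$statistic), t_squared = unname(t_res$statistic)^2))
print(c(p_anova = f_res$p.value, p_kruskal = kw_res$p.value))
```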
Take home message
“p < 0.001, aha: if the null hypothesis was true, then, most likely, the data would have looked different.
It is therefore reasonable to conclude that the null hypothesis is false.”