Basic Statistics and Data Analysis for Health Researchers from...


Department of Biostatistics

Volkert Siersma [email protected]

The Research Unit for General Practice in Copenhagen

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries




Content

Comparison of two groups.

In particular:

• The simple t-test

• Unequal variance

• Non-normal

• Need for transformation

• Paired data

• Multiple testing



Example

In a genetic inheritance study¹, blood samples of individuals from several ethnic groups were taken. We shall compare the groups labeled “Native American” and “Caucasian” with respect to the variable MSCE (mean sister chromatid exchange).

_____________ ¹ Margolin (1988), Statistical Science, 3, 351–357.



Example – Data

           Caucasian   Native American
           8.27        8.50
           8.20        9.48
           8.25        8.65
           8.14        8.16
           9.00        8.83
           8.10        7.76
           7.20        8.63
           8.32        -
           7.70        -
Average:   8.13        8.57
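To follow the analyses in R, the table can be typed in as two numeric vectors. The vector names caucas and nativeAmer match those used in the R calls on later slides; how the data were originally loaded is not shown, so entering them by hand is an assumption.

```r
# MSCE measurements typed in from the data table
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# Group sizes and group averages
n_caucas    <- length(caucas)
n_native    <- length(nativeAmer)
mean_caucas <- mean(caucas)
mean_native <- mean(nativeAmer)

print(round(c(Caucasian = mean_caucas, NativeAmerican = mean_native), 2))
```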



Example – Visualization

[Figure: measurements of MSCE, with Caucasians in red and Native Americans in blue. Group averages are indicated by arrows.]

But what is the question…?


Statistical hypothesis testing


Hypothesis

We figure out what typical data would be if this hypothesis were true

We observe the data that we have collected

We reject the hypothesis if these differ a lot


Statistical hypothesis testing - practice


Hypothesis

It has been figured out that a statistic – a number we can calculate on the basis of the data – has a specific distribution if this hypothesis were true

We calculate this statistic for the data that we have collected

We reject the hypothesis if the calculated statistic is extreme with respect to this distribution



The hypothesis for the t-test

We aim to reject the null hypothesis:

H0 (null hypothesis): The mean MSCE is the same for Native Americans and Caucasians.

If this is possible, we then accept the alternative hypothesis:

H1 (alternative hypothesis): The mean MSCE is different for Native Americans and Caucasians.



The simple t-test

Student’s t-test statistic² compares the means of the two groups:

$$T = \frac{\operatorname{mean}(\text{group 1}) - \operatorname{mean}(\text{group 2})}{\text{standard error}}$$

If T is large in absolute value (strongly positive or strongly negative), we reject the hypothesis H0 of identical means, and we accept the alternative hypothesis H1 that the means are different.

If T is close to zero, we cannot reject the null hypothesis H0.

______________ ² The statistic was introduced by William Sealy Gosset for cheaply monitoring the quality of beer brews. “Student” was his pen name.
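The statistic can be computed by hand for the MSCE data. The slide does not spell out the standard error; the sketch below assumes the pooled (equal variance) standard error, which is the one used by the simple two-sample t-test.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

n1 <- length(caucas)
n2 <- length(nativeAmer)

# Pooled variance: weighted average of the two group variances
pooled_var <- ((n1 - 1) * var(caucas) + (n2 - 1) * var(nativeAmer)) / (n1 + n2 - 2)

# Standard error of the difference between the two group means
std_error <- sqrt(pooled_var * (1 / n1 + 1 / n2))

# T = difference in means divided by its standard error
t_stat <- (mean(caucas) - mean(nativeAmer)) / std_error
df     <- n1 + n2 - 2

print(c(t = t_stat, df = df))
```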


Statistical hypothesis testing – t-test


Null hypothesis H0 of identical means

Student’s t-test statistic has a t distribution – a distribution that is very close to a Normal distribution – if the null hypothesis H0 were true

We calculate Student’s t-test statistic for the data that we have collected

We reject the null hypothesis H0 if the calculated Student’s t-test statistic is extreme with respect to the t distribution


What is extreme?

• We reject H0 when Student’s t-test statistic is extreme, either positive or negative.

• A convention is that it is extreme when there is a chance of 5% or less that we would observe a statistic at least as extreme if the null hypothesis were true.

• P-value: the probability, when the null hypothesis is true, of observing by chance a statistic at least as extreme as the one we have calculated from the data.

Example: for an observed statistic T = 15.9,

P-value = Pr(T ≥ 15.9 or T ≤ -15.9) = 0.0000001 < 0.05
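A two-sided p-value adds up the probability in both tails of the t distribution. As a sketch, the lines below use R's pt and qt functions with t = -1.7244 and 14 degrees of freedom, the values that R reports for the MSCE data later in these slides.

```r
t_obs <- -1.7244  # observed t statistic for the MSCE data
df    <- 14       # degrees of freedom: 9 + 7 - 2

# Two-sided p-value: probability of a statistic at least as extreme
# as |t_obs| in either tail of the t distribution
p_value <- 2 * pt(-abs(t_obs), df)

# Critical value: |T| must exceed this to be significant at the 5% level
crit <- qt(0.975, df)

print(c(p_value = p_value, critical_value = crit))
```

Since |t_obs| is smaller than the critical value, the p-value is above 0.05 and H0 is not rejected.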


Statistical hypothesis testing – t-test


Null hypothesis H0 of identical means

Student’s t-test statistic has a t distribution if the null hypothesis H0 were true

We calculate Student’s t-test statistic for the data that we have collected

We reject the null hypothesis H0 if the p-value is lower than 0.05.



Student’s t-test in R

> t.test(caucas, nativeAmer, var.equal=TRUE)

        Two Sample t-test

data:  caucas and nativeAmer
t = -1.7244, df = 14, p-value = 0.1066
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9911730  0.1076809
sample estimates:
mean of x mean of y
 8.131111  8.572857



The idea behind Student’s t-test

The variability of the data has two components:

• Random part: Normal measurement errors mixed with Normal random variation.

• Systematic part: Different means for the two groups.

• Under the null hypothesis H0 it is assumed that the systematic part is zero. The Student’s t-test therefore examines if there is evidence of a difference between the two groups beyond what can be expected from the random part alone.



The idea behind Student’s t-test

• With the t-test we test whether the means are different.

• However, we assume that the variances are equal in the two groups; i.e. the Random part does not differ between the groups.



But are the variances really equal?

• If the assumption of equal variance seems questionable, one can use the unequal variance t-test, the Welch (two sample) t-test.



The unequal variance t-test

> t.test(caucas, nativeAmer)

        Welch Two Sample t-test

data:  caucas and nativeAmer
t = -1.7009, df = 12.303, p-value = 0.1141
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.0060725  0.1225804
sample estimates:
mean of x mean of y
 8.131111  8.572857

Unequal variances are the default in R’s t.test!

The confidence interval is a little wider than for the equal variance t-test: we lose some efficiency.
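The non-integer degrees of freedom in the Welch output come from the Welch-Satterthwaite approximation. As a sketch (the formula is the standard one, it is not shown on the slides), both the statistic and the degrees of freedom can be reproduced by hand:

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

n1 <- length(caucas)
n2 <- length(nativeAmer)

# Each group contributes its own variance of the mean (no pooling)
v1 <- var(caucas) / n1
v2 <- var(nativeAmer) / n2

se_welch <- sqrt(v1 + v2)
t_welch  <- (mean(caucas) - mean(nativeAmer)) / se_welch

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

print(c(t = t_welch, df = df_welch))
```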



But maybe the random part is not Normal?


A consequence of the Central Limit Theorem

• If we have a lot of observations, Student’s t-test statistic is approximately Normally distributed when the observations are independent and identically distributed.

• Note that the observations do not have to be Normal, or even close to Normal, for the t-test to be a valid test. We just require a lot of observations; in practice not even so many…

• In other words: the t-test is a fine test in many situations.

• However, it may be that a mean does not have meaning in the distributions under consideration (e.g. because of outliers), or that the distributions are different in the two groups (or that we don’t want to bother about all these assumptions on distributions).
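The Central Limit Theorem argument can be checked with a small simulation. The sketch below is an illustration with made-up settings (exponential data, 30 observations per group, 2000 repetitions, all assumptions): both groups come from the same skewed distribution, so H0 is true and about 5% of the tests should reject.

```r
set.seed(1)

n_per_group <- 30    # assumed group size
n_sim       <- 2000  # assumed number of simulated studies

# Both groups are drawn from the same right-skewed (exponential)
# distribution, so every rejection is a false positive
p_values <- replicate(n_sim, {
  x <- rexp(n_per_group)
  y <- rexp(n_per_group)
  t.test(x, y)$p.value
})

# Proportion of simulated studies that reject H0 at the 5% level
rejection_rate <- mean(p_values < 0.05)
print(rejection_rate)
```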




Wilcoxon-Mann-Whitney test

• The t-tests only have a clear interpretation if the data are well represented by the mean values and the standard deviations, e.g. when the data within the groups that are compared are Normally distributed. If this is questionable one can instead use Wilcoxon’s rank-sum test.

• Wilcoxon’s rank-sum test is a non-parametric test, which is based solely on the order in which the observations from the two samples fall.



Example – Data

           Caucasian   Rank    Native American   Rank
           8.27         9      8.50              11
           8.20         7      9.48              16
           8.25         8      8.65              13
           8.14         5      8.16               6
           9.00        15      8.83              14
           8.10         4      7.76               3
           7.20         1      8.63              12
           8.32        10      -                  -
           7.70         2      -                  -
Average    8.13         6.78   8.57              10.71

Visualization:

[Figure: each block corresponds to an observation, ordered by rank. Caucasians are red and Native Americans are blue.]
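The ranks in the table can be reproduced with R's rank function applied to the pooled data. The sketch below also computes the statistic W that wilcox.test reports, which is the rank sum of the first sample minus its smallest possible value n1(n1+1)/2.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# Rank all 16 observations together (1 = smallest value)
all_ranks    <- rank(c(caucas, nativeAmer))
ranks_caucas <- all_ranks[seq_along(caucas)]
ranks_native <- all_ranks[-seq_along(caucas)]

sum_caucas <- sum(ranks_caucas)
sum_native <- sum(ranks_native)

# W for the first sample: rank sum minus its minimum possible value
n1 <- length(caucas)
W  <- sum_caucas - n1 * (n1 + 1) / 2

print(c(mean_rank_caucas = mean(ranks_caucas),
        mean_rank_native = mean(ranks_native),
        W = W))
```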



However, the question is similar

We aim to reject the null hypothesis:

H0 (null hypothesis): For randomly chosen MSCE values A and B from the Native Americans and the Caucasians, respectively, the probability that A > B is 0.5.

H1 (alternative hypothesis): For randomly chosen MSCE values A and B from the Native Americans and the Caucasians, respectively, the probability that A > B is different from 0.5.

• Under the null hypothesis, the average rank (the rank sum divided by the group size) is expected to be the same in the two groups.

• The departures from the null hypothesis that the Wilcoxon test tries to detect are also location shifts, but it does so without further assumptions about the shape of the distribution.
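The probability in these hypotheses can be estimated directly from the data by comparing every Native American value with every Caucasian value. This is a sketch for illustration; the estimate is not reported on the slides.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# 7 x 9 table of all pairs: TRUE where the Native American value (A)
# exceeds the Caucasian value (B)
a_greater_b <- outer(nativeAmer, caucas, ">")

n_pairs  <- length(a_greater_b)  # 7 * 9 = 63 pairs
n_wins   <- sum(a_greater_b)     # pairs with A > B
prob_hat <- mean(a_greater_b)    # estimate of Pr(A > B)

print(c(pairs = n_pairs, A_greater = n_wins, prob_hat = prob_hat))
```

An estimate far from 0.5 speaks against H0; whether it is far enough is what the rank sum test decides.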



Wilcoxon’s rank sum test

> wilcox.test(caucas, nativeAmer)

        Wilcoxon rank sum test

data:  caucas and nativeAmer
W = 16, p-value = 0.1142
alternative hypothesis: true location shift is not equal to 0



Wilcoxon vs. t-test – Discussion

• If the data are actually Normal, the t-test is more efficient than the Wilcoxon test.

• The Wilcoxon test remains applicable regardless of whether the data are Normal, notably when there are outliers (Wilcoxon is more robust).

• Note however that all these tests may detect differences between two groups, which does not mean that these differences are caused by the grouping; other factors that were not considered may be to blame… (more on this Nov 10th)



Transformations

• Some types of data are inherently right skewed. For instance, measurements of chemical concentrations are almost always right skewed.

• In addition the variance of the associated measurement error often depends on the size of the observation (e.g. the precision of the measurement is given as +/- 5% instead of +/- 5 ppm).

• A logarithmic transformation of the data can remove or reduce the right skewness and stabilize the variance.

• The statistical analysis can then be conducted using the t-test on the transformed data; means and confidence intervals should be transformed back to the original scale before presentation.
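A sketch of this workflow in R, using simulated log-normal concentrations (the data and the names group_a and group_b are hypothetical, not from the slides). On the original scale, the back-transformed difference in log-means is a ratio of geometric means.

```r
set.seed(42)

# Hypothetical right-skewed concentrations (simulated for illustration)
group_a <- rlnorm(25, meanlog = 1.0, sdlog = 0.6)
group_b <- rlnorm(25, meanlog = 1.4, sdlog = 0.6)

# t-test on the log-transformed data
fit <- t.test(log(group_a), log(group_b))

# Back-transform: a difference of log-means becomes a ratio of
# geometric means, and the confidence limits are exponentiated too
ratio_estimate <- exp(mean(log(group_a)) - mean(log(group_b)))
ratio_ci       <- exp(as.numeric(fit$conf.int))

print(c(ratio = ratio_estimate, lower = ratio_ci[1], upper = ratio_ci[2]))
```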



Paired data

• Sometimes the data in the two groups are the result of measuring the same subjects twice (e.g. before and after some treatment), or each subject in one group has a matching subject in the other group (e.g. twin data).

• These data are no longer independent: data from a matched pair are more related to each other than to data from other pairs.

• A different version of the t-test (or the Wilcoxon test) has to be used (in R add the option: paired=TRUE).

• This test exploits the fact that the variability of the values within a matched pair will be much smaller than the variability of the values from different pairs.
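A small sketch with made-up before/after measurements (the numbers are hypothetical) shows what paired=TRUE does: the paired t-test is the same as a one-sample t-test on the within-pair differences, and it can be much more sensitive than the unpaired test.

```r
# Hypothetical measurements on six subjects (made up for illustration)
before <- c(140, 152, 138, 147, 160, 155)
after  <- c(135, 150, 136, 140, 158, 149)

# Paired t-test: uses the pairing of the observations
paired_test <- t.test(after, before, paired = TRUE)

# Equivalent one-sample t-test on the within-subject differences
diff_test <- t.test(after - before)

# Ignoring the pairing (not appropriate for these data)
unpaired_test <- t.test(after, before)

print(c(paired   = paired_test$p.value,
        diffs    = diff_test$p.value,
        unpaired = unpaired_test$p.value))
```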



Reporting the result of a statistical test

• The p-value is the probability of having observed the data (or more extreme data) when the null hypothesis is true. The result of the test is called statistically significant at the significance level 5% if the p-value is smaller than 0.05.

• A 95% confidence interval for a parameter estimate is computed on the basis of the data by a procedure that, over many repeated studies, produces intervals containing the true parameter 95% of the time.

• Confidence intervals are in many cases – especially when the statistic has a (clinical) interpretation – more informative and should also be given where possible.



Chance findings

Even if there is no real difference between two groups, it is possible that significant results occur…

…we set the probability of this to 5% (significance level).

But also when there is a real difference, it is possible that our test is not significant...

…the probability of this depends on the number of observations and the size of the difference.



Trusting the test result?

When conducting a statistical test one can fall victim to one of the following two errors:

1. A true null hypothesis is rejected (false positive or type I error).

2. A false null hypothesis is not rejected (false negative or type II error).

                        True state of the null hypothesis
Statistical decision    H0 true         H0 false
Reject H0               Type I error    Correct
Do not reject H0        Correct         Type II error



Multiple testing

• Often we perform a lot of hypothesis tests simultaneously.

• This increases the chance of having at least one false positive finding in the statistical analysis (each single test of a true null hypothesis already has a 1 in 20 chance of one).

• Suppose you have 10,000 genes on a chip and not a single one is actually differentially expressed. You would expect 10,000*0.05 = 500 of them to have a p-value < 0.05.

• An individual p-value of e.g. 0.04 no longer corresponds to a significant finding once the number of tests performed is taken into account.
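The arithmetic behind these statements can be written out in a few lines of R. The sketch assumes that the tests are independent and that all null hypotheses are true; the numbers of tests in n_tests are chosen for illustration.

```r
alpha   <- 0.05
n_tests <- c(1, 5, 20, 100)

# Probability of at least one false positive among n independent tests
prob_any_false_positive <- 1 - (1 - alpha)^n_tests
print(round(setNames(prob_any_false_positive, n_tests), 3))

# Expected number of false positives for the gene chip example
expected_false_positives <- 10000 * alpha
print(expected_false_positives)
```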



The Bonferroni correction

• To keep the risk of having one or more chance findings in a statistical analysis with M simultaneous tests at or below the significance level, one divides the significance level by M for each individual test.

• For example, to test two independent hypotheses on the same data at an overall 0.05 significance level, instead of using a p-value threshold of 0.05, one would use a stricter threshold of 0.025.

• Alternatives to the Bonferroni correction exist (e.g. the false discovery rate by Benjamini & Hochberg), but none can resolve the problem without reducing the p-value threshold employed.
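In R, both corrections are available through p.adjust, which adjusts the p-values upward so they can still be compared with 0.05 (equivalent to lowering the threshold). The four p-values below are hypothetical, chosen only to show the effect.

```r
# Hypothetical p-values from M = 4 tests (made up for illustration)
p_raw <- c(0.001, 0.012, 0.04, 0.2)

# Bonferroni: multiply each p-value by M (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Benjamini & Hochberg: controls the false discovery rate instead
p_bh <- p.adjust(p_raw, method = "BH")

print(data.frame(raw = p_raw, bonferroni = p_bonf, BH = p_bh))
```

The raw p-value of 0.04 is no longer below 0.05 after either adjustment.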



Example: Weiler et al. 2008, Growth Dev Aging,71(1):35-43.

Objective: To compare bone mass in newborn infants of First Nations, white and Asian mothers while accounting for vitamin D status. Fifty infants born healthy at term age were measured for bone mass using dual energy x-ray absorptiometry (DXA) within 15 days of life. Vitamin D status was measured as 25(OH)D in cord plasma. White infants were separated based on 25(OH)D concentrations into sufficient and insufficient (< 32.5 nmol/L) to match for vitamin D status of the Asian infants and the First Nations group.

Differences among groups were tested using ANOVA and post hoc testing with Bonferroni multiple comparisons test. There were no differences in whole body, spine or femur BMC between the white sufficient and insufficient infants. However, the Asian infants had lower (P < 0.01) spine BMC compared to the white infants and the First Nations infants were intermediate. No differences among the ethnic groups were observed for whole body or femur BMC. These data suggest that white and First Nations newborn infants have comparable bone mass. Asian infants have lower spine bone mass which is more than a factor of body size and independent of vitamin D status at birth.


Power calculations

• Power: the probability of detecting an effect of a certain magnitude when it really exists.

• Often, before the data are collected, we can already make a statement on what the variance in the data will be, what magnitude of effect we would at least want to detect, and with what probability.

• This detection depends on the number of observations N, and we may calculate the N which achieves this; this is a power calculation.
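A power calculation for the two-sample t-test can be sketched with R's power.t.test. The inputs are assumptions for illustration: a standard deviation of 0.5 (roughly what the MSCE data show), a smallest difference of interest of 0.5, a 5% significance level and 80% power.

```r
# Required sample size per group for the assumed inputs
res <- power.t.test(delta = 0.5, sd = 0.5, sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(res$n)
print(n_per_group)

# The other way round: power achieved with 8 subjects per group,
# about the size of the groups in the MSCE example
power_n8 <- power.t.test(n = 8, delta = 0.5, sd = 0.5, sig.level = 0.05)$power
print(power_n8)
```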




To test more than 2 groups (ANOVA)

• F-test (corresponds to t-test)

• Kruskal-Wallis test (corresponds to Wilcoxon test)
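The correspondence can be checked on the two MSCE groups, since both tests also work with only two groups. In this sketch, the F statistic from a one-way ANOVA equals the square of the equal variance t statistic, and kruskal.test plays the role of the rank sum test.

```r
caucas     <- c(8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70)
nativeAmer <- c(8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63)

# Long format: one value column and one grouping factor
msce  <- c(caucas, nativeAmer)
group <- factor(rep(c("Caucasian", "NativeAmerican"),
                    times = c(length(caucas), length(nativeAmer))))

# One-way ANOVA F-test
f_stat <- anova(lm(msce ~ group))[1, "F value"]

# Equal variance t-test on the same data
t_stat <- unname(t.test(caucas, nativeAmer, var.equal = TRUE)$statistic)

# Kruskal-Wallis test, the rank-based counterpart
kw <- kruskal.test(msce ~ group)

print(c(F = f_stat, t_squared = t_stat^2, kruskal_p = kw$p.value))
```

With three or more groups the same two calls apply unchanged; only the factor has more levels.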



Take home message

“p < 0.001, aha: if the null hypothesis was true, then, most likely, the data would have looked different.

It is therefore reasonable to conclude that the null hypothesis is false.”

Date post:	27-May-2020