Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students
Lecturer: Lorenzo Marini
DAFNAE, University of Padova, Viale dell'Università 16, 35020 Legnaro, Padova.
E-mail: [email protected]
Tel.: +39 0498272807
http://www.biodiversity-lorenzomarini.eu/
Session 2
Lecture: Introduction to statistical hypothesis testing
Null and alternative hypothesis. Types of error. Two-sample hypotheses. Correlation. Analysis of frequency data. Introduction to statistical modeling.
A statistical hypothesis test is a method of making statistical decisions from and about experimental data. Null-hypothesis testing just answers the question of "how well do the findings fit the possibility that chance factors alone might be responsible?"

[Diagram: Population → (sampling) → Sample → Statistical Model → estimation (uncertainty!!!) and testing → inference about the population]
Statistical testing in five steps:
1. Construct a null hypothesis (H0) and an alternative hypothesis
2. Choose a statistical analysis (assumptions!!!)
3. Collect the data (sampling)
4. Calculate P-value and test statistic
5. Reject (H0) if P is small; accept (H0) if P is large
Key concepts: Session 1
Concept of replication vs. pseudoreplication:
1. Spatial dependence (e.g. spatial autocorrelation)
2. Temporal dependence (e.g. repeated measures)
3. Biological dependence (e.g. siblings)
Key quantities:

residual = y_i − mean
deviance = SS = Σ_i (y_i − mean)²
var = Σ_i (y_i − mean)² / (n − 1)

Remember the order!!!

[Scatterplot of y against x, n = 6, illustrating the residuals around the mean]
1. Constructing and testing a hypothesis

Hypothesis: a statement about events in the real world that lends itself to being confirmed or refuted by experimentally observed data.

Example: male and female students get the same marks.
Null hypothesis (H0): a statement about the population that is assumed to be true until there is clear evidence to the contrary (status quo, absence of effect, etc.).

Alternative hypothesis (Ha): a statement about the population that contradicts the null hypothesis and that is accepted only when there is clear evidence in its favour.
A hypothesis test consists of a decision between H0 and Ha:

1. Reject H0 (and therefore accept Ha)
2. Accept H0 (and therefore reject Ha)
Inferential statistics lets us quantify probabilities in order to decide whether to accept or reject the null hypothesis: how credible is H0?

Significance level (alpha): a probability for rejecting the null hypothesis that must be defined a priori.

The significance level of a test is the probability of rejecting H0 when it is actually true (how confident are we in our conclusions?). The smaller alpha is, the greater the certainty in rejecting the null hypothesis. The most common values are 10%, 5%, 1% and 0.1%.
Hypothesis testing
• 1 – Hypothesis formulation (null hypothesis H0 vs. alternative hypothesis H1)
• 2 – Compute the probability P. The P-value is often described as the probability of seeing results as extreme as, or more extreme than, those actually observed if the null hypothesis were true.
• 3 – If this probability is lower than a defined threshold (level of significance: 0.01, 0.05), we can reject the null hypothesis.
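As a minimal sketch of these steps in R (the two samples below are simulated purely for illustration):

```r
# Hypothetical example: compare two simulated samples
set.seed(1)                 # for reproducibility
a <- rnorm(20, mean = 10)   # sample from population 1
b <- rnorm(20, mean = 12)   # sample from population 2
test <- t.test(a, b)        # step 2: compute the test statistic and P
test$p.value                # step 3: reject H0 if P is below alpha (e.g. 0.05)
```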
Hypothesis testing: Types of error
As power increases, the chance of a Type II error decreases.
Statistical power depends on:
- the statistical significance criterion used in the test;
- the size of the difference or the strength of the similarity (effect size);
- the variability of the population;
- the sample size;
- the type of test.
Statistical analyses
Mean comparisons for 2 populations: test the difference between the means drawn from two samples.

Correlation: in probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation refers to the departure of two variables from independence.

Analysis of count or proportion data: whole (integer) numbers (not continuous, with different distributional properties) or proportions.
Mean comparisons for 2 samples
Assumptions
• Independence of cases (work with true replications!!!) – this is a requirement of the design.
• Normality – the distributions in each of the groups are normal.
• Homogeneity of variances – the variance of data in the groups should be the same (use Fisher's test or Fligner's test for homogeneity of variances).
• Together these form the common assumption that the errors are independently, identically, and normally distributed.
H0: means do not differ H1: means differ
The t test
[Histogram of mass vs. frequency]
Normality
Before we can carry out a test that assumes normality of the data, we need to check the distribution of our data (though not always beforehand!!!).
Graphical analysis
Shapiro–Wilk normality test: shapiro.test()
Test for normality
In many cases we must check this assumption after having fitted the model (e.g. regression or multifactorial ANOVA).
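Both checks can be sketched in R (the skewed sample below is simulated for illustration):

```r
# Hypothetical example: graphical check plus Shapiro-Wilk test
set.seed(1)
y <- rlnorm(100)        # log-normal data: clearly right-skewed
hist(y)
lines(density(y))       # graphical analysis
shapiro.test(y)         # small P-value: normality rejected
shapiro.test(log(y))    # log-transformed data should look normal
```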
Skew + kurtosis (t test)
[Normal Q-Q plot: norm quantiles vs. observed quantiles]
hist(y)
lines(density(y))

library(car)
qqPlot(y)   # qq.plot() in older versions of car; or qqnorm(y)
RESIDUALS MUST BE NORMAL
Normality: Histogram
[Histograms: fishes$mass is right-skewed; log(fishes$mass) is approximately symmetric]
Normality: Histogram
library(animation)
ani.options(nmax = 2000 + 15 - 2, interval = 0.003)
freq = quincunx(balls = 2000, col.balls = rainbow(1))
# frequency table
barplot(freq, space = 0)
Normal distribution must be symmetrical around the mean
Normality: Q-Q Plot
[Q-Q plots of fishes$mass and log(fishes$mass) against normal quantiles]
Normality: Quantile-Quantile Plot
Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. The quantiles are the data values marking the boundaries between consecutive subsets
Normality
In case of non-normality there are 2 possible approaches:

1. Change the distribution (use GLMs):
   e.g. Poisson (count data), binomial (proportion data).

2. Data transformation:
   logarithmic (skewed data), square-root, arcsine (percentages), probit (proportions), Box-Cox transformation (advanced statistics).

[Histograms of mass and fishes$logmass, before and after log transformation]
Homogeneity of variance: two samples
Before we can carry out a test to compare two sample means, we need to test whether the sample variances are significantly different. The test could not be simpler: it is called Fisher's F test. To compare two variances, you simply divide the larger variance by the smaller one.

The test can be carried out with the var.test() function.

F calculated: F <- var(A) / var(B)
F critical: qf(0.975, nA - 1, nB - 1)

If the calculated F is larger than the critical value, we reject the null hypothesis.
E.g. Students from TESAF vs. Students from DAFNAE
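A sketch of the same logic in R (groups A and B below are made-up data, the first deliberately more variable):

```r
# Hypothetical example: Fisher's F test for two variances
set.seed(1)
A <- rnorm(15, sd = 3)   # more variable group
B <- rnorm(12, sd = 1)   # less variable group
F <- var(A) / var(B)                           # larger variance on top
qf(0.975, length(A) - 1, length(B) - 1)        # critical value to compare with
var.test(A, B)                                 # same test, with a P-value
```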
Homogeneity of variance: > two samples
It is important to know whether variance differs significantly from sample to sample. Constancy of variance (homoscedasticity) is the most important assumption underlying regression and analysis of variance. For multiple samples you can choose between the Bartlett test and the Fligner–Killeen test.

bartlett.test(response, factor)
fligner.test(response, factor)

There are differences between the tests: Fisher's and Bartlett's tests are very sensitive to outliers, whereas the Fligner–Killeen test is not.
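For example (three made-up groups, the third deliberately more variable):

```r
# Hypothetical example: homogeneity of variance for more than two samples
set.seed(1)
response <- c(rnorm(10, sd = 1), rnorm(10, sd = 1), rnorm(10, sd = 3))
groups   <- gl(3, 10)             # grouping factor: 3 levels, 10 cases each
bartlett.test(response, groups)   # sensitive to outliers
fligner.test(response, groups)    # robust (rank-based) alternative
```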
Mean comparison
- Some assumptions not met? Non-parametric: wilcox.test(). The Wilcoxon signed-rank test is a non-parametric alternative to Student's t-test for the case of two samples.

- All assumptions met? Parametric: t.test()

- t test with independent or paired samples

In many cases, a researcher is interested in gathering information about two populations in order to compare them. As in statistical inference for one population parameter, confidence intervals and tests of significance are useful statistical tools for the difference between two population parameters.
Ho: the two means are the same
H1: the two means differ
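The corresponding calls in R (the samples below are simulated; which variant applies depends on the design and the assumption checks above):

```r
# Hypothetical example: parametric and non-parametric two-sample tests
set.seed(1)
a <- rnorm(20, mean = 10)
b <- rnorm(20, mean = 12)
t.test(a, b, var.equal = TRUE)   # pooled t test (homoscedastic samples)
t.test(a, b)                     # Welch t test (default: unequal variances)
t.test(a, b, paired = TRUE)      # paired t test (same units measured twice)
wilcox.test(a, b)                # non-parametric alternative
```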
The t test

t_calculated = (difference between the means) / (variability within the groups)

The numerator is a measure related to the difference between the means; the denominator is a measure of the variability within the groups.
The t test

[Four cases (1–4) comparing samples A and B: the same difference between the means is more or less convincing depending on the variability within A and within B]
The t test

t_calculated = (difference between the means) / (standard error of the difference)

The more extreme the calculated t, the smaller P will be, and the greater the probability of rejecting H0.

[Sketch of the t distribution with −t_critical and t_critical marking the rejection regions and P the tail probability]
How to choose the right t test, starting from the assumptions

Independence?
- NO: paired t test, t = D̄ / (S_D / √n)
- YES: unpaired t tests:
  - homoscedastic populations (s1² = s2²): pooled t test, t = (x̄1 − x̄2) / (S_p √(1/n1 + 1/n2))
  - heteroscedastic populations (s1² ≠ s2²): Welch t test (a more complex formula; a computer is required)
Independent homoscedastic samples: the t test!

t_calculated = (x̄1 − x̄2) / (S_p √(1/n1 + 1/n2))

with the pooled variance

S_p² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)

The degrees of freedom for t_critical are n1 + n2 − 2.
Independent homoscedastic samples: the t test!

H0: the two means are equal
Ha: the two means are different

Hypothesis test:
1. Compute the pooled variance of the two samples.
2. Compute the value of t_calculated.
3. Choose the significance level (alpha; one or two tails?).
4. Find the value of t_critical (degrees of freedom: n1 + n2 − 2).
5. If |t_calculated| > |t_critical|, reject H0.
6. Conclusion: the means are DIFFERENT!
Paired samples: 2 cases

1. Repeated measures:

Student  Before  After
A        22      23
B        23      24
C        24      24
D        25      25
E        20      21
F        18      18
G        18      18
H        19      20

2. Correlation in space: [ammonia] in water measured upstream and downstream of a textile factory, on rivers A, B and C.
Paired samples: the t test

t_calculated = D̄ / (S_D / √n)

where
D̄ = Σ D_i / n is the mean of the differences,
S_D = √[Σ (D_i − D̄)² / (n − 1)] is the standard deviation of the differences,
n is the number of pairs.

Student  Before  After  D_i
A        22      23     1
B        23      24     1
C        24      24     0
D        25      25     0
E        20      21     1
F        18      18     0
G        18      18     0
H        19      20     1

The degrees of freedom for t_critical are n − 1.
Paired samples: the t test

H0: the two means are equal
Ha: the two means are different

Hypothesis test:
1. Compute the value of t_calculated.
2. Choose the significance level (alpha; one or two tails?).
3. Find the value of t_critical (degrees of freedom: n − 1).
4. If |t_calculated| > |t_critical|, reject H0.
5. Conclusion: the means are DIFFERENT!
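The before/after data above can be analysed directly in R:

```r
# Paired t test on the students A-H measured before and after
before <- c(22, 23, 24, 25, 20, 18, 18, 19)
after  <- c(23, 24, 24, 25, 21, 18, 18, 20)
D <- after - before                     # differences D_i
mean(D) / (sd(D) / sqrt(length(D)))     # t calculated (about 2.65)
t.test(after, before, paired = TRUE)    # same result, df = n - 1 = 7
```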
Non-parametric: the Wilcoxon test

Ranks
[Rank table for the ozone data: A = 3, 4, 4, 3, 2, 3, 1, 3, 5, 2 and B = 5, 5, 6, 7, 4, 4, 3, 5, 6, 5; all 20 values are ranked together and tied values share the mean of their ranks]
Test can be carried out with the wilcox.test() function
U = n1·n2 + n1(n1 + 1)/2 − R1

where n1 and n2 are the numbers of observations and R1 is the sum of the ranks in sample 1.
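Using the ozone values from the rank table above, the test is a single call (with ties, R uses a normal approximation and warns about it):

```r
# Wilcoxon rank-sum / Mann-Whitney U test on the ozone data
A <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
B <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)
wilcox.test(A, B)   # reports the U (W) statistic and its P-value
```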
Correlation
Correlation, (often measured as a correlation coefficient), indicates the strength and direction of a linear relationship between two random variables
Three alternative approaches:
1. Parametric – cor()
2. Non-parametric – cor()
3. Bootstrapping – replicate(), boot()
[Table: 458 sampling units, each with a plant species richness value (x1 … x458) and a bird species richness value (l1 … l458)]
Correlation: causal relationship?
Which is the response variable in a correlation analysis?
[The same table of 458 sampling units: in a correlation analysis, neither variable is the response – the answer is NONE]
Correlation
A correlation of +1 means that there is a perfect positive LINEAR relationship between the variables. A correlation of −1 means that there is a perfect negative LINEAR relationship between the variables. A correlation of 0 means there is no LINEAR relationship between the two variables.
Plot the two variables in a Cartesian space
Correlation
Same correlation coefficient!
r= 0.816
Assumptions:
- two random variables from random populations
- cor() detects ONLY linear relationships
Pearson product-moment correlation coefficient
Correlation coefficient:
Hypothesis testing using the t distribution:
H0: cor = 0
H1: cor ≠ 0
Parametric correlation: when is it significant?
t critical value for d.f. = n − 2
cor = σ_xy / √(σ_x² · σ_y²)

SE_cor = √[(1 − cor²) / (n − 2)]

t = cor / SE_cor
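A sketch of the same computation in R (x and y below are simulated for illustration):

```r
# Hypothetical example: Pearson correlation and its t test
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30)
r <- cor(x, y)
t <- r / sqrt((1 - r^2) / (30 - 2))   # t calculated
qt(0.975, df = 30 - 2)                # t critical for alpha = 0.05, two tails
cor.test(x, y)                        # same test, with a P-value
```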
Rank procedures
Nonparametric correlation
Spearman correlation index
The Kendall tau rank correlation coefficient
cor_spearman = σ_(rank x)(rank y) / √(σ²_rank x · σ²_rank y)   (the Pearson correlation computed on the ranks)

cor_kendall = 4P / [n(n − 1)] − 1

where P is the number of concordant pairs and n is the total number of pairs.

Distribution-free, but less power.
Issues related to correlation

1. Temporal autocorrelation: values in close years are more similar (dependence of the data).

2. Spatial autocorrelation: values in close sites are more similar (dependence of the data). Moran's I or Geary's C are measures of global spatial autocorrelation (from Moran's I = 0, no autocorrelation, to Moran's I = 1).
Three issues related to correlation

2. Temporal autocorrelation: values in close years are more similar (dependence of the data). When working with time series there is likely to be a temporal pattern in the data, e.g. ring-width series. Autoregressive models (not covered!).
Three issues related to correlation
3. Spatial autocorrelation: values in close sites are more similar (dependence of the data). Moran's I or Geary's C (univariate response) are measures of global spatial autocorrelation.

ISSUE: can we explain the spatial autocorrelation with our models? Hint: if you find spatial autocorrelation in your residuals, you should start worrying.

[Maps: raw response vs. residuals after model fitting]
> a <- c(1:5)
> a
[1] 1 2 3 4 5
> replicate(10, sample(a, replace = TRUE))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    2    3    2    1    4    2    1    2    1     3
[2,]    1    5    2    3    5    3    1    1    3     2
[3,]    4    4    4    5    4    4    5    1    1     5
[4,]    4    1    1    3    3    2    3    1    5     2
[5,]    5    5    1    3    5    2    4    1    5     4
Estimate correlation with bootstrap
BOOTSTRAP
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of SEs and CIs of a population parameter
Sampling with replacement
1 original sample
10 bootstrapped samples
Estimate correlation with bootstrap
Why bootstrap?
It doesn't depend on the normal distribution assumption, and it allows the computation of unbiased SEs and CIs.
[Diagram: sample → N samples with replacement → statistic distribution → quantiles]
Estimate correlation with bootstrap
The CIs are asymmetric because the bootstrap distribution reflects the structure of the data rather than a predefined probability distribution. If we repeated the sampling n times, we would find 0.95·n values included in the CIs.
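A minimal bootstrap of the correlation coefficient using replicate() (the data below are simulated; with the boot package the logic is the same):

```r
# Hypothetical example: bootstrap CI for a correlation coefficient
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)
boot.r <- replicate(1000, {
  i <- sample(length(x), replace = TRUE)   # resample rows with replacement
  cor(x[i], y[i])                          # statistic on each bootstrap sample
})
quantile(boot.r, c(0.025, 0.975))          # 95% percentile CI (asymmetric)
```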
Frequency data
Properties of frequency data:

Count data: we count how many times something happened, but we have no way of knowing how often it did not happen (e.g. the number of students coming to the first lesson).

Proportion data: we know the number doing a particular thing, but also the number not doing that thing (e.g. the 'mortality' of the students who attend the first lesson but not the second).
Straightforward linear methods (assuming constant variance, normal errors) are not appropriate for count data for four main reasons:
• The linear model might lead to the prediction of negative counts.
• The variance of the response variable is likely to increase with the mean.
• The errors will not be normally distributed.
• Many zeros are difficult to handle in transformations.
Count data
- Classical tests with contingency tables
- Generalized linear models with Poisson distribution and log-link function (extremely powerful and flexible!!!)
- Pearson's chi-squared (χ²)
- G test
- Fisher's exact test
Count data: contingency tables
              Group 1   Group 2   Row total
Trait 1       a         b         a + b
Trait 2       c         d         c + d
Column total  a + c     b + d     a + b + c + d
We can assess the significance of the differences betweenobserved and expected frequencies in a variety of ways:
H0: frequencies found in rows are independent from frequencies in columns
- Pearson's chi-squared (χ²)

We need a model to define the expected frequencies (E) (many possibilities) – e.g. perfect independence.
Count data: contingency tables
                   Oak   Beech   Row total (Ri)
With ants          22    30      52
Without ants       31    18      49
Column total (Ci)  53    48      101 (G)

E_i = (R_i × C_i) / G        df = (r − 1) × (c − 1)

χ² = Σ_i (O_i − E_i)² / E_i

Compare the calculated χ² with the critical value of the χ² distribution.
- G test

1. We need a model to define the expected frequencies (E) (many possibilities) – e.g. perfect independence.
Count data: contingency tables
G = 2 Σ_i O_i ln(O_i / E_i),   with E_i = (R_i × C_i) / G, where G is the grand total

If expected values are less than 4 or 5:

- Fisher's exact test: fisher.test()
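All three tests can be run on the ants-by-tree table above; the G statistic is computed by hand from the same expected frequencies:

```r
# Chi-squared, G and Fisher's exact tests on the 2x2 ants table
tab <- matrix(c(22, 31, 30, 18), nrow = 2,
              dimnames = list(c("With ants", "Without ants"),
                              c("Oak", "Beech")))
chisq.test(tab)                                    # Pearson's chi-squared
E <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected: Ri x Ci / total
G <- 2 * sum(tab * log(tab / E))                   # G statistic
pchisq(G, df = 1, lower.tail = FALSE)              # P-value, df = (2-1)(2-1)
fisher.test(tab)                                   # exact test for small E
```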
Proportion data
Proportion data have three important properties that affect the way the data should be analyzed:
• the data are strictly bounded (0–1);
• the variance is non-constant (it depends on the mean);
• errors are non-normal.
- Classical tests with probit or arcsine transformation
- Generalized linear models with binomial distribution and logit-link function (extremely powerful and flexible!!!)
Proportion data: traditional approach

Arcsine transformation: p' = arcsin(√p), where p are proportions (0–1); percentages (0–100%) must first be rescaled to proportions. The arcsine transformation takes care of the error distribution.

Probit transformation: the probit transformation takes care of the non-linearity.

Transform the data!
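In R both transformations are one-liners (p here is a made-up vector of proportions):

```r
# Traditional transformations for proportion data (p on the 0-1 scale)
p <- c(0.05, 0.50, 0.95)
asin(sqrt(p))   # arcsine(square-root) transformation
qnorm(p)        # probit transformation (inverse normal CDF)
```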
An important class of problems involves data on proportions, such as:
• studies on percentage mortality (LD50),
• infection rates of diseases,
• proportion responding to clinical treatment (bioassay),
• sex ratios, or in general
• data on proportional response to an experimental treatment.
Proportion data: modern analysis
2 approaches:
1. It is often necessary to transform both response and explanatory variables, or
2. to use Generalized Linear Models (GLMs) with different error distributions.
Statistical modelling
MODEL
Generally speaking, a statistical model is a function of your explanatory variables that explains the variation in your response variable (y).
The best model is the model that produces the least unexplained variation (the minimal residual deviance), subject to the constraint that all the parameters in the model should be statistically significant (many ways to reach this!)
The objective is to determine the values of the parameters (a, b, c and d) in a specific model that lead to the best fit of the model to the data.
deviance = SS = Σ_i (y_i − mean)²
E.g. Y=a+bx1+cx2+ dx3
Y= response variable (performance of the students)
xi= explanatory variables (ability of the teacher, background, age)
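A sketch of fitting such a model in R (the variables and 'true' parameter values below are invented for illustration):

```r
# Hypothetical example: least-squares fit of Y = a + b*x1 + c*x2 + d*x3
set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50); x3 <- rnorm(50)
Y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(50)   # made-up true model (d = 0)
m  <- lm(Y ~ x1 + x2 + x3)
coef(m)       # estimates of a, b, c and d
deviance(m)   # residual sum of squares (unexplained variation)
summary(m)    # are all parameters statistically significant?
```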
Statistical modelling
Getting started with complex statistical modeling
It is essential that you can answer the following questions:
• Which of your variables is the response variable?
• Which are the explanatory variables?
• Are the explanatory variables continuous or categorical, or a mixture of both?
• What kind of response variable do you have: is it a continuous measurement, a count, a proportion, a time at death, or a category?
Statistical modelling
Getting started with complex statistical modeling
The explanatory variables:
(a) All explanatory variables continuous – Regression
(b) All explanatory variables categorical – Analysis of variance (ANOVA)
(c) Explanatory variables both continuous and categorical – Analysis of covariance (ANCOVA)

The response variable:
(a) Continuous – Normal regression, ANOVA or ANCOVA
(b) Proportion – Logistic regression, GLM logit-linear models
(c) Count – GLM log-linear models
(d) Binary – GLM binary logistic analysis
(e) Time at death – Survival analysis
1. Multicollinearity: correlation between predictors in non-orthogonal multiple linear models. Confounding effects are difficult to separate because the variables are not independent.
This makes an important difference to our statistical modelling because, in orthogonal designs, the variation that is attributed to a given factor is constant, and does not depend upon the order in which factors are removed from the model.
In contrast, with non-orthogonal data, we find that the variation attributable to a given factor does depend upon the order in which factors are removed from the model
Statistical modelling: multicollinearity
The order of variable selection makes a huge difference(please wait for session 4!!!)
You want the model to be minimal (parsimony) and adequate (it must describe a significant fraction of the variation in the data). It is very important to understand that there is not just one model.

Model building: estimation of the parameters (slopes and levels of factors). Given the data, and given our choice of model, what values of the parameters of that model make the observed data most likely?
Occam’s Razor
Statistical modelling
Each analysis estimates a MODEL.
• Models should have as few parameters as possible;
• linear models should be preferred to non-linear models;
• experiments relying on few assumptions should be preferred to those relying on many;
• models should be pared down until they are minimal adequate;
• simple explanations should be preferred to complex explanations.
The process of model simplification is an integral part of hypothesis testing in R. In general, a variable is retained in the model only if it causes a significant increase in deviance when it is removed from the current model.
Occam’s Razor
MODEL SIMPLIFICATION
Statistical modelling