Introduction to hypothesis testing (2015)

Review: Logic of Hypothesis Tests

• Usually, we test (attempt to falsify) a null hypothesis (H0):
– it includes all possibilities except the prediction in the hypothesis (HA)

• If the hypothesis (HA) is that an experimental treatment has an effect:
– the null hypothesis is that there is no effect

• Rejecting H0 = evidence in support of the actual hypothesis (HA)

Decision criterion

• How low a probability should make us reject H0?

• If the probability is less than the significance level (critical p-value, α), then reject H0; otherwise do not reject

• Convention sets the significance level: α = 0.05 (5%)

• Arbitrary:
– other significance levels might be valid; the choice is context specific

Three special types of Hypothesis Tests based on the t distribution

1. The mean of a distribution is different from a constant (one sample t test)

2. The mean difference in pairs of observations is different from a constant (paired t test)

3. Two distributions differ (i.e. the means from two sets of observations do not come from the same distribution of means). Two sample t test.

t statistic

General form of t statistic:

$$t = \frac{St - \theta}{SE}$$

where St is the sample statistic, θ is the parameter value specified in H0 and SE is the standard error of the sample statistic.

Specific form for population mean:

$$t = \frac{\bar{y} - \mu}{s / \sqrt{n}}$$

where μ is the value of the mean specified in H0.

Test statistics

• Sampling distributions of t, one for each sample size, when H0 is true
– use degrees of freedom (df = n - 1)

• Area under each sampling (probability) distribution equals one

• Probabilities of obtaining particular ranges of t when H0 is true


Simple null hypothesis

• Test of the hypothesis that the population mean equals a particular value (H0: μ = θ, where θ is the specified value)

• These values may be from literature or other research or legislation

One sample t-test

Populations are fairly stable if the ratio of births to deaths is close to 1.25.

Ho: B/D ratios = 1.25
HA: B/D ratios ≠ 1.25

1) Are the B/D ratios for any of these groups = 1.25?

2) Test using a one sample t-test

[Figure: Ourworld data. Mean(B_To_D), 0 to 4, by Group (Europe, Islamic, NewWorld).]


One sample t-tests

Single population: H0: μ = 0 (or any other pre-specified value: here 1.25)

$$t = \frac{\bar{y} - 1.25}{s_{\bar{y}}} = \frac{\bar{y} - 1.25}{s / \sqrt{n}}$$

df = n - 1

Results

[Figure: Europe B/D ratios (0.9 to 1.7) shown as 1. box plot, 2. normal approximation, 3. histogram (probability 0.05 to 0.25).]

More Results

Test Mean              Islamic     New World
Hypothesized Value     1.25        1.25
Actual Estimate        3.47825     3.95091
DF                     15          20
Std Dev                1.17943     1.50949

t Test                 Islamic     New World
Test Statistic         7.5570      8.1995
Prob > |t|             <.0001*     <.0001*
Prob > t               <.0001*     <.0001*
Prob < t               1.0000      1.0000
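The test statistics in this output can be reproduced from the summary values alone. The sketch below is only an illustration: the function name `one_sample_t` is hypothetical, and the sample sizes are assumed to be DF + 1 (16 and 21).

```python
import math

def one_sample_t(mean, sd, n, mu0):
    """Return (t, df) for a one sample t test from summary statistics."""
    se = sd / math.sqrt(n)          # standard error of the mean
    return (mean - mu0) / se, n - 1

# Values copied from the output above; n is assumed to be DF + 1.
t_first, df_first = one_sample_t(3.47825, 1.17943, 16, 1.25)
t_second, df_second = one_sample_t(3.95091, 1.50949, 21, 1.25)

print(round(t_first, 3), df_first)
print(round(t_second, 3), df_second)
```

Both values agree with the reported test statistics (7.5570 and 8.1995) to the precision shown.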

Even more: a way to present the results

[Figure: Births / deaths (95% CI), 0 to 8, for each group, with the Ho value marked as a reference line.]

Two sample t-test

• Used to compare two populations, each of which has been sampled

• The simplest form of tests among multiple populations

• Example: does the average annual income differ for males and females?
– Ho: income (males) = income (females)

[Figure: Survey2 data by SEX (Female, Male); vertical axis 0 to 25.]

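H0: μ1 = μ2, i.e. μ1 - μ2 = 0

- independent observations

df = (n1 - 1) + (n2 - 1) = n1 + n2 - 2

$$t = \frac{(\bar{y}_1 - \bar{y}_2) - (\mu_1 - \mu_2)}{s_{\bar{y}_1 - \bar{y}_2}} = \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{y}_1 - \bar{y}_2}}$$

where

$$s_{\bar{y}_1 - \bar{y}_2} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

and sp = the pooled standard deviation (more later).

Calculation:

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$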

Logic of the two sample t test

Assume Ho: μ1 = μ2 and HA: μ1 > μ2

1) If Ho is true then the null distribution is known (for a set df): the central t distribution

2) If HA is true, we don’t know the distribution but we do know that it is not the null distribution: it is a non-central t distribution

[Figure: probability of t (0.0 to 0.4) against t (-5 to 9): central t when Ho is true, non-central t when HA is true.]

Assume: Ho: μ1 = μ2, 4 df

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

[Figure: probability distribution of t (-5 to 5) when Ho is true.]

t 0.05, 4 df = 2.13

Any t > 2.13 will lead to incorrect rejection of Ho

1. This means that the difference between ȳ1 and ȳ2 is greater than 2.13 standard errors (pooled)

2. This will happen 5% of the time

Assume: HA: μ1 > μ2, 4 df

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

[Figure: probability distribution of t (-5 to 9) when HA is true.]

t 0.05, 4 df = 2.13

Any t < 2.13 will lead to incorrect rejection of HA (that is, incorrectly failing to reject Ho)

1. This means that the difference between ȳ1 and ȳ2 is less than 2.13 standard errors (pooled)

2. The probability that this will happen is dependent on n and the true difference between μ1 and μ2
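The two error rates described above can be checked by simulation. This is a minimal sketch under stated assumptions, not part of the original material: normally distributed data with unit standard deviation, n = 3 per group (so df = 4), a one-tailed critical value of 2.132, and a hypothetical true difference of 2 standard deviations when HA is true. The function names are illustrative.

```python
import math
import random

def pooled_t(a, b):
    """Two sample t statistic using the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))

def rejection_rate(true_diff, n=3, crit=2.132, reps=20000, seed=1):
    """Fraction of simulated experiments with t > crit (one tailed)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        y1 = [rng.gauss(true_diff, 1.0) for _ in range(n)]
        y2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if pooled_t(y1, y2) > crit:
            hits += 1
    return hits / reps

type1_rate = rejection_rate(0.0)   # Ho true: should be close to 0.05
power = rejection_rate(2.0)        # HA true, assumed difference of 2 SD
type2_rate = 1 - power             # how often Ho is wrongly retained

print(type1_rate, power, type2_rate)
```

Changing `n` or `true_diff` shows how the second error rate depends on sample size and on the true difference, while the first stays near 5% as long as `crit` matches the df.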

Results of example

The unequal variance t-test is based on the Satterthwaite adjustment (of degrees of freedom). It is not recommended unless the variance terms are very different and the sample sizes (n) are very different

What is the conclusion?

[Figure: difference in means output for the income example, with Annual Income (mean ± SE), 0 to 25, by SEX (Female, Male).]

Paired t-tests: the logic

1. Often there is interest in comparisons of observations that can be considered ‘paired’ within a subject or replicate

a) For example:
i. A comparison of activity level before and after eating in the same individual
ii. A comparison of longevity of males vs females, where county is the replicate

2. In such cases there is often benefit in accounting for variance that could be caused by differences among subjects (or replicates)

Paired observations: Paired t-test

H0: μd = 0

where μd is the mean difference between paired observations

$$t = \frac{\bar{d}}{s_{\bar{d}}} = \frac{\bar{d}}{s_d / \sqrt{n_d}}$$

where sd = standard deviation of the sample of differences, and df = nd - 1 where nd is the number of pairs

Paired t-test – example II

• Pisaster comes in two colors along the west coast: purple and orange:

– Ho: density of purple per site = density of orange

– Individual reefs are the replicates of interest

– Looks like a no brainer

Sea star colors all sites two sample

[Figure: Density (0 to 1200) by COLOR (Orange, Purple).]

Results of a 2 sample test

[Figure: Density (95% CI), 0 to 1200, by color of seastars (Orange, Purple), and count histograms of NUMBER for each COLOR.]

Marginally significant. WHY?

GROUP    N   Mean        Standard Deviation
Orange   7   144.71429   101.75086
Purple   7   457.28571   353.47829

Pooled Variance
Difference in Means          : -312.57143
95.00% Confidence Interval   : -615.48591 to -9.65695
t                            : -2.24827
df                           : 12.00000
p-value                      : 0.04413

Consider the variability added at the level of replicate (site)

[Figure: Density (0 to 1200) by Site (Govpt, Boat, Stair, Shell Beach, Hazards, Cayucos, PSN).]

Given that observations are paired at the level of site, can this be accounted for?

[Figure: Density (0 to 1200) by COLOR (Orange, Purple), and Density by SITE (Govpt, Boat, Stair, Shell Beach, Hazards, Cayucos, PSN) with points coded by COLOR.]

Paired test: Details of calculation

Site          Purple   Orange   difference
Govpt         1023     306      717
Boat          585      155      430
Stair         476      143      333
PSN           233      142      91
Cayucos       107      31       76
Hazards       728      222      506
Shell Beach   49       14       35

mean     312.5714
SEdiff   97.25882
t        3.21381

[Figure: Value (0 to 1200) for ORANGE and PURPLE by Index of Case.]

Note slopes: are they the same?
Perhaps rates are a better comparison
1) Convert to rates or
2) Log transform
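To see why pairing matters, both tests can be run on the seven sites listed above. The following is a sketch using only the Python standard library; the helper names `two_sample_t` and `paired_t` are illustrative, not from the original material.

```python
import math
import statistics

# Densities per site, in the order Govpt, Boat, Stair, PSN, Cayucos, Hazards, Shell Beach
purple = [1023, 585, 476, 233, 107, 728, 49]
orange = [306, 155, 143, 142, 31, 222, 14]

def two_sample_t(a, b):
    """Pooled-variance two sample t statistic (ignores the pairing by site)."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * statistics.variance(a)
           + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (statistics.mean(a) - statistics.mean(b)) / se

def paired_t(a, b):
    """Paired t statistic based on the within-site differences."""
    d = [x - y for x, y in zip(a, b)]
    se = statistics.stdev(d) / math.sqrt(len(d))
    return statistics.mean(d) / se

t_two = two_sample_t(orange, purple)   # df = 12
t_paired = paired_t(purple, orange)    # df = 6

print(round(t_two, 3), round(t_paired, 3))
```

The mean difference is the same in both tests; only the standard error changes, because the paired version removes the variation among sites.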

Paired test: Details of calculation: use of Log transformed data

Site          Purple(log)   Orange(log)   difference
Govpt         3.0098756     2.4857214     0.524154
Boat          2.7671559     2.1903317     0.576824
Stair         2.677607      2.155336      0.522271
PSN           2.3673559     2.1522883     0.215068
Cayucos       2.0293838     1.4913617     0.538022
Hazards       2.8621314     2.346353      0.515778
Shell Beach   1.6901961     1.146128      0.544068

mean     0.490884
SEdiff   0.046604
t        10.53299

[Figure: Value (1.0 to 3.5) for LORANGE and LPURPLE by Index of Case.]

Note slopes: much more similar
Indicates that:
1) Purples are more common
• By a constant ratio rather than by a constant amount
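The log-transformed calculation can be sketched in the same way. The table values are consistent with base-10 logarithms, so `math.log10` is used here; the back-transformed `ratio` is an added illustration of the "constant ratio" interpretation rather than a figure from the original.

```python
import math
import statistics

# Raw densities per site (Govpt, Boat, Stair, PSN, Cayucos, Hazards, Shell Beach)
purple = [1023, 585, 476, 233, 107, 728, 49]
orange = [306, 155, 143, 142, 31, 222, 14]

# Difference of logs = log of the purple:orange ratio at each site
log_diff = [math.log10(p) - math.log10(o) for p, o in zip(purple, orange)]

mean_diff = statistics.mean(log_diff)
se_diff = statistics.stdev(log_diff) / math.sqrt(len(log_diff))
t_log = mean_diff / se_diff

# Back-transforming the mean log difference gives a typical purple:orange ratio
ratio = 10 ** mean_diff

print(round(mean_diff, 6), round(se_diff, 6), round(t_log, 3), round(ratio, 2))
```

The ratio comes out at roughly three purple sea stars for every orange one, which is the sense in which purples are more common by a constant ratio.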

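Review: calculations of t for

• One sample test

$$t = \frac{\bar{y} - \mu}{s / \sqrt{n}}$$

• Two sample test

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

• Paired test

$$t = \frac{\bar{d}}{s_d / \sqrt{n_d}}$$

Calculations of Standard Error

1) One sample t-test

$$SE = \frac{s}{\sqrt{n}}, \qquad s^2 = \frac{SS}{n - 1}$$

2) Paired t-test

$$SE = \frac{s_d}{\sqrt{n_d}}, \qquad s_d^2 = \frac{SS_d}{n_d - 1}$$

3) Two sample t-test (calculation based on pooled variance term)

$$SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \qquad s_p^2 = \frac{SS_1 + SS_2}{(n_1 - 1) + (n_2 - 1)} = \frac{SS_1 + SS_2}{n_1 + n_2 - 2}$$

where SS is the sum of squared deviations from the relevant mean.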

Testing statistical null hypotheses

Hypothesis construction

General Hypothesis

• A hypothesis that addresses the general question of interest

Ho: There will be no difference in the density of urchins on vertical vs horizontal surfaces

HA: There will be a difference in the density of urchins on vertical vs horizontal surfaces

Specific hypotheses

• A hypothesis that represents the specific question addressed in your study. The specifics include:
– Location of study
– Time period
– Replication
– Simple description of design

Specific Hypothesis

Ho: There will be no difference in the density of (species name) on vertical vs horizontal surfaces based on 10 replicate quadrats for each treatment randomly placed within site A sampled on date B

HA: There will be a difference in the density of (species name) on vertical vs horizontal surfaces based on 10 replicate quadrats for each treatment randomly placed within site A sampled on date B

Note: much of this detail can be placed in the methods section, which removes the need to state it in the hypotheses. However, also note that the hypotheses above are what is actually being tested

Depiction of hypotheses

Ho: There will be no difference in the density of (species name) on vertical vs horizontal surfaces based on 10 replicate quadrats for each treatment randomly placed within site A sampled on date B

[Figure: axis of Horizontal Density – Vertical Density of Urchins running from - through 0 to +, with Ho at 0. Arrows pointing away from 0 in both directions are labelled "Increasing likelihood that Ho is incorrect".]

Depiction of hypotheses: what should the units be?

[Figure: axis running from - through 0 to +, with Ho at 0.]

• Goal
– To use the same units for all assessments, irrespective of species or system
– To have the same set of probabilities based on those units
– Hence, units should link to an estimate of confidence

• The most common form is t-values, which provide an estimate of the difference in mean values calibrated by an estimate of error in the assessment of the mean values

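T-statistic

$$T = \frac{\bar{X}_1 - \bar{X}_2}{SE}$$

$$SE = \frac{SD}{\sqrt{N}} \quad \text{and} \quad SD = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}$$

Example: observations 30, 40, 45, 37

X̄ = 38.000
SD = 6.272 (Standard deviation)
SE = 3.136 (Standard error)
N = 4 (Number of replicates)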
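The small example can be checked with Python's standard library. `statistics.stdev` uses the N - 1 denominator, the same form as the SD formula here; the variable names are illustrative.

```python
import math
import statistics

obs = [30, 40, 45, 37]                  # the four replicate observations

mean = statistics.mean(obs)             # sample mean
sd = statistics.stdev(obs)              # sample SD, N - 1 denominator
se = sd / math.sqrt(len(obs))           # standard error of the mean

print(mean, round(sd, 3), round(se, 3))
```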



[Figure: the difference axis (- 0 +, Ho at 0) rescaled to units of T = difference / SE, running from -3 to 3.]

T-distribution (central t) is a null probability distribution

• Depicts the probability of obtaining each value of T when the null hypothesis is correct

• One use is to estimate confidence levels

Depiction of hypotheses:

[Figure sequence: the axis of Horizontal Density – Vertical Density of Urchins expressed in units of T (-3 to 3) with Ho at 0, then overlaid with the central t distribution showing the 95% CI, 2.5% in each tail, and the 100% CI.]

Including error yields a confidence interval e.g. 95% confident that the true t value is between….

The importance of directionality of the alternative hypothesis (HA)

Consider:


HA: There will be a difference in the density of urchins on vertical vs horizontal surfaces

vs

Ho1: Urchin density on horizontal surfaces will be greater than or equal to that on vertical surfaces

HA1: Urchins will be more dense on vertical than on horizontal surfaces

[Figure: t distribution (T from -3 to 3) for Ho1, showing the 95% CI with the remaining 5% in a single tail, and the 100% CI.]

One vs two tailed hypotheses

[Figure: one tailed t distribution (95% CI with 5% in a single tail) beside a two tailed t distribution (95% CI with 2.5% in each tail); T from -3 to 3.]

1. Which is more interesting?
2. Which is more informed?
3. Which is more powerful?

Example

• Replication on horizontal and vertical surfaces = 50 (100 total)

• Mean on Horizontal surfaces = 33.54

• Mean on Vertical Surfaces = 45.32

• Pooled standard deviation = 66.49

$$T = \frac{\bar{X}_h - \bar{X}_v}{SE} = \frac{33.54 - 45.32}{66.49 / \sqrt{100}} = -1.79$$

[Figure: one tailed (5% in a single tail) and two tailed (2.5% in each tail) t distributions, T from -3 to 3.]

1. Which is more powerful?

One tailed: T = -1.79, p = 0.04
Two tailed: T = -1.79, p = 0.08
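The two p-values differ only in how many tails are counted. The sketch below integrates the t density numerically with the standard library; it assumes df = 98 (50 + 50 - 2), which the example does not state explicitly, and the function names are illustrative.

```python
import math

def t_pdf(x, df):
    """Density of the central t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T > |t|): 0.5 minus the Simpson's-rule area between 0 and |t|."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

T, df = -1.79, 98            # df = 98 is an assumption (n1 + n2 - 2)
p_one = upper_tail(T, df)    # area in the single predicted tail
p_two = 2 * p_one            # area in both tails

print(round(p_one, 2), round(p_two, 2))
```

With the same T, the one tailed p-value is half the two tailed value, which is why the directional hypothesis falls below 0.05 here while the non-directional one does not.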


One vs two tailed hypotheses: conversion to original units

[Figure: the same one tailed and two tailed distributions with the T axis converted to original units (difference in density): -19.5, -13.3, -6.65, 0, 6.65, 13.3, 19.5.]

One tailed: Difference = -11.78, p = 0.04
Two tailed: Difference = -11.78, p = 0.08

This is the difference between 1 and 2 tailed hypotheses. Make sure you know which you are dealing with

• Always strive for one tailed hypotheses

• Is there a directional prediction (e.g. > or separately <)?
– One tailed

• If not
– Two tailed

Assumptions of t test

• The t test is a parametric test

• The t statistic only follows the t distribution if:
– the variable has a normal distribution (normality assumption)
– the two groups have equal population variances (homogeneity of variance assumption)
– observations are independent or specifically paired (independence assumption)

Normality assumption

• Data in each group are normally distributed

• Checks:
– Frequency distributions (be careful)
– Boxplots
– Probability plots
– Formal tests for normality

• Solutions:
– Transformations
– Don’t worry, run it anyway (just kidding, but not entirely)

Homogeneity of variance

• Population variances equal in 2 groups

• Checks:
– subjective comparison of sample variances
– boxplots
– F-ratio test of H0: σ1² = σ2²

• Solutions:
– Transformations
– Don’t worry, run it anyway (just kidding again, but again not entirely)

F-test on variances

• H0: σ1² = σ2²

• F statistic (F-ratio) = ratio of 2 sample variances
– F = s1² / s2²
– Reject H0 if F is sufficiently < or > 1

• If H0 is true, F-ratio follows F distribution

• Usual logic of statistical test
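As an illustration of the F-ratio, the sketch below computes it for the orange and purple sea star densities used in the paired t-test example. Applying the F-ratio to those data is an addition for illustration, not something done in the original, and only the ratio is computed (no p-value).

```python
import statistics

# Sea star densities per site, from the paired t-test example
purple = [1023, 585, 476, 233, 107, 728, 49]
orange = [306, 155, 143, 142, 31, 222, 14]

var_purple = statistics.variance(purple)   # sample variance, n - 1 denominator
var_orange = statistics.variance(orange)

# Larger variance on top, so F >= 1
f_ratio = max(var_purple, var_orange) / min(var_purple, var_orange)

print(round(f_ratio, 2))
```

A ratio this far from 1 is the kind of result that would prompt a transformation before running the t test.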

Boxplot

[Figure: boxplot of LENGTH (50 to 350), annotated with the smallest value, the largest value, the median, and the segments that each hold 25% of values.]

[Figure: histogram of limpet numbers per quadrat (0 to 90) against Count (0 to 70), with four example boxplots: 1. IDEAL, 2. SKEWED, 3. OUTLIERS, 4. UNEQUAL VARIANCES.]

Use of transformations to control departures from normality and homogeneity of variances assumptions

Ourworld: Variance

Group            Pop_1990   Lpop1990
Europe           441        0.17
Islamic          1378       0.30
Newworld         1042       0.34

Greatest ratio   3.12 : 1   2 : 1

[Figure: boxplots of POP_1990 (0 to 200) and LPOP1990 (-1 to 3) by GROUP (Europe, Islamic, NewWorld), and probability plots of Pop_1990 on linear and log axes.]

Nonparametric tests

• Usually based on ranks of the data

• H0: samples come from populations with identical distributions
– equal means or medians

• Don’t assume particular underlying distribution of data
– normal distributions not necessary

• Equal variances and independence still required

• Typically much less powerful than parametric tests

Mann-Whitney-Wilcoxon test

• Calculates sum of ranks in 2 samples
– should be similar if H0 is true

• Compares rank sum to sampling distribution of rank sums
– distribution of rank sums when H0 true

• Equivalent to t test on data transformed to ranks
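The rank-sum step can be sketched as follows, again using the sea star densities from the paired t-test example purely as illustrative data (this treats the two colors as independent samples, which ignores the pairing by site). The function name `rank_sums` is hypothetical; tied values receive the average of their ranks.

```python
def rank_sums(a, b):
    """Return the sum of pooled ranks for each sample (ties get average ranks)."""
    pooled = sorted(a + b)
    rank_of = {}
    for v in set(pooled):
        positions = [i + 1 for i, x in enumerate(pooled) if x == v]
        rank_of[v] = sum(positions) / len(positions)
    return sum(rank_of[x] for x in a), sum(rank_of[x] for x in b)

purple = [1023, 585, 476, 233, 107, 728, 49]
orange = [306, 155, 143, 142, 31, 222, 14]

r_orange, r_purple = rank_sums(orange, purple)

# Mann-Whitney U for the orange sample: rank sum minus its minimum possible value
u_orange = r_orange - len(orange) * (len(orange) + 1) / 2

print(r_orange, r_purple, u_orange)
```

If H0 were true the two rank sums would be similar; the test asks how unusual the observed imbalance is under the sampling distribution of rank sums.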

Additional slides

A brief digression to re-sampling theory

       Number inside   Number outside
       3               10
       5               7
       2               9
       8               12
       7               8

Mean   5               9.2

Traditional evaluation would probably involve a t test: another approach is re-sampling.

Treatment Number

Inside 3

Inside 5

Inside 2

Inside 8

Inside 7

Outside 10

Outside 7

Outside 9

Outside 12

Outside 8

1) Assume both treatments come from the same distribution

2) Resample groups of 5 observations, with replacement, but irrespective of treatment

Resampling

3) Calculate mean for each group (e.g. 7.6 for one resampled group)

4) Repeat many times

5) Calculate differences between pairs of means (remember the null hypothesis is that there is no effect of treatment). This generates a distribution of differences.

Resampling

Mean 1 Mean 2 Difference

8 7.8 0.2

5.6 8.2 ‐2.6

6 9 ‐3

8 5 3

6 6 0

7 8 ‐1

6 6.8 ‐0.8

8 7.2 0.8

8 6.6 1.4

7 8.4 ‐1.4

6 5.4 0.6

7 6.4 0.6

6.4 6.8 ‐0.4

5 3.4 1.6

6.8 4.8 2

6.4 7.2 ‐0.8

7.2 8 ‐0.8

6.4 4.6 1.8

8.4 6 2.4

7.4 6.6 0.8

5.6 8.4 ‐2.8

8.2 6.2 2

7.8 8.4 ‐0.6

8.6 6.6 2

6 10.2 ‐4.2

6.8 5.6 1.2

6.4 7.8 ‐1.4

7.2 4.8 2.4

6.6 7.2 ‐0.6

7 5.2 1.8

6.6 9.8 ‐3.2

8.4 7.8 0.6

Distribution of differences

[Figure: histogram of the Difference in Means (-10 to 10) for 1000 observations; axes show Proportion per Bar (0.0 to 0.2) and Number of Observations (0 to 250).]

OK, now what?

Compare distribution of differences to real difference

       Number inside   Number outside
       3               10
       5               7
       2               9
       8               12
       7               8

Mean   5               9.2

Real difference = 4.2

Estimate likelihood that real difference comes from two similar distributions

Mean 1   Mean 2   Difference   Proportion of differences less than current
10.2     3.6      6.6          1
10       3.8      6.2          0.999
10.2     4.4      5.8          0.998
9.2      3.6      5.6          0.997
9.8      4.8      5            0.996
8.8      4.2      4.6          0.995
9.6      5.2      4.4          0.994
9.8      5.6      4.2          0.993
9.8      5.8      4            0.992
9.4      5.4      4            0.991

And on through 1000 differences

Likelihood is 0.007 that distributions are the same

What are constraints of this sort of approach?
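The resampling steps can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the procedure used to produce the table: it uses 10000 resamples rather than 1000, a fixed seed, and a one-sided comparison, so the resulting proportion will not match 0.007 exactly.

```python
import random

inside = [3, 5, 2, 8, 7]
outside = [10, 7, 9, 12, 8]
pooled = inside + outside               # step 1: assume one common distribution

# Work with integer sums to avoid floating point ties; difference = sum / 5
observed_sum = sum(outside) - sum(inside)
observed = observed_sum / 5             # real difference in means

rng = random.Random(0)                  # fixed seed, an arbitrary choice
reps = 10000
diffs = []
extreme = 0
for _ in range(reps):
    g1 = rng.choices(pooled, k=5)       # step 2: resample with replacement
    g2 = rng.choices(pooled, k=5)
    d_sum = sum(g1) - sum(g2)           # steps 3 and 5: difference of group means
    diffs.append(d_sum / 5)
    if d_sum >= observed_sum:
        extreme += 1

p_resample = extreme / reps             # proportion at least as large as observed

print(observed, p_resample)
```

Because the resamples are random, the proportion changes a little from run to run and with the number of resamples, which is one reason a resampling p-value need not equal the t test p-value.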

T-test vs resampling

Test         P-value
Resampling   0.007
T-test       0.0093

Why the difference?

Additional examples

Worked example

• Fecundity of predatory gastropods:
– sample of 37 and 42 egg capsules of Lepsiella from littorinid zone and mussel zone respectively

• Counted number of eggs per capsule

• Null hypothesis:
– no difference between zones in mean number of eggs per capsule

• Ward & Quinn (1988), qk2002 Box 3.1

• Specify H0 and choose test statistic:

H0: μM = μL, i.e. population mean number of eggs per capsule from both zones are equal

The t statistic is the appropriate test statistic for comparing 2 population means

• Specify a priori significance (probability) level (α):

By convention, use α = 0.05 (5%).

• Collect data, check assumptions, calculate test statistic from sample data:

              Mean    SD     n
Littorinid:   8.70    2.03   37
Mussel:       11.36   2.33   42

t = -5.39, df = 77
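The test statistic can be approximately reproduced from the summary values with the pooled-variance formula. This sketch assumes a littorinid SD of 2.03, the value that is consistent with the reported t = -5.39 at df = 77; the function name is illustrative.

```python
import math

def pooled_t_from_summary(m1, s1, n1, m2, s2, n2):
    """Return (t, df) for a pooled-variance two sample t test from summaries."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df   # pooled variance
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

# Littorinid zone first, mussel zone second; littorinid SD of 2.03 is assumed
t_eggs, df_eggs = pooled_t_from_summary(8.70, 2.03, 37, 11.36, 2.33, 42)

print(round(t_eggs, 2), df_eggs)
```

The result is close to, but not exactly, the reported value because the means and standard deviations are rounded to two decimals.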

• Compare value of t statistic to its sampling distribution, the probability distribution of the statistic (for specific df) when H0 is true:
– what is the probability of obtaining a t value of 5.39 or greater (in absolute value) from a t distribution with 77 df?
– what is the probability of taking samples with the observed or greater mean difference from 2 populations with the same means?

• Probability (from JMP)

P < 0.001

• Look up in t table

P < 0.05

• If the probability of obtaining this value or larger is less than α, conclude H0 is “unlikely” to be true and reject it:
– statistically significant result

• Our probability (<0.001) is less than 0.05 so reject H0:
– statistically significant result.

• If the probability of obtaining this value or larger is greater than α, conclude that H0 is “likely” to be true and do not reject it:
– statistically non-significant result

Presenting results of t test

• Methods:
– An independent t test was used to compare the mean number of eggs per capsule from the two zones. Assumptions were checked with….

• Results:
– The mean number of eggs per capsule from the mussel zone was significantly greater than that from the littorinid zone (t = 5.39, df = 77, P < 0.001; see Fig. 2).