17 Tests in variance analysis

Analysis of variance (ANOVA) in its simplest form analyzes if the mean of a Gaussian random variable differs in a number of groups. Often the factor which determines each group is given by applying different treatments to subjects, for example, in designed experiments in technical applications or in clinical studies. The problem can thereby be seen as comparing group means, which extends the t-test to more than two groups. The underlying statistical model may also be presented as a special case of a linear model. In Section 17.1 we handle the one- and two-way cases of ANOVA. The two-way case extends the treated problem to groups characterized by two factors. In this case it is also of interest if the two factors influence each other in their effect on the measured variable, and hence show an interaction effect. One of the crucial assumptions of an ANOVA is the homogeneity of variance within all groups. Section 17.2 deals with tests to check this assumption.

17.1 Analysis of variance

17.1.1 One-way ANOVA

Description: Tests if the mean of a Gaussian random variable is the same in $I$ groups.

Assumptions: • Let $Y_{i1}, \dots, Y_{in_i}$, $i \in \{1, \dots, I\}$, be $I$ independent samples of independent Gaussian random variables with the same variance but possibly different group means.

• The sample sizes of the $I$ samples are $n_1, \dots, n_I$ with $n = \sum_{i=1}^{I} n_i$.

• The random variables $Y_{ij}$ can be modeled as $Y_{ij} = \mu_i + e_{ij}$ with $e_{ij} \sim N(0, \sigma^2)$, $\mu_i \in \mathbb{R}$.

Hypotheses: $H_0: \mu_1 = \dots = \mu_I$ vs $H_1: \mu_i \neq \mu_k$ for at least one $i \neq k$.

Statistical Hypothesis Testing with SAS and R, First Edition. Dirk Taeger and Sonja Kuhnt. © 2014 John Wiley & Sons, Ltd. Published 2014 by John Wiley & Sons, Ltd.


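Test statistic:

$$F = \frac{\sum_{i=1}^{I} n_i \left(\bar{Y}_{i+} - \bar{Y}_{++}\right)^2 / (I - 1)}{\sum_{i=1}^{I} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_{i+}\right)^2 / (n - I)}$$

with $\bar{Y}_{i+} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}$ and $\bar{Y}_{++} = \frac{1}{n} \sum_{i=1}^{I} \sum_{j=1}^{n_i} Y_{ij}$

Test decision: Reject $H_0$ if for the observed value $F_0$ of $F$: $F_0 > f_{1-\alpha;\,I-1,\,n-I}$

p-values: $p = 1 - P(F \leq F_0)$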

Annotations: • The test statistic $F$ is $F_{I-1,\,n-I}$-distributed (Rencher 1998, chapter 4).

• $f_{1-\alpha;\,I-1,\,n-I}$ is the $(1 - \alpha)$-quantile of the F-distribution with $I - 1$ and $n - I$ degrees of freedom.

• The numerator of the test statistic is also called MST (mean sum of squares for treatment) and the denominator MSE (mean sum of squares of errors).

• Note that we have presented the one-way model and test for the more general case of an unbalanced design where the sample sizes in the different groups may vary. A balanced design is characterized by an equal number of observations in each group.
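The test statistic can be traced step by step in R. The following is only a minimal sketch: the vectors y and g hold hypothetical, unbalanced data invented for illustration (they are not the data of Table A.12), and all variable names are assumptions. The last lines compare the hand calculation with the F value that R reports for the corresponding linear model.

```r
# Hypothetical unbalanced data: three groups with 4, 5 and 3 observations
y <- c(5.1, 4.8, 5.6, 5.0,
       5.9, 6.1, 5.7, 6.3, 5.8,
       4.9, 5.2, 5.4)
g <- factor(rep(c("A", "B", "C"), times = c(4, 5, 3)))

n        <- length(y)              # total sample size n
n_groups <- nlevels(g)             # number of groups I
n_i      <- tapply(y, g, length)   # group sizes n_i
m_i      <- tapply(y, g, mean)     # group means
m        <- mean(y)                # overall mean

# Numerator (MST) and denominator (MSE) of the test statistic
mst <- sum(n_i * (m_i - m)^2) / (n_groups - 1)
mse <- sum((y - ave(y, g))^2) / (n - n_groups)

F0 <- mst / mse
p  <- pf(F0, n_groups - 1, n - n_groups, lower.tail = FALSE)

# Reference values from R's own ANOVA table
tab <- anova(lm(y ~ g))
all.equal(F0, tab[["F value"]][1])
all.equal(p, tab[["Pr(>F)"]][1])
```

Here ave(y, g) returns, for each observation, the mean of its own group, which keeps the denominator valid for unequal group sizes.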

Example: To test if the means of the harvest in kilograms of tomatoes in three different greenhouses differ. The dataset contains observations from five fields in each greenhouse (dataset in Table A.12).

SAS code

proc anova data = crop;
class house;
model kg = house;
run;
quit;

SAS output

Source    DF      Anova SS   Mean Square   F Value   Pr > F
house      2    0.16329333    0.08164667      0.33   0.7262

Remarks:

• The SAS procedure proc anova is the standard procedure for the analysis of variance with a balanced design as given in this example. For an unbalanced design the procedure proc glm should be used (see below).

• By using the class statement, SAS treats the variable house as a categorical variable.

• The code model dependent variable = independent variable defines the model.

• The quit; statement is used to terminate the procedure; proc anova is an interactive procedure and SAS then knows not to expect any further input.

• The program code for proc glm is similar:

proc glm data = crop;
class house;
model kg = house;
run;
quit;

R code

summary(aov(crop$kg ~ factor(crop$house)))

R output

                   Df Sum Sq Mean Sq F value Pr(>F)
factor(crop$house)  2 0.1633 0.08165   0.329  0.726
Residuals          12 2.9815 0.24846

Remarks:

• The function aov() performs an analysis of variance in R. The response variable is placed on the left-hand side of the ~ symbol and the independent variables which define the groups on the right-hand side.

• We use the R function factor() to tell R that house is a categorical variable.

• The summary function gets R to return the sum of squares, degrees of freedom, p-values, etc.
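The F value and p-value in the output can be reproduced from the reported sums of squares and degrees of freedom alone. The sketch below copies those four numbers from the R output of this example; the variable names and the choice of a 5% significance level are assumptions made for illustration.

```r
# Sums of squares and degrees of freedom copied from the output above
ss_house <- 0.1633; df_house <- 2
ss_resid <- 2.9815; df_resid <- 12

# F0 = MST / MSE
F0 <- (ss_house / df_house) / (ss_resid / df_resid)

# p-value: p = 1 - P(F <= F0)
p <- pf(F0, df_house, df_resid, lower.tail = FALSE)

# Critical value f_{1 - alpha; I - 1, n - I} for alpha = 0.05 (assumed level)
crit <- qf(0.95, df_house, df_resid)

round(F0, 3)   # 0.329, as in the output
round(p, 3)    # 0.726, as in the output
F0 > crit      # FALSE: H0 is not rejected at the 5% level
```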

17.1.2 Two-way ANOVA

Description: Tests if the mean of a Gaussian random variable is the same in groups defined by two factors of interest.

Assumptions: • Let $Y_{ijk}$, $i = 1, \dots, I$, $j = 1, \dots, J$, $k = 1, \dots, K$ describe a sample of size $n = IJK$ of independent Gaussian random variables.

• In each of the $IJ$ groups defined by the two factors, we have an equal number of $K$ observations (balanced design).

• Each of the variables $Y_{ijk}$ can be modeled as $Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}$ with $e_{ijk} \sim N(0, \sigma^2)$, where $\mu$ is the overall mean and $\alpha_i$ and $\beta_j$ are the deviations from it for the first and the second factor and $(\alpha\beta)_{ij}$ describes the interaction between them.

Hypotheses: (A) $H_0: (\alpha\beta)_{11} = \dots = (\alpha\beta)_{IJ} = 0$ vs $H_1: (\alpha\beta)_{ij} \neq 0$ for at least one pair $(i, j)$

(B) $H_0: \alpha_1 = \dots = \alpha_I = 0$ vs $H_1: \alpha_i \neq 0$ for at least one $\alpha_i$

(C) $H_0: \beta_1 = \dots = \beta_J = 0$ vs $H_1: \beta_j \neq 0$ for at least one $\beta_j$

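Test statistic:

(A) $$F_A = \frac{K \sum_{i=1}^{I} \sum_{j=1}^{J} \left(\bar{Y}_{ij+} - \bar{Y}_{i++} - \bar{Y}_{+j+} + \bar{Y}_{+++}\right)^2 / \big((I - 1)(J - 1)\big)}{\sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \left(Y_{ijk} - \bar{Y}_{ij+}\right)^2 / \big(IJ(K - 1)\big)}$$

(B) $$F_B = \frac{KJ \sum_{i=1}^{I} \left(\bar{Y}_{i++} - \bar{Y}_{+++}\right)^2 / (I - 1)}{\sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \left(Y_{ijk} - \bar{Y}_{ij+}\right)^2 / \big(IJ(K - 1)\big)}$$

(C) $$F_C = \frac{KI \sum_{j=1}^{J} \left(\bar{Y}_{+j+} - \bar{Y}_{+++}\right)^2 / (J - 1)}{\sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \left(Y_{ijk} - \bar{Y}_{ij+}\right)^2 / \big(IJ(K - 1)\big)}$$

with

$$\bar{Y}_{ij+} = \frac{1}{K} \sum_{k=1}^{K} Y_{ijk}, \qquad \bar{Y}_{+++} = \frac{1}{IJK} \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} Y_{ijk},$$

$$\bar{Y}_{i++} = \frac{1}{JK} \sum_{j=1}^{J} \sum_{k=1}^{K} Y_{ijk}, \qquad \bar{Y}_{+j+} = \frac{1}{IK} \sum_{i=1}^{I} \sum_{k=1}^{K} Y_{ijk}$$

Test decision: Reject $H_0$ if for the observed value $F_0$ of $F_A$, $F_B$ or $F_C$
(A) $F_0 > f_{1-\alpha;\,(I-1)(J-1),\,IJ(K-1)}$
(B) $F_0 > f_{1-\alpha;\,I-1,\,IJ(K-1)}$
(C) $F_0 > f_{1-\alpha;\,J-1,\,IJ(K-1)}$

p-values: $p = 1 - P(F \leq F_0)$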

Annotations: • The test statistic $F$ is F-distributed with $(I - 1)(J - 1)$ (A), $I - 1$ (B) or $J - 1$ (C) degrees of freedom for the numerator and $IJ(K - 1)$ degrees of freedom for the denominator (Montgomery and Runger 2007, chapter 14).

• $f_{1-\alpha;\,r,\,s}$ is the $(1 - \alpha)$-quantile of the F-distribution with $r$ and $s$ degrees of freedom.

• Hypothesis (A) tests if there is an interaction between the two factors. Hypotheses (B) and (C) test the main effects of the two factors.
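The three test statistics can be checked numerically. The following sketch uses a small hypothetical balanced design with I = 2, J = 3 and K = 2 observations per cell; the data values, the data frame d and all variable names are invented for illustration. Because every mean is expanded to one value per observation, summing over all rows already supplies the factors K, KJ and KI of the numerators.

```r
# Hypothetical balanced data: I = 2, J = 3, K = 2 observations per cell
d <- expand.grid(obs = 1:2, b = factor(1:3), a = factor(1:2))
d$y <- c(4.1, 4.5, 5.0, 5.4, 6.2, 5.8,
         4.9, 5.3, 5.1, 5.6, 7.0, 7.4)

n_a <- nlevels(d$a); n_b <- nlevels(d$b); n_rep <- 2   # I, J, K

m_ij <- ave(d$y, d$a, d$b)   # cell means, one value per observation
m_i  <- ave(d$y, d$a)        # means of the levels of the first factor
m_j  <- ave(d$y, d$b)        # means of the levels of the second factor
m    <- mean(d$y)            # overall mean

# Common denominator with IJ(K - 1) degrees of freedom
df_err <- n_a * n_b * (n_rep - 1)
mse    <- sum((d$y - m_ij)^2) / df_err

FA <- (sum((m_ij - m_i - m_j + m)^2) / ((n_a - 1) * (n_b - 1))) / mse
FB <- (sum((m_i - m)^2) / (n_a - 1)) / mse
FC <- (sum((m_j - m)^2) / (n_b - 1)) / mse

p <- pf(c(FA, FB, FC),
        c((n_a - 1) * (n_b - 1), n_a - 1, n_b - 1),
        df_err, lower.tail = FALSE)

# Reference: rows of the ANOVA table are a (B), b (C), a:b (A)
tab <- anova(lm(y ~ a * b, data = d))
all.equal(c(FB, FC, FA), tab[["F value"]][1:3])
```

The agreement with anova() relies on the balanced design; for unequal cell sizes the sequential sums of squares reported by R would in general no longer match these formulas.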

Example: To test if the means of the harvest in kilograms of tomatoes in three different greenhouses and using five different fertilizers differ. The dataset contains observations from five fields with each fertilizer in each greenhouse (dataset in Table A.12).

SAS code

proc anova data = crop;
class house fertilizer;
model kg = house fertilizer;
run;
quit;

SAS output

The ANOVA Procedure

Dependent Variable: kg

Source        DF      Anova SS   Mean Square   F Value   Pr > F
house          2    0.16329333    0.08164667      0.50   0.6268
fertilizer     4    1.66337333    0.41584333      2.52   0.1235

Remarks:

• The SAS procedure proc anova is the standard procedure for an ANOVA with a balanced design. For an unbalanced design the procedure proc glm should be used.

• By using the class statement, SAS treats the variables house and fertilizer as categorical variables.

• The code model dependent variable = independent variables defines the model. To incorporate an interaction term a star is used, for example, variable1*variable2.

• The quit; statement is used to terminate the procedure; proc anova is an interactive procedure and SAS then knows not to expect any further input.

• The program code for proc glm is similar:

proc glm data = crop;
class house fertilizer;
model kg = house fertilizer;
run;
quit;

R code

kg <- crop$kg
house <- crop$house
fertilizer <- crop$fertilizer

summary(aov(kg ~ factor(house) + factor(fertilizer)))

R output

                   Df Sum Sq Mean Sq F value Pr(>F)
factor(house)       2 0.1633  0.0816   0.496  0.627
factor(fertilizer)  4 1.6634  0.4158   2.524  0.123
Residuals           8 1.3181  0.1648

Remarks:

• The function aov() performs an ANOVA in R. The response variable is placed on the left-hand side of the ~ symbol and the independent variables which define the groups on the right-hand side separated by a plus (+). To incorporate an interaction term a star is used, for example, variable1*variable2.

• We use the R function factor() to tell R that house and fertilizer are categorical variables.

• The summary function gets R to return the sum of squares, degrees of freedom, p-values, etc.
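What the star does in an R model formula can be inspected without any data. In the sketch below, y, a and b are placeholder names that are never evaluated; terms() only parses the formula and lists the terms it contains.

```r
# The star expands to both main effects plus their interaction
labels_star <- attr(terms(y ~ a * b), "term.labels")
labels_long <- attr(terms(y ~ a + b + a:b), "term.labels")

labels_star                          # "a" "b" "a:b"
identical(labels_star, labels_long)  # TRUE
```

In the tomato example the residuals have 8 = (3 − 1)(5 − 1) degrees of freedom, which is exactly what an interaction of house and fertilizer would require. Adding the interaction there would leave no degrees of freedom for the error term, so hypothesis (A) can only be tested when there is more than one observation per cell (K > 1).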

17.2 Tests for homogeneity of variances

17.2.1 Bartlett test

Description: Tests if the variances of $k$ Gaussian populations differ from each other.

Assumptions: • Data are measured on an interval or ratio scale.

• Data are randomly sampled from $k$ independent Gaussian distributions.

• The $k$ random variables $X_1, \dots, X_k$ from where the samples are drawn have variances $\sigma_1^2, \dots, \sigma_k^2$.

• Further $(X_{j1}, \dots, X_{jn_j})$ is the $j$th sample with $n_j$ observations, $j \in \{1, \dots, k\}$.

Hypotheses: $H_0: \sigma_1^2 = \dots = \sigma_k^2$ vs $H_1: \sigma_l^2 \neq \sigma_j^2$ for at least one $l \neq j$

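Test statistic:

$$X^2 = \frac{r \ln\left(\sum_{j=1}^{k} \frac{n_j - 1}{r}\, s_j^2\right) - \sum_{j=1}^{k} (n_j - 1) \ln\left(s_j^2\right)}{1 + \frac{1}{3(k - 1)} \left(\left[\sum_{j=1}^{k} \frac{1}{n_j - 1}\right] - \frac{1}{r}\right)}$$

with $s_j^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} \left(X_{ji} - \bar{X}_{j+}\right)^2$, $\bar{X}_{j+} = \frac{1}{n_j} \sum_{i=1}^{n_j} X_{ji}$ and $r = \sum_{j=1}^{k} (n_j - 1)$

Test decision: Reject $H_0$ if for the observed value $X_0^2$ of $X^2$: $X_0^2 > \chi^2_{1-\alpha;\,k-1}$

p-values: $p = 1 - P(X^2 \leq X_0^2)$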

Annotations: • The test statistic $X^2$ is $\chi^2_{k-1}$-distributed (Glaser 1976).

• $\chi^2_{1-\alpha;\,k-1}$ is the $(1 - \alpha)$-quantile of the $\chi^2$-distribution with $k - 1$ degrees of freedom.

• This test was introduced by Maurice Bartlett (1937).

• The Bartlett test is very sensitive to the violation of the Gaussian assumption. If the samples are not Gaussian distributed an alternative is Levene's test (Test 17.2.2).
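The test statistic can be evaluated term by term in R and compared with the built-in function bartlett.test(). This is a sketch only: the data in x and g are hypothetical values invented for illustration, and the variable names are assumptions.

```r
# Hypothetical data: k = 3 samples with 4, 5 and 3 observations
x <- c(12.1, 11.4, 13.0, 12.6,
       10.2, 14.8,  9.5, 13.9, 12.0,
       11.9, 12.2, 12.4)
g <- factor(rep(1:3, times = c(4, 5, 3)))

k   <- nlevels(g)
n_j <- tapply(x, g, length)   # sample sizes n_j
s2  <- tapply(x, g, var)      # sample variances s_j^2
r   <- sum(n_j - 1)

# Numerator and denominator of X^2 as in the formula above
num <- r * log(sum((n_j - 1) / r * s2)) - sum((n_j - 1) * log(s2))
den <- 1 + (sum(1 / (n_j - 1)) - 1 / r) / (3 * (k - 1))

X2 <- num / den
p  <- pchisq(X2, df = k - 1, lower.tail = FALSE)

# Reference values from the built-in test
bt <- bartlett.test(x, g)
all.equal(X2, unname(bt$statistic))
all.equal(p, unname(bt$p.value))
```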

Example: To test if the variances of the harvest in kilograms of tomatoes in three different greenhouses are the same (dataset in Table A.12).

SAS code

proc glm data = crop;
class house;
model kg = house;
means house / hovtest=BARTLETT;
run;
quit;

SAS output

The GLM Procedure

Bartlett's Test for Homogeneity of kg Variance

Source    DF   Chi-Square   Pr > ChiSq
house      2       2.1346       0.3439

Remarks:

• The SAS procedure proc glm provides the Bartlett test.

• The first lines of code are enabling an ANOVA (see Test 17.1.1).

• The code means house / hovtest=BARTLETT lets SAS conduct the Bartlett test.

R code

bartlett.test(crop$kg ~ crop$house)

R output

Bartlett test of homogeneity of variances

data: crop$kg by crop$house
Bartlett's K-squared = 2.1346, df = 2, p-value = 0.3439

Remarks:

• The function bartlett.test() conducts the Bartlett test.

• The analysis variable is coded on the left-hand side of the ~ and the group variable on the right-hand side.
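The reported p-value follows directly from the test statistic and its degrees of freedom. The sketch below copies the statistic from the output of this example; the variable names and the 5% significance level are assumptions.

```r
X2_obs <- 2.1346   # Bartlett's K-squared copied from the output above
k      <- 3        # number of greenhouses

# p-value: p = 1 - P(X^2 <= X2_obs) under the chi-square distribution
p <- pchisq(X2_obs, df = k - 1, lower.tail = FALSE)

# Critical value chi^2_{1 - alpha; k - 1} for alpha = 0.05 (assumed level)
crit <- qchisq(0.95, df = k - 1)

round(p, 4)     # 0.3439, as in the output
X2_obs > crit   # FALSE: H0 is not rejected at the 5% level
```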

17.2.2 Levene test

Description: Tests if the variances of $k$ populations differ from each other.

Assumptions: • Data are measured on an interval or ratio scale.

• Data are randomly sampled from $k$ independent random variables $X_1, \dots, X_k$ with variances $\sigma_1^2, \dots, \sigma_k^2$.

• Further $(X_{j1}, \dots, X_{jn_j})$ is the $j$th sample with $n_j$ observations, $j \in \{1, \dots, k\}$.

Hypotheses: $H_0: \sigma_1^2 = \dots = \sigma_k^2$ vs $H_1: \sigma_l^2 \neq \sigma_j^2$ for at least one $l \neq j$.

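Test statistic:

$$L = \frac{\left(\sum_{j=1}^{k} (n_j - 1)\right) \sum_{j=1}^{k} n_j \left(\bar{Z}_{j+} - \bar{Z}_{++}\right)^2}{(k - 1) \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(Z_{ji} - \bar{Z}_{j+}\right)^2}$$

with $Z_{ji} = |X_{ji} - \bar{X}_{j+}|$, $\bar{X}_{j+} = \frac{1}{n_j} \sum_{i=1}^{n_j} X_{ji}$, $\bar{Z}_{j+} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ji}$ and $\bar{Z}_{++} = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} Z_{ji}$, where $n = \sum_{j=1}^{k} n_j$

Test decision: Reject $H_0$ if for the observed value $L_0$ of $L$: $L_0 > f_{1-\alpha;\,k-1,\,\sum_{j=1}^{k}(n_j - 1)}$

p-values: $p = 1 - P(F \leq L_0)$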

Annotations: • The test statistic $L$ is $F_{k-1,\,\sum_{j=1}^{k}(n_j - 1)}$-distributed.

• $f_{1-\alpha;\,k-1,\,\sum_{j=1}^{k}(n_j - 1)}$ is the $(1 - \alpha)$-quantile of the F-distribution with $k - 1$ and $\sum_{j=1}^{k}(n_j - 1)$ degrees of freedom.

• This test was introduced by Howard Levene (1960). In 1974 Morton Brown and Alan Forsythe proposed the use of the median or trimmed mean instead of the mean for calculating the $Z_{ji}$ (Brown and Forsythe 1974). This test is called the Brown–Forsythe test.

• This test does not need the assumption of underlying Gaussian distributions and should be used if that assumption is doubtful. If the data are Gaussian distributed Bartlett's test can be used (see Test 17.2.1).

Example: To test if the variances of the harvest in kilograms of tomatoes in three different greenhouses are the same (dataset in Table A.12).

SAS code

proc glm data = crop;
class house;
model kg = house;
means house / hovtest=levene(TYPE=ABS);
run;
quit;

SAS output

Levene's Test for Homogeneity of kg Variance
ANOVA of Absolute Deviations from Group Means

                  Sum of       Mean
Source    DF     Squares     Square   F Value   Pr > F
house      2      0.2675     0.1337      2.79   0.1012
Error     12      0.5753     0.0479

Remarks:

• The SAS procedure proc glm provides the Levene test.

• The first lines of code are enabling an ANOVA (see Test 17.1.1).

• The code means house / hovtest=levene(TYPE=ABS) lets SAS do the Levene test. In SAS it is also possible to choose the option (TYPE=SQUARE) which uses the squared differences.

• The Brown–Forsythe test can be conducted with the option /hovtest=BF.

R code

# Calculate group means for each greenhouse
m <- tapply(crop$kg, crop$house, mean)

# Calculate the Z's
z <- abs(crop$kg - m[crop$house])

# Overall mean of the Z's
z_mean <- mean(z)

# Group mean of the Z's
z_gm <- tapply(z, crop$house, mean)

# Make a matrix of the Z's (group in the rows)
z_matrix <- rbind(z[crop$house == 1], z[crop$house == 2],
                  z[crop$house == 3])

# Calculate the numerator (3 groups with 5 observations each)
nu <- 0
for (i in 1:3) {
  u <- 5 * (z_gm[i] - z_mean)^2
  nu <- nu + u
}

# Calculate the denominator
de <- 0
for (j in 1:3) {
  for (i in 1:5) {
    e <- (z_matrix[j, i] - z_gm[j])^2
    de <- de + e
  }
}

# Calculate test statistic and p-value (df: k - 1 = 2 and 12)
l <- (12 * nu) / (2 * de)
p_value <- 1 - pf(l, 2, 12)

# Output results
"Levene Test"
l
p_value

R output

[1] "Levene Test"
> l
       1
2.789499
> p_value
        1
0.1011865

Remarks:

• There is no basic R function to calculate Levene's test directly.

• In this example we have $k = 3$ and $\sum_{j=1}^{k}(n_j - 1) = 12$. The respective parts must be adapted if other data are used.

• To apply the Brown–Forsythe test just change the first line of code to m <- tapply(crop$kg, crop$house, median).
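The hard-coded group count and group size can be avoided by wrapping the calculation in a function. The following is one possible sketch, not part of the original program: the function name levene_stat, its arguments and the example data are all hypothetical. It evaluates the test statistic L for any number of groups and unequal sample sizes, and the center argument switches between the mean (Levene) and the median (Brown–Forsythe).

```r
# Sketch of a general Levene test (hypothetical helper, not a base R function)
levene_stat <- function(x, g, center = mean) {
  g    <- factor(g)
  k    <- nlevels(g)
  n_j  <- tapply(x, g, length)               # sample sizes n_j
  z    <- abs(x - ave(x, g, FUN = center))   # Z_ji
  z_gm <- tapply(z, g, mean)                 # group means of the Z's
  df2  <- sum(n_j - 1)
  L <- (df2 * sum(n_j * (z_gm - mean(z))^2)) /
       ((k - 1) * sum((z - ave(z, g))^2))
  c(L = L, df1 = k - 1, df2 = df2,
    p_value = pf(L, k - 1, df2, lower.tail = FALSE))
}

# Hypothetical data: three samples with 4, 5 and 3 observations
x <- c(12.1, 11.4, 13.0, 12.6,
       10.2, 14.8,  9.5, 13.9, 12.0,
       11.9, 12.2, 12.4)
g <- rep(1:3, times = c(4, 5, 3))

res_levene <- levene_stat(x, g)                   # Levene test
res_bf     <- levene_stat(x, g, center = median)  # Brown-Forsythe test

# Cross-check: L equals the one-way ANOVA F value of the Z's
z     <- abs(x - ave(x, g))
f_ref <- anova(lm(z ~ factor(g)))[["F value"]][1]
all.equal(unname(res_levene["L"]), f_ref)
```

The cross-check uses the fact, visible in the SAS output title "ANOVA of Absolute Deviations from Group Means", that L is the one-way ANOVA test statistic applied to the absolute deviations.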

References

Bartlett M.S. 1937 Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London, Series A 160, 268–282.

Brown M.B. and Forsythe A.B. 1974 Robust tests for the equality of variances. Journal of the American Statistical Association 69, 364–367.

Glaser R.E. 1976 Exact critical values for Bartlett's test for homogeneity of variances. Journal of the American Statistical Association 71, 488–490.

Levene H. 1960 Robust tests for equality of variances. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling (eds Olkin I et al.), pp. 278–292. Stanford University Press.

Montgomery D.C. and Runger G.C. 2007 Applied Statistics and Probability for Engineers, 4th edn. John Wiley & Sons, Ltd.

Rencher A.C. 1998 Multivariate Statistical Inference and Applications. John Wiley & Sons, Ltd.

