Chapter 11
The Analysis of Variance
11.1 One Factor Analysis of Variance
NIPRLNIPRLNIPRLNIPRL
11.1 One Factor Analysis of Variance
11.2 Randomized Block Designs (not for this course)
11.1.1 One Factor Layouts (1/4)
• Suppose that an experimenter is interested in k populations with unknown population means $\mu_1, \mu_2, \ldots, \mu_k$.
• The one factor analysis of variance methodology is appropriate for comparing three or more populations.
• The observation $x_{ij}$ represents the j th observation from the i th population.
• The sample from population i consists of the $n_i$ observations $x_{i1}, \ldots, x_{in_i}$.
• If the sample sizes $n_1, \ldots, n_k$ are all equal, then the data set is balanced, and if the sample sizes are unequal, then the data set is unbalanced.
11.1.1 One Factor Layouts (2/4)
• The total sample size of the data set is $n_T = n_1 + \cdots + n_k$.
• A data set of this kind is called a one-way or one factor layout.
• The single factor is said to have k levels corresponding to the k populations under consideration.
• Completely randomized designs : the experiment is performed by randomly allocating a total of $n_T$ “units” among the k populations.
• Modeling assumption
  $x_{ij} = \mu_i + \varepsilon_{ij}$
  where the error terms $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$
• Equivalently, the observations are independent with
  $x_{ij} \sim N(\mu_i, \sigma^2)$
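The modeling assumption can be illustrated with a small simulation. This is only a sketch: the function name `simulate_layout`, the means, and the value of sigma are hypothetical choices made for illustration, and only the sample sizes 11, 14, 10 echo the example used later in this section.

```python
import random

def simulate_layout(means, sizes, sigma, seed=0):
    """Draw x_ij = mu_i + eps_ij with eps_ij ~ N(0, sigma^2), independently."""
    rng = random.Random(seed)
    return [[mu + rng.gauss(0.0, sigma) for _ in range(n)]
            for mu, n in zip(means, sizes)]

# Hypothetical means and sigma; one list of observations per factor level.
data = simulate_layout(means=[11.0, 15.0, 17.0], sizes=[11, 14, 10], sigma=2.0)

n_T = sum(len(sample) for sample in data)                # total sample size
balanced = len({len(sample) for sample in data}) == 1    # all n_i equal?
print(n_T, balanced)  # 35 False
```

Storing the layout as one list per factor level keeps unbalanced data sets as easy to handle as balanced ones.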
11.1.1 One Factor Layouts (3/4)
• Point estimates of the unknown population means
  $\hat{\mu}_i = \bar{x}_{i\cdot} = \frac{x_{i1} + \cdots + x_{in_i}}{n_i}, \quad 1 \le i \le k$
• $H_0: \mu_1 = \cdots = \mu_k$ versus $H_A: \mu_i \ne \mu_j$ for some i and j
• Acceptance of the null hypothesis indicates that there is no evidence that any of the population means are unequal.
• Rejection of the null hypothesis implies that there is evidence that at least some of the population means are unequal.
11.1.1 One Factor Layouts (4/4)
• Example 60 : Collapse of Blocked Arteries
– level 1 : stenosis = 0.78
  level 2 : stenosis = 0.71
  level 3 : stenosis = 0.65
– $\hat{\mu}_1 = \bar{x}_{1\cdot} = \frac{10.6 + \cdots + 8.3}{11} = 11.209$
– $\hat{\mu}_2 = \bar{x}_{2\cdot} = \frac{11.7 + \cdots + 17.6}{14} = 15.086$
– $\hat{\mu}_3 = \bar{x}_{3\cdot} = \frac{19.6 + \cdots + 16.6}{10} = 17.330$
– $H_0: \mu_1 = \mu_2 = \mu_3$
11.1.2 Partitioning the Total Sum of Squares (1/5)
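• Treatment Sum of Squares
– $\bar{x}_{\cdot\cdot} = \frac{x_{11} + \cdots + x_{kn_k}}{n_T} = \frac{n_1 \bar{x}_{1\cdot} + \cdots + n_k \bar{x}_{k\cdot}}{n_T}$
– $\text{SSTr} = \sum_{i=1}^{k} n_i (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2$
– A measure of the variability between the factor levels.
– $\text{SSTr} = \sum_{i=1}^{k} n_i (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2 = \sum_{i=1}^{k} n_i \bar{x}_{i\cdot}^2 - 2 \sum_{i=1}^{k} n_i \bar{x}_{i\cdot} \bar{x}_{\cdot\cdot} + \sum_{i=1}^{k} n_i \bar{x}_{\cdot\cdot}^2$
  $= \sum_{i=1}^{k} n_i \bar{x}_{i\cdot}^2 - 2 n_T \bar{x}_{\cdot\cdot}^2 + n_T \bar{x}_{\cdot\cdot}^2 = \sum_{i=1}^{k} n_i \bar{x}_{i\cdot}^2 - n_T \bar{x}_{\cdot\cdot}^2$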
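The agreement between the defining formula for SSTr and its shortcut form can be checked numerically. The sketch below uses a small hypothetical data set (not the data of Example 60), and the function name `treatment_sum_of_squares` is an illustrative choice.

```python
def treatment_sum_of_squares(data):
    """Return SSTr computed from the definition and from the shortcut formula."""
    sizes = [len(sample) for sample in data]
    means = [sum(sample) / len(sample) for sample in data]
    n_T = sum(sizes)
    # Grand mean as the size-weighted average of the level means.
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_T
    by_definition = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    by_shortcut = sum(n * m ** 2 for n, m in zip(sizes, means)) - n_T * grand_mean ** 2
    return by_definition, by_shortcut

# Hypothetical unbalanced data set with k = 3 levels.
data = [[1.0, 2.0, 3.0], [4.0, 6.0], [10.0]]
sstr_def, sstr_short = treatment_sum_of_squares(data)
print(round(sstr_def, 4), round(sstr_short, 4))  # 49.3333 49.3333
```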
11.1.2 Partitioning the Total Sum of Squares (2/5)
• Error sum of squares
– $\text{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i\cdot})^2$
– A measure of the variability within the factor levels.
– $\text{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i\cdot})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - 2 \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij} \bar{x}_{i\cdot} + \sum_{i=1}^{k} \sum_{j=1}^{n_i} \bar{x}_{i\cdot}^2$
  $= \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - 2 \sum_{i=1}^{k} n_i \bar{x}_{i\cdot}^2 + \sum_{i=1}^{k} n_i \bar{x}_{i\cdot}^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} n_i \bar{x}_{i\cdot}^2$
11.1.2 Partitioning the Total Sum of Squares (3/5)
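• Total sum of squares
– $\text{SST} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{\cdot\cdot})^2$
– A measure of the total variability in the data set
– $\text{SST} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{\cdot\cdot})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - 2 \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij} \bar{x}_{\cdot\cdot} + \sum_{i=1}^{k} \sum_{j=1}^{n_i} \bar{x}_{\cdot\cdot}^2$
  $= \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - 2 n_T \bar{x}_{\cdot\cdot}^2 + n_T \bar{x}_{\cdot\cdot}^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - n_T \bar{x}_{\cdot\cdot}^2$
– SST = SSTr + SSE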
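The partition SST = SSTr + SSE can be verified directly from the three definitions. The following is a minimal sketch on a hypothetical data set; the function name `sums_of_squares` is illustrative only.

```python
def sums_of_squares(data):
    """Return (SST, SSTr, SSE) for a one factor layout given as a list of samples."""
    all_obs = [x for sample in data for x in sample]
    grand_mean = sum(all_obs) / len(all_obs)
    sst = sum((x - grand_mean) ** 2 for x in all_obs)
    sstr = 0.0
    sse = 0.0
    for sample in data:
        level_mean = sum(sample) / len(sample)
        sstr += len(sample) * (level_mean - grand_mean) ** 2   # between levels
        sse += sum((x - level_mean) ** 2 for x in sample)      # within levels
    return sst, sstr, sse

# Hypothetical unbalanced data set with k = 3 levels.
data = [[1.0, 2.0, 3.0], [4.0, 6.0], [10.0]]
sst, sstr, sse = sums_of_squares(data)
print(round(sst, 4), round(sstr, 4), round(sse, 4))  # 53.3333 49.3333 4.0
```

Each of the three quantities is computed independently here, so the equality of `sst` and `sstr + sse` is a genuine check rather than a consequence of how the code is written.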
11.1.2 Partitioning the Total Sum of Squares (4/5)
• P-value considerations
– The plausibility of the null hypothesis that the factor level means
are all equal depends upon the relative size of the sum of
squares for treatments, SSTr, to the sum of squares for error,
SSE.
11.1.2 Partitioning the Total Sum of Squares (5/5)
• Example 60 : Collapse of Blocked Arteries
– $\bar{x}_{1\cdot} = 11.209, \quad \bar{x}_{2\cdot} = 15.086, \quad \bar{x}_{3\cdot} = 17.330$
– $\bar{x}_{\cdot\cdot} = \frac{10.6 + \cdots + 16.6}{35} = 14.509$
– $\sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 = 10.6^2 + \cdots + 16.6^2 = 7710.39$
– $\text{SST} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - n_T \bar{x}_{\cdot\cdot}^2 = 7710.39 - (35 \times 14.509^2) = 342.5$
– $\text{SSTr} = \sum_{i=1}^{k} n_i \bar{x}_{i\cdot}^2 - n_T \bar{x}_{\cdot\cdot}^2 = (11 \times 11.209^2) + (14 \times 15.086^2) + (10 \times 17.330^2) - (35 \times 14.509^2) = 204.0$
– SSE = SST - SSTr = 342.5 - 204.0 = 138.5
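The arithmetic of this example can be retraced from the summary statistics alone. The sketch below uses only the rounded values quoted on the slide, so it reproduces SST closely but gives SSTr and SSE only to within a few tenths of the quoted 204.0 and 138.5, which presumably were computed from unrounded data.

```python
# Summary statistics as quoted in Example 60 (rounded to three decimals).
sizes = [11, 14, 10]
level_means = [11.209, 15.086, 17.330]
grand_mean = 14.509
raw_sum_of_squares = 7710.39          # sum of all x_ij^2

n_T = sum(sizes)
correction = n_T * grand_mean ** 2    # n_T * (grand mean)^2

sst = raw_sum_of_squares - correction
sstr = sum(n * m ** 2 for n, m in zip(sizes, level_means)) - correction
sse = sst - sstr

print(f"SST = {sst:.1f}, SSTr = {sstr:.1f}, SSE = {sse:.1f}")
# SST = 342.5, SSTr = 203.7, SSE = 138.8
```

The shortcut formulas subtract two large numbers, which is why small rounding errors in the means show up in the first decimal place of the results.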
11.1.3 The Analysis of Variance Table (1/5)
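• Mean square error
– With the sample variance at level i given by $s_i^2 = \frac{\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i\cdot})^2}{n_i - 1}$, the error sum of squares is $\text{SSE} = \sum_{i=1}^{k} (n_i - 1) s_i^2$
– $\text{MSE} = \frac{\text{SSE}}{\text{d.f.}} = \frac{\text{SSE}}{n_T - k}$
– $\text{MSE} \sim \sigma^2 \frac{\chi^2_{n_T - k}}{n_T - k}$
– $E(\text{MSE}) = \sigma^2$ since $E(\chi^2_{n_T - k}) = n_T - k$
– So, MSE is an unbiased point estimate of the error variance $\sigma^2$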
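The identity SSE = sum of (n_i - 1) s_i^2 means that MSE is a pooled variance estimate. A minimal sketch with hypothetical data follows; the function name `mean_square_error` is illustrative, and `statistics.variance` is the standard library's sample variance (divisor n - 1), which requires at least two observations per level.

```python
from statistics import variance

def mean_square_error(data):
    """Return (SSE, MSE) using SSE = sum_i (n_i - 1) s_i^2 and MSE = SSE / (n_T - k)."""
    k = len(data)
    n_T = sum(len(sample) for sample in data)
    sse = sum((len(sample) - 1) * variance(sample) for sample in data)
    return sse, sse / (n_T - k)

# Hypothetical data: sample variances are 1, 2 and 2 at the three levels.
data = [[1.0, 2.0, 3.0], [4.0, 6.0], [9.0, 11.0]]
sse, mse = mean_square_error(data)
print(sse, mse)  # 6.0 1.5
```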
11.1.3 The Analysis of Variance Table (2/5)
• Mean squares for treatments
– $\text{MSTr} = \frac{\text{SSTr}}{\text{d.f.}} = \frac{\text{SSTr}}{k - 1}$
– $E(\text{MSTr}) = \sigma^2 + \frac{\sum_{i=1}^{k} n_i (\mu_i - \mu)^2}{k - 1}$ (why?) where $\mu = \frac{n_1 \mu_1 + \cdots + n_k \mu_k}{n_T}$
– If the factor level means $\mu_i$ are all equal, then
  $E(\text{MSTr}) = \sigma^2$ and $\text{MSTr} \sim \sigma^2 \frac{\chi^2_{k-1}}{k - 1}$ (why?)
• These results can be used to develop a method for calculating the p-value of the null hypothesis
– When this null hypothesis is true, then the F-statistic
  $F = \frac{\text{MSTr}}{\text{MSE}} \sim \frac{\chi^2_{k-1} / (k - 1)}{\chi^2_{n_T - k} / (n_T - k)} \sim F_{k-1, n_T-k}$ (why?)
11.1.3 The Analysis of Variance Table (3/5)
• p-value $= P(X \ge F)$ where $X \sim F_{k-1, n_T-k}$
[Figure: density of the $F_{k-1, n_T-k}$ distribution, with the p-value shown as the area to the right of $F = \text{MSTr}/\text{MSE}$]
11.1.3 The Analysis of Variance Table (4/5)
• Analysis of variance table for one factor layout
Source       d.f.      Sum of squares   Mean squares           F-statistic     P-value
Treatments   k - 1     SSTr             MSTr = SSTr/(k - 1)    F = MSTr/MSE    P(F_{k-1, n_T-k} >= F)
Error        n_T - k   SSE              MSE = SSE/(n_T - k)
Total        n_T - 1   SST
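The rows of the table can be assembled from raw data in a few lines. This sketch uses hypothetical data and an illustrative function name `anova_table`. It stops at the F-statistic, because the p-value needs the upper tail of an F distribution, which would normally come from tables or a statistics library.

```python
def anova_table(data):
    """Return the rows of the one factor ANOVA table as a list of dicts."""
    k = len(data)
    all_obs = [x for sample in data for x in sample]
    n_T = len(all_obs)
    grand_mean = sum(all_obs) / n_T
    level_means = [sum(sample) / len(sample) for sample in data]
    sstr = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(data, level_means))
    sse = sum((x - m) ** 2 for s, m in zip(data, level_means) for x in s)
    mstr = sstr / (k - 1)
    mse = sse / (n_T - k)
    return [
        {"source": "Treatments", "df": k - 1, "ss": sstr, "ms": mstr, "F": mstr / mse},
        {"source": "Error", "df": n_T - k, "ss": sse, "ms": mse},
        {"source": "Total", "df": n_T - 1, "ss": sstr + sse},
    ]

# Hypothetical unbalanced data set with k = 3 levels.
data = [[1.0, 2.0, 3.0], [4.0, 6.0], [9.0, 11.0]]
table = anova_table(data)
for row in table:
    print(row)
```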
11.1.3 The Analysis of Variance Table (5/5)
• Example 60 : Collapse of Blocked Arteries
– The degrees of freedom for treatments is $k - 1 = 3 - 1 = 2$
– The degrees of freedom for error is $n_T - k = 35 - 3 = 32$
– $\text{MSTr} = \frac{\text{SSTr}}{2} = \frac{204.0}{2} = 102.0$
– $\text{MSE} = \frac{\text{SSE}}{32} = \frac{138.5}{32} = 4.33$
– $F = \frac{\text{MSTr}}{\text{MSE}} = \frac{102.0}{4.33} = 23.6$
– p-value $= P(X \ge 23.6) \simeq 0$ where $X \sim F_{k-1, n_T-k} = F_{2,32}$
– Consequently, the null hypothesis that the average flowrate at collapse is the same for all three amounts of stenosis is not plausible.
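The numbers in this example can be reproduced with a short calculation. The sketch relies on a special case that is not on the slide: when the numerator degrees of freedom equal 2 (that is, k = 3), the upper tail of the F distribution has the closed form P(X >= f) = (1 + 2f/v)^(-v/2) for X ~ F(2, v). For other values of k this shortcut does not apply, and the function name below is illustrative only.

```python
def f_upper_tail_df1_2(f_value, df2):
    """P(X >= f_value) for X ~ F(2, df2); valid only when the numerator d.f. is 2."""
    return (1.0 + 2.0 * f_value / df2) ** (-df2 / 2.0)

# Sums of squares and sizes as quoted in Example 60.
sstr, sse = 204.0, 138.5
k, n_T = 3, 35

mstr = sstr / (k - 1)
mse = sse / (n_T - k)
f_stat = mstr / mse
p_value = f_upper_tail_df1_2(f_stat, n_T - k)

print(round(mstr, 1), round(mse, 2), round(f_stat, 1))  # 102.0 4.33 23.6
print(p_value < 1e-5)  # True
```

The p-value is far below any conventional significance level, in line with the slide's statement that it is approximately zero.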
11.1.4 Pairwise Comparisons of the Factor Level Means
• When the null hypothesis is rejected, the experimenter can follow up
the analysis with pairwise comparisons of the factor level means to
discover which ones have been shown to be different and by how
much.
• With k factor levels there are k(k-1)/2 pairwise differences
  $\mu_{i_1} - \mu_{i_2}, \quad 1 \le i_1 < i_2 \le k$
• A set of $1 - \alpha$ confidence level simultaneous confidence intervals for these pairwise differences is
  $\mu_{i_1} - \mu_{i_2} \in \left( \bar{x}_{i_1\cdot} - \bar{x}_{i_2\cdot} - \frac{s\, q_{\alpha,k,\nu}}{\sqrt{2}} \sqrt{\frac{1}{n_{i_1}} + \frac{1}{n_{i_2}}},\ \bar{x}_{i_1\cdot} - \bar{x}_{i_2\cdot} + \frac{s\, q_{\alpha,k,\nu}}{\sqrt{2}} \sqrt{\frac{1}{n_{i_1}} + \frac{1}{n_{i_2}}} \right)$
  where $s = \hat{\sigma} = \sqrt{\text{MSE}}$
– $q_{\alpha,k,\nu}$ (see pp. 888-889) is a critical point that is the upper $\alpha$ point of the Studentized range distribution with parameter $k$ and degrees of freedom $\nu = n_T - k$.
• These confidence intervals are similar to the t-intervals
– Difference : $q_{\alpha,k,\nu}/\sqrt{2}$ is used instead of $t_{\alpha/2,\nu}$
– t-intervals have an individual confidence level whereas this set of simultaneous confidence intervals has an overall confidence level
– With the overall confidence level, all of the k(k-1)/2 confidence intervals simultaneously contain their respective parameter value $\mu_{i_1} - \mu_{i_2}$
– $q_{\alpha,k,\nu}/\sqrt{2}$ is larger than $t_{\alpha/2,\nu}$
• If the confidence interval for the difference $\mu_{i_1} - \mu_{i_2}$ contains zero, then there is no evidence that the means at factor levels $i_1$ and $i_2$ are different
• Example 60 : Collapse of Blocked Arteries
– $s = \hat{\sigma} = \sqrt{\text{MSE}} = \sqrt{4.33} = 2.080$
– With 32 degrees of freedom for error, the critical point is $q_{0.05,3,32} = 3.48$
– The overall confidence level is 1 - 0.05 = 0.95
– Individual confidence intervals have confidence levels of 1 - 0.0196 ≈ 0.98
– The confidence interval for $\mu_1 - \mu_2$ :
  $\mu_1 - \mu_2 \in \left( 11.209 - 15.086 - \frac{2.080 \times 3.48}{\sqrt{2}} \sqrt{\frac{1}{11} + \frac{1}{14}},\ 11.209 - 15.086 + \frac{2.080 \times 3.48}{\sqrt{2}} \sqrt{\frac{1}{11} + \frac{1}{14}} \right)$
  $= (-3.877 - 2.062, -3.877 + 2.062) = (-5.939, -1.814)$
– The confidence interval for $\mu_1 - \mu_3$ :
  $\mu_1 - \mu_3 \in \left( 11.209 - 17.330 - \frac{2.080 \times 3.48}{\sqrt{2}} \sqrt{\frac{1}{11} + \frac{1}{10}},\ 11.209 - 17.330 + \frac{2.080 \times 3.48}{\sqrt{2}} \sqrt{\frac{1}{11} + \frac{1}{10}} \right)$
  $= (-6.121 - 2.236, -6.121 + 2.236) = (-8.357, -3.884)$
– The confidence interval for $\mu_2 - \mu_3$ :
  $\mu_2 - \mu_3 \in \left( 15.086 - 17.330 - \frac{2.080 \times 3.48}{\sqrt{2}} \sqrt{\frac{1}{14} + \frac{1}{10}},\ 15.086 - 17.330 + \frac{2.080 \times 3.48}{\sqrt{2}} \sqrt{\frac{1}{14} + \frac{1}{10}} \right)$
  $= (-2.244 - 2.119, -2.244 + 2.119) = (-4.364, -0.125)$
– None of these three confidence intervals contains zero, and so the experiment has established that each of the three stenosis levels results in a different average flow rate at collapse.
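The three intervals can be recomputed with a short loop over all pairs of levels. In this sketch the critical point 3.48 is taken from the slide rather than computed, and the function name `pairwise_intervals` is illustrative. Because the code uses the square root of 4.33 without rounding it to 2.080, the endpoints may differ from the slide in the third decimal place.

```python
from itertools import combinations
from math import sqrt

def pairwise_intervals(means, sizes, mse, q_crit):
    """Simultaneous intervals for mu_i1 - mu_i2 using s * q / sqrt(2) as the multiplier."""
    s = sqrt(mse)
    intervals = {}
    for i1, i2 in combinations(range(len(means)), 2):
        diff = means[i1] - means[i2]
        half_width = s * q_crit / sqrt(2) * sqrt(1 / sizes[i1] + 1 / sizes[i2])
        intervals[(i1 + 1, i2 + 1)] = (diff - half_width, diff + half_width)
    return intervals

# Level means, sample sizes, MSE and critical point as quoted in Example 60.
intervals = pairwise_intervals(
    means=[11.209, 15.086, 17.330], sizes=[11, 14, 10], mse=4.33, q_crit=3.48
)
for (i1, i2), (low, high) in intervals.items():
    contains_zero = low <= 0.0 <= high
    print(f"mu_{i1} - mu_{i2}: ({low:.3f}, {high:.3f}) contains zero: {contains_zero}")
```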
11.1.5 Sample Size Determination
• The sensitivity afforded by a one factor analysis of variance depends upon the k sample sizes $n_1, \ldots, n_k$
• The power of the test of the null hypothesis that the factor level means are all equal increases as the sample sizes increase.
• An increase in the sample size results in a decrease in the lengths of the pairwise confidence intervals.
• If the sample sizes $n_i$ are unequal, the length of the confidence interval for $\mu_{i_1} - \mu_{i_2}$ is
  $L = \sqrt{2}\, s\, q_{\alpha,k,\nu} \sqrt{\frac{1}{n_{i_1}} + \frac{1}{n_{i_2}}}$
• If the sample sizes are all equal to n,
  $L = 2 s\, q_{\alpha,k,\nu} / \sqrt{n}$
• If prior to experimentation, an experimenter decides that a confidence interval length no larger than L is required, then the sample size is
  $n \ge \frac{4 s^2 q_{\alpha,k,\nu}^2}{L^2}$
– The experimenter needs to estimate the value of $s = \hat{\sigma}$
– The critical point $q_{\alpha,k,\nu}$ gets larger as the number of factor levels $k$ increases, which results in a larger sample size required.
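The sample size bound is easy to evaluate once values of s and the critical point are available. In the sketch below, s = 2.080 and q = 3.48 are borrowed from Example 60, while the target length L = 3.0 is a hypothetical choice. The critical point really depends on the error degrees of freedom, which change with n, so treating q as fixed gives only an approximate answer.

```python
from math import ceil, sqrt

def required_sample_size(s, q_crit, max_length):
    """Smallest integer n with n >= 4 s^2 q^2 / L^2 (equal sample sizes at each level)."""
    return ceil(4.0 * s ** 2 * q_crit ** 2 / max_length ** 2)

s, q_crit, max_length = 2.080, 3.48, 3.0   # max_length is a hypothetical target
n = required_sample_size(s, q_crit, max_length)
print(n)  # 24

# Check: the interval length 2 s q / sqrt(n) meets the target at n but not at n - 1.
print(2 * s * q_crit / sqrt(n) <= max_length)        # True
print(2 * s * q_crit / sqrt(n - 1) <= max_length)    # False
```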
11.1.6 Model Assumptions
• Modeling assumption of the analysis of variance
– Observations are distributed independently with normal
distribution that has a common variance
– The independence of the data observations can be judged from
the manner in which a data set is collected.
– The ANOVA is fairly robust to the distribution of data, so that it
provides fairly accurate results as long as the distribution is not
very far from a normal distribution.
– The equality of the variances for each of the k factor levels can
be judged from a comparison of the sample variances or from a
comparison of the lengths of boxplots of the observations at
each factor level.
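A comparison of the sample variances can be done in a couple of lines. This is only a sketch with hypothetical data. The ratio of the largest to the smallest sample variance is used here as one convenient summary, not as a formal test, and no particular cutoff is implied by the slides.

```python
from statistics import variance

def variance_summary(data):
    """Return the sample variance at each level and the ratio of largest to smallest."""
    variances = [variance(sample) for sample in data]   # needs n_i >= 2 at each level
    return variances, max(variances) / min(variances)   # assumes no variance is zero

# Hypothetical data with sample variances 1, 2 and 2.
data = [[1.0, 2.0, 3.0], [4.0, 6.0], [9.0, 11.0]]
variances, ratio = variance_summary(data)
print(variances, ratio)  # [1.0, 2.0, 2.0] 2.0
```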
Summary problems
1. Can the equality of two population means be tested by ANOVA?
2. When does the F-statistic follow an F-distribution?
3. Why do you use the q-values instead of the t-quantiles in
pairwise comparisons of multiple means?
4. Does $\text{SSE}/\sigma^2$ follow a chi-square distribution under both the null hypothesis and its alternative?