
Notes on Analysis of Variance: Old School

John I. Marden

Copyright 2003


Chapter 1

Introduction to Linear Models

These notes are based on a course I taught using the text Plane Answers to Complex Questions by Ronald Christensen (Third edition, 2002, Springer-Verlag). Hence, everything throughout these pages implicitly uses that book as a reference. So keep a copy handy! But everything here is my own interpretation.

1.1 Dependent and explanatory variables

How is height related to weight? How are sex and age related to heart disease? What factors influence crime rate? Questions such as these have one dependent variable of interest, and one or more explanatory variables. The goal is to assess the relationship of the explanatory variables to the dependent variable. Examples:

    Dependent Variable     Explanatory Variables
    Weight                 Height
    Cholesterol level      Fat intake
    Heart function         Age, sex
    Crime rate             Population density, Average income, Educational level
    Bacterial count        Drug

Linear models model the relationship by writing the mean of the dependent variable as a linear combination of the explanatory variables, or some representations of the explanatory variables. For example, a linear model relating cholesterol level to the percentage of fat in the diet would be

cholesterol = β0 + β1(fat) + residual. (1.1)

The intercept β0 and slope β1 are parameters, usually unknown and to be estimated. One does not expect the cholesterol level to be an exact function of fat. Rather, there will be random variation: Two people with the same fat intake will likely have different cholesterol levels, just as two people of the same height will have different weights. The residual is the part of the dependent variable not explained by the linear function of the explanatory variables. As we go along, we will make other assumptions about the residuals, but the key one at this point is that they have mean 0. That is, the dependent variable is on average equal to the linear function of the explanatory variables.



It is easy to think of more complicated models that are still linear, e.g.,

cholesterol = β0 + β1(fat) + γ(exercise) + residual, (1.2)

or

cholesterol = β0 + β1(fat) + β2(fat)² + residual. (1.3)

“Wait!” you might say. That last equation is not linear, it is quadratic: the mean of cholesterol is a parabolic function of fat intake. Here is one of the strengths of linear models: The linearity is in the parameters, so that one or more representations of the explanatory variables can appear (e.g., here represented by fat and fat²), as long as they are combined linearly. An example of a non-linear model:

cholesterol = β0 e^{β1(fat)} + residual. (1.4)

This model is perfectly fine, just not a linear model.

A particular type of linear model, used when the explanatory variables are categorical, is the analysis of variance model, which is the main focus of this course. A categorical variable is one whose values are not necessarily numerical. One study measured the bacterial count of leprosy patients, where each patient was given one of three treatments: Drug A, Drug D, or a placebo. The explanatory variable is the treatment, but it is not numerical. One way to represent treatment is with 0-1 variables, say a, d, and p:

    a = { 1 if treatment is Drug A,            d = { 1 if treatment is Drug D,
        { 0 if treatment is not Drug A,            { 0 if treatment is not Drug D,   (1.5)

and

    p = { 1 if treatment is the placebo,
        { 0 if treatment is not the placebo.   (1.6)

If someone received Drug A, that patient’s values for the representations would be a = 1, d = 0, p = 0. Similarly, one receiving the placebo would have a = 0, d = 0, p = 1. (Note that exactly one of a, d, p will be 1, the others 0.) One can go the other way: knowing a, d, p, it is easy to figure out what the treatment is. In fact, you only need to know two of them, e.g., a and d. A linear model constructed from these representations is then

bacterial count = β1a + β2d + β3p + residual. (1.7)

If the residual has mean 0, then one can see that β1 is the mean of patients who receive Drug A, β2 is the mean of those who receive Drug D, and β3 is the mean for the control group.
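The 0-1 representations above are easy to build mechanically. Here is a minimal numpy sketch (not part of the original notes; the treatment order and variable names are made up for illustration):

    import numpy as np

    # Hypothetical treatment assignments for six patients.
    treatment = np.array(["A", "A", "D", "D", "P", "P"])

    a = (treatment == "A").astype(float)   # 1 if Drug A, else 0, as in (1.5)
    d = (treatment == "D").astype(float)   # 1 if Drug D, else 0
    p = (treatment == "P").astype(float)   # 1 if placebo, else 0, as in (1.6)

    X = np.column_stack([a, d, p])         # design matrix for model (1.7)
    print(X)
    # Exactly one of a, d, p equals 1 in every row:
    assert np.all(a + d + p == 1)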


1.2 Matrix notation

The values for the dependent variable will be denoted using y’s. The representations of the explanatory variables will usually be denoted using x’s, although other letters may show up. The y’s and x’s will have subscripts to indicate individual and variable. These subscripts may be single or may be multiple, whichever seems most useful at the time. The residuals use e’s, and the parameters are usually β’s, but can be any other Greek letter as well.

The next step is to write the model in a universal matrix notation. The dependent variable is an n × 1 vector y, where n is the number of observations. The representations of the explanatory variables are in the n × p matrix X, where the jth column of X contains the values for the n observations on the jth representation. The β is the p × 1 vector of coefficients. Finally, e is the n × 1 vector of residuals. Putting these together, we have the model

y = Xβ + e. (1.8)

(Generally, column vectors will be denoted by underlining, and matrices will be bold.) The following examples show how to set up this matrix notation.

Simple linear regression. Simple linear regression has one x, as in (1.1). If there are n observations, then the linear model would be written

yi = β0 + β1xi + ei, i = 1, . . . , n. (1.9)

The y, e, and β are easy to construct:

    y = (y1, y2, . . . , yn)′,   e = (e1, e2, . . . , en)′   and   β = (β0, β1)′.   (1.10)

For X, we need a vector for the xi’s, but also a vector of 1’s, which are surreptitiously multiplying the β0:

    X = [ 1  x1
          1  x2
          ⋮   ⋮
          1  xn ].   (1.11)

Check to see that putting (1.10) and (1.11) in (1.8) yields the model (1.9). Note that p = 2, that is, X has two columns even though there is only one x.

Another useful way to look at the model is to let 1n be the n × 1 vector of 1’s, and x be the vector of the xi’s, so that X = (1n, x), and

y = β01n + β1x + e. (1.12)

(The text uses J to denote 1n, which is fine.)
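The construction (1.10)–(1.12) is easy to check numerically. A minimal sketch, with made-up x values (numpy is not used in the notes themselves):

    import numpy as np

    n = 5
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    X = np.column_stack([np.ones(n), x])   # columns: 1_n and x, so p = 2

    beta = np.array([0.5, 2.0])            # hypothetical beta_0, beta_1
    e = np.zeros(n)                        # residuals with mean 0 (here simply zero)
    y = X @ beta + e                       # y = X beta + e, as in (1.8)
    print(y)                               # same as beta[0] + beta[1] * x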


Multiple linear regression. When there is more than one explanatory variable, as in (1.2), we need an extra subscript for x, so that xi1 is the fat value and xi2 is the exercise level for person i:

yi = β0 + β1xi1 + β2xi2 + ei, i = 1, . . . , n. (1.13)

With q variables, the model would be

yi = β0 + β1xi1 + · · · + βqxiq + ei, i = 1, . . . , n. (1.14)

Notice that the quadratic model (1.3) is of this form with xi1 = xi and xi2 = xi²:

yi = β0 + β1xi + β2xi² + ei, i = 1, . . . , n. (1.15)

The general model (1.14) has the form (1.8) with a longer β and wider X:

    y = Xβ + e =
        [ 1  x11  x12  · · ·  x1q
          1  x21  x22  · · ·  x2q
          ⋮   ⋮    ⋮           ⋮
          1  xn1  xn2  · · ·  xnq ] (β0, β1, β2, . . . , βq)′ + e.   (1.16)

Here, p = q + 1, again there being an extra column in X for the 1n vector. Analogous to (1.12), if we let xj be the vector of xij’s, which is the (j + 1)st column of X, so that

    X = (1n, x1, x2, . . . , xq),   (1.17)

we have that

    y = β01n + β1x1 + · · · + βqxq + e.   (1.18)

Analysis of variance. In analysis of variance, or ANOVA, the explanatory variables are categorical. A one-way ANOVA has one categorical variable, as in the leprosy example (1.7). Suppose in that example, there are two observations for each treatment, so that n = 6. (The actual experiment had ten observations in each group.) The layout is

    Drug A      Drug D      Control
    y11, y12    y21, y22    y31, y32        (1.19)

where now the dependent variable is denoted yij, where i indicates the treatment, 1 = Drug A, 2 = Drug D, 3 = Control, and j indicates the individual within the treatment. The linear model (1.7) translates to

yij = βi + eij , i = 1, 2, 3, j = 1, 2. (1.20)

To write the model in the matrix form (1.8), we first have to vectorize the yij’s (and eij’s), even though notationally they look like elements of a matrix.


Any way you string them out is fine, as long as you are consistent. We will do it systematically by grouping the observations by treatment, that is,

    y = (y11, y12, y21, y22, y31, y32)′,   and   e = (e11, e12, e21, e22, e31, e32)′.   (1.21)

Writing out the model in matrix form, we have

    (y11, y12, y21, y22, y31, y32)′ =
        [ 1 0 0
          1 0 0
          0 1 0
          0 1 0
          0 0 1
          0 0 1 ] (β1, β2, β3)′ + e.   (1.22)

Two-way ANOVA has two categorical explanatory variables. For example, the table below contains the leaf area/dry weight for some citrus trees, categorized by type of citrus fruit and amount of shade:

                  Orange   Grapefruit   Mandarin
    Sun             112        90          123
    Half-shade       86        73           89
    Shade            80        62           81        (1.23)

(From Table 11.2.1 in Statistical Methods by Snedecor and Cochran.) Each variable has 3 categories, which means there are 9 categories taking the two variables together. The dependent variable again has two subscripts, yij, where now the i indicates the row variable (sun/shade) and the j represents the column variable (type of fruit). That is,

                  Orange   Grapefruit   Mandarin
    Sun             y11        y12         y13
    Half-shade      y21        y22         y23
    Shade           y31        y32         y33        (1.24)

One linear model for such data is the additive model, in which the mean for yij is the sum of an effect of the ith row variable and an effect for the jth column. That is, suppose α1, α2, and α3 are the effects attached to Sun, Half-shade, and Shade, respectively, and β1, β2, and β3 are the effects attached to Orange, Grapefruit, and Mandarin, respectively. Then the additive model is

yij = αi + βj + eij . (1.25)

The idea is that the two variables act separately. E.g., the effect of sun on y is the same for each fruit. The additive model places a restriction on the means of the cells, that is,

µij ≡ E(Yij) = αi + βj . (1.26)


For example, the following table of µij’s does follow an additive model:

                  Orange   Grapefruit   Mandarin
    Sun             110        90          120
    Half-shade       80        60           90
    Shade            70        50           80        (1.27)

Going from sun to half-shade subtracts 30, no matter which fruit; and going from half-shade to shade subtracts 10, again no matter which fruit. There are many sets of parameters that would fit those means. One is α1 = 0, α2 = −30, α3 = −40, β1 = 110, β2 = 90, β3 = 120. An example of a non-additive model has means

                  Orange   Grapefruit   Mandarin
    Sun             110        90          120
    Half-shade       85        60           85
    Shade            60        50           80        (1.28)

There are no sets of parameters that satisfy (1.26) for these µij’s. Note that going from sun to half-shade for Orange subtracts 25, while for Grapefruit it subtracts 30.

To write the additive model (1.25) in matrix form, we have to have 0-1 vectors for the rows, and 0-1 vectors for the columns. We will start with writing these vectors in table form. The row vectors:

    Sun (α1)          Half-shade (α2)     Shade (α3)

    x1 :  1 1 1       x2 :  0 0 0         x3 :  0 0 0
          0 0 0             1 1 1               0 0 0
          0 0 0             0 0 0               1 1 1        (1.29)

And the column vectors:

    Orange (β1)       Grapefruit (β2)     Mandarin (β3)

    x4 :  1 0 0       x5 :  0 1 0         x6 :  0 0 1
          1 0 0             0 1 0               0 0 1
          1 0 0             0 1 0               0 0 1        (1.30)

As before, we write the yij’s, eij ’s, and parameters in vectors:

    y = (y11, y12, y13, y21, y22, y23, y31, y32, y33)′,
    e = (e11, e12, e13, e21, e22, e23, e31, e32, e33)′,   and
    β = (α1, α2, α3, β1, β2, β3)′.   (1.31)

Page 9: Anova

1.2. MATRIX NOTATION 9

For the X, we use the vectors in (1.29) and (1.30), making sure that the 0’s and 1’s in these vectors are correctly lined up with the observations in the y vector. That is, X = (x1, x2, x3, x4, x5, x6), and the model is

                                     Rows     Columns
    y = (y11, y12, . . . , y33)′ = [ 1 0 0    1 0 0
                                     1 0 0    0 1 0
                                     1 0 0    0 0 1
                                     0 1 0    1 0 0
                                     0 1 0    0 1 0
                                     0 1 0    0 0 1
                                     0 0 1    1 0 0
                                     0 0 1    0 1 0
                                     0 0 1    0 0 1 ] (α1, α2, α3, β1, β2, β3)′ + e,   (1.32)

or

    y = α1x1 + α2x2 + α3x3 + β1x4 + β2x5 + β3x6 + e.   (1.33)
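The vectors x1, . . . , x6 and the additive means can be generated programmatically. A sketch, assuming the observation order y11, y12, y13, y21, . . . , y33 used above (the code is illustrative, not from the notes):

    import numpy as np

    rows = np.repeat(np.arange(3), 3)      # 0,0,0,1,1,1,2,2,2  (Sun, Half-shade, Shade)
    cols = np.tile(np.arange(3), 3)        # 0,1,2,0,1,2,0,1,2  (Orange, Grapefruit, Mandarin)

    X_rows = (rows[:, None] == np.arange(3)).astype(float)   # x1, x2, x3
    X_cols = (cols[:, None] == np.arange(3)).astype(float)   # x4, x5, x6
    X = np.hstack([X_rows, X_cols])                          # the 9 x 6 matrix of (1.32)

    # The means (1.27) satisfy the additive model with these (non-unique) parameters:
    beta = np.array([0.0, -30.0, -40.0, 110.0, 90.0, 120.0]) # alphas then betas
    mu = X @ beta
    print(mu.reshape(3, 3))   # reproduces the table of mu_ij's in (1.27)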

If the additivity restriction (1.26), that µij = αi + βj, is violated, then the model is said to have interaction. Specifically, the interaction term for each cell is defined by the difference

γij = µij − αi − βj , (1.34)

and the model with interaction is

yij = αi + βj + γij + eij . (1.35)

To write out the model completely, we need to add the γij’s to β, and corresponding vectors to the X:

                                     Rows     Columns   Interactions
    y = (y11, y12, . . . , y33)′ = [ 1 0 0    1 0 0     1 0 0 0 0 0 0 0 0
                                     1 0 0    0 1 0     0 1 0 0 0 0 0 0 0
                                     1 0 0    0 0 1     0 0 1 0 0 0 0 0 0
                                     0 1 0    1 0 0     0 0 0 1 0 0 0 0 0
                                     0 1 0    0 1 0     0 0 0 0 1 0 0 0 0
                                     0 1 0    0 0 1     0 0 0 0 0 1 0 0 0
                                     0 0 1    1 0 0     0 0 0 0 0 0 1 0 0
                                     0 0 1    0 1 0     0 0 0 0 0 0 0 1 0
                                     0 0 1    0 0 1     0 0 0 0 0 0 0 0 1 ]
        × (α1, α2, α3, β1, β2, β3, γ11, γ12, γ13, γ21, γ22, γ23, γ31, γ32, γ33)′ + e.   (1.36)


We will see later that we do not need that many vectors in X, as there are many redundancies the way it is written now.

Analysis of covariance. It may be that the main interest is in comparing the means of groups as in analysis of variance, but there are other variables that may be affecting the y. For example, in the study comparing three drugs’ effectiveness in treating leprosy, there were bacterial measurements before and after treatment. The yij’s are the “after” measurement, and one would expect the “before” measurement, in addition to the drugs, to affect the after measurement. Letting zij represent the before measurements, the model modifies (1.20) to

yij = βi + γzij + eij , (1.37)

or in matrix form, modifying (1.22),

    (y11, y12, y21, y22, y31, y32)′ =
        [ 1 0 0 z11
          1 0 0 z12
          0 1 0 z21
          0 1 0 z22
          0 0 1 z31
          0 0 1 z32 ] (β1, β2, β3, γ)′ + e.   (1.38)

1.3 Vector spaces – Definition

The model (1.8), y = Xβ + e, is very general. The X matrix can contain any numbers. The previous section gives some ideas of the scope of the model. In this section we look at the model slightly more abstractly. Letting µ be the vector of means of the elements of y, and X = (x1, x2, . . . , xp), the model states that

µ = β1x1 + β2x2 + · · ·+ βpxp, (1.39)

where the βj’s can be any real numbers. Now µ is a vector in Rn, and (1.39) shows that µ is actually in a subset of Rn:

    µ ∈ M ≡ {c1x1 + c2x2 + · · · + cpxp | c1 ∈ R, . . . , cp ∈ R}.   (1.40)

Such a space of linear combinations of a set of vectors is called a span.

Definition 1 The span of the set of vectors {x1, . . . , xp} ⊂ Rn is

span{x1, . . . , xp} = {c1x1 + · · ·+ cpxp | ci ∈ R, i = 1, . . . , p}. (1.41)

Because the matrix notation (1.8) is heavily used in this course, we have notation for connecting the X to the M.


Definition 2 For an n × p matrix X, C(X) denotes the column space of X, which is the span of the columns of X. That is, if X = (x1, . . . , xp), then

    C(X) = span{x1, . . . , xp}.   (1.42)

Spans have nice properties. In fact, a span is a vector space. The formal definition of a vector space, at least for those that are subsets of Rn, follows. [Vector spaces are more general than those that are subsets of Rn, but since those are the only ones we need, we will stick with this restricted definition.]

Definition 3 A subset M ⊂ Rn is a vector space if

x, y ∈ M =⇒ x + y ∈ M, and (1.43)

c ∈ R, x ∈ M =⇒ cx ∈ M. (1.44)

Thus any linear combination of vectors in M is also in M. Note that Rn is itself a vector space, as is the set {0n}. [The n × 1 vector of all 0’s is 0n.] Because c in (1.44) can be 0, any subspace must contain 0n. Any line through 0n, or plane through 0n, is a subspace. It is not hard to show that any span is a vector space. Take M in (1.40). First, if x, y ∈ M, then there are ai’s and bi’s such that

x = a1x1 + a2x2 + · · · + apxp and y = b1x1 + b2x2 + · · ·+ bpxp, (1.45)

so that

    x + y = c1x1 + c2x2 + · · · + cpxp, where ci = ai + bi,   (1.46)

hence x + y ∈ M. Second, for x ∈ M as in (1.45) and real c,

cx = c1x1 + c2x2 + · · ·+ cpxp, where ci = cai, (1.47)

hence cx ∈ M.

Not only is any span a subspace, but any subspace is a span of some vectors. Thus a linear model (1.8) can equivalently be defined as one for which

µ ∈ M (µ = E[Y ]) (1.48)

for some vector space M.

Specifying a vector space through a span is quite convenient, but not the only convenient way. Another is to give the form of elements directly. For example, the vector space of all vectors with equal elements can be given in the following two ways:

    { (a, a, . . . , a)′ ∈ Rn | a ∈ R } = span{1n}.   (1.49)


When n = 3, the x/y plane can be represented as

    { (a, b, 0)′ | a ∈ R, b ∈ R } = span{ (1, 0, 0)′, (0, 1, 0)′ }.   (1.50)

A different plane is

    { (a, a + b, b)′ | a ∈ R, b ∈ R } = span{ (1, 1, 0)′, (0, 1, 1)′ }.   (1.51)

1.4 Linear independence and bases

Any subspace of Rn can be written as a span of at most n vectors, although not in a unique way. For example,

    span{ (1, 0, 0)′, (0, 1, 0)′ } = span{ (1, 0, 0)′, (0, 1, 0)′, (1, 1, 0)′ }
                                   = span{ (1, 0, 0)′, (1, 1, 0)′ }
                                   = span{ (2, 0, 0)′, (0, −7, 0)′, (33, 2, 0)′ }.   (1.52)

Note that the space in (1.52) can be a span of two or three vectors, or a span of any number more than three as well. It cannot be written as a span of only one vector. The minimum number of vectors is called the rank of the space, which in this example is 2. Any set of two vectors which does span that space is called a basis. Notice that in the two sets of three vectors, there is a redundancy, that is, one of the vectors can be written as a linear combination of the other two: (1, 1, 0)′ = (1, 0, 0)′ + (0, 1, 0)′ and (2, 0, 0)′ = (4/(33 · 7))(0, −7, 0)′ + (2/33)(33, 2, 0)′. Such sets are called linearly dependent.

To formally define basis, we need to first define linear independence.

Definition 4 The vectors x1, . . . , xp in Rn are linearly independent if

a1x1 + · · ·+ apxp = 0n =⇒ a1 = · · · = ap = 0. (1.53)

Equivalently, the vectors are linearly independent if no one of them (as long as it is not 0n) can be written as a linear combination of the others. That is, they are linearly dependent if there is an xi ≠ 0n and a set of coefficients ai such that

xi = a1x1 + · · ·+ ai−1xi−1 + ai+1xi+1 + . . . + apxp. (1.54)


They are not linearly dependent if and only if they are linearly independent. In (1.52), the sets with three vectors are linearly dependent, and those with two vectors are linearly independent. To see the latter fact for {(1, 0, 0)′, (1, 1, 0)′}, suppose that a1(1, 0, 0)′ + a2(1, 1, 0)′ = (0, 0, 0)′. Then

a1 + a2 = 0 and a2 = 0 =⇒ a1 = a2 = 0, (1.55)

which verifies (1.53). If a set of vectors is linearly dependent, then one can remove one of the redundant vectors (1.54), and still have the same span. A basis is a set of vectors that has the same span but no dependencies.

Definition 5 The set of vectors {z1, . . . , zd} is a basis for the subspace M if the vectors are linearly independent and M = span{z1, . . . , zd}.

For estimating β, the following lemma is useful.

Lemma 1 If {x1, . . . , xp} is a basis for M, then for x ∈ M, there is a unique set of coefficients a1, . . . , ap such that x = a1x1 + · · · + apxp.

Although a (nontrivial) subspace has many bases, each basis has the same number of elements, which is the rank.

Definition 6 The rank of a subspace is the number of vectors in any of its bases.

A couple of useful facts about a vector space M with rank d –

1. Any set of more than d vectors from M is linearly dependent;

2. Any set of d linearly independent vectors from M forms a basis of M.

For example, consider the one-way ANOVA model in (1.22). The three vectors in X are clearly linearly independent, hence the space C(X) has rank 3, and those vectors constitute a basis. On the other hand, the columns of X in the two-way additive ANOVA model in (1.32) are not linearly independent: The first three add to the vector of all 1’s, as do the last three, hence

x1 + x2 + x3 − x4 − x5 − x6 = 0. (1.56)

Removing any one of the vectors does leave a basis. The model (1.36) has many redundancies. For one thing, n = 9, and there are 15 columns in X. One basis consists of the 9 interaction vectors (i.e., the last 9 vectors). Another consists of the columns of the following matrix,


obtained by dropping a judicious set of vectors from X:

      Rows     Columns   Interactions
    [ 1 0 0    1 0       1 0 0 0
      1 0 0    0 1       0 1 0 0
      1 0 0    0 0       0 0 0 0
      0 1 0    1 0       0 0 1 0
      0 1 0    0 1       0 0 0 1
      0 1 0    0 0       0 0 0 0
      0 0 1    1 0       0 0 0 0
      0 0 1    0 1       0 0 0 0
      0 0 1    0 0       0 0 0 0 ]        (1.57)

Not just any 9 vectors of X will be a basis, though. For example, the first 9 are not linearly independent, as in (1.56).

1.5 Summary

This chapter introduced linear models, with some examples, and showed that any linear model can be expressed in a number of equivalent ways:

1. Each yi is written as a linear combination of xi’s, plus a residual (e.g., equation 1.13);

2. The vector y is written as a linear combination of xi vectors, plus a vector of residuals (e.g., equation 1.18);

3. The vector y is written as Xβ, plus a vector of residuals (as in equation 1.8);

4. The mean vector is restricted to a vector space, as in (1.48).

Each representation is useful in different situations, and it is important to be able to go from one to the others.

In the next chapter, we consider estimation of the mean E(Y) and the parameters β. What can be estimated, and how, depends on the vectors in X, and whether they form a basis.


Chapter 2

Estimation and Projections

This chapter considers estimating E[Y] and β. There are many estimation techniques, and which is best depends on the distribution assumptions made on the residuals. We start with minimal assumptions on the residuals, and look at unbiased estimation using least squares.

The basic model is

    y = Xβ + e, where E[e] = 0n.   (2.1)

The expected value of a vector is just the vector of expected values, so that

    E[e] = E[(e1, e2, . . . , en)′] = (E[e1], E[e2], . . . , E[en])′.   (2.2)

Then, as in (1.39), with µ ≡ E[Y], we have µ = Xβ + E[e], so that

µ = Xβ. (2.3)

2.1 Projections and estimation of the mean

We first divorce ourselves from β, and just estimate µ ∈ M (2.3) for a vector space M. The idea in estimating µ is to pick an estimate µ̂ ∈ M that is “close” to the observed y. The least squares principle is to find the estimate that has the smallest sum of squares from y. That is, µ̂ is the vector in M such that

    Σ_{i=1}^n (yi − µ̂i)² ≤ Σ_{i=1}^n (yi − ai)²   for any a ∈ M.   (2.4)

The length of a vector x ∈ Rn is √(Σ_{i=1}^n xi²), which is denoted by the norm ‖x‖, so that

    ‖x‖² = Σ_{i=1}^n xi² = x′x.   (2.5)


Thus we can define the least squares estimate of µ ∈ M to be the µ̂ ∈ M such that

‖y − µ̂‖² ≤ ‖y − a‖² for any a ∈ M. (2.6)

Suppose n = 2, and the space M = span{x} for some nonzero vector x. Then M is a line through the origin. Take a point y somewhere in the space, and try to find the point on the line closest to y. Draw a line segment from y to that point, and you will notice that that segment is perpendicular, or orthogonal, to the line M. This idea can be generalized to any vector space. First, we define orthogonality.

Definition 7 For vectors and vector spaces in Rn:

1. Two vectors a and b are orthogonal, written a ⊥ b, if a′b = 0.

2. A vector a is orthogonal to the vector space M, written a ⊥ M, if a is orthogonal toevery x ∈ M.

3. Two vector spaces M and N are orthogonal, written M ⊥ N , if

a ∈ M, b ∈ N =⇒ a ⊥ b. (2.7)

4. The orthogonal complement of a vector space M, written M⊥, is the set of all vectors in Rn orthogonal to M, that is,

    M⊥ = {a ∈ Rn | a ⊥ M}.   (2.8)

Projecting a point onto a vector space means dropping a perpendicular from the point to the space. The resulting vector in the space is the projection.

Definition 8 For a vector y and vector space M in Rn, the projection of y onto M is the point ŷ ∈ M such that

    y − ŷ ⊥ M.   (2.9)

It turns out that the projection solves the least squares problem.

Proposition 1 For y and M in Rn, the unique least squares estimate µ̂ in (2.6) is ŷ, the projection of y onto M.

Proof. Let a be any element of M. Then

    ‖y − a‖² = ‖(y − ŷ) + (ŷ − a)‖²
             = ‖y − ŷ‖² + ‖ŷ − a‖² + 2(y − ŷ)′(ŷ − a).   (2.10)

Now ŷ ∈ M (because that’s where projections live) and a ∈ M by assumption, so ŷ − a ∈ M (Why?). But y − ŷ ⊥ M, because ŷ is the projection of y, so in particular y − ŷ ⊥ ŷ − a, hence the last term in (2.10) is 0. Thus if a ≠ ŷ,

‖y − a‖² > ‖y − ŷ‖², (2.11)

which means that µ̂ = ŷ, and it is unique. □


2.1.1 Some simple examples

Suppose M = span{1n} = {(a, a, . . . , a)′ | a ∈ R}. Then the projection of y onto M is a vector (b, b, . . . , b)′ (∈ M) such that (y − (b, b, . . . , b)′) ⊥ M, i.e.,

    (y − (b, b, . . . , b)′)′ (a, a, . . . , a)′ = 0 for all a ∈ R.   (2.12)

Now (2.12) means that

    Σ_{i=1}^n (yi − b) a = a (Σ_{i=1}^n yi − nb) = 0 for all a ∈ R.   (2.13)

The only way that last equality can hold for all a is if

    Σ_{i=1}^n yi − nb = 0,   (2.14)

or

    b = (Σ_{i=1}^n yi)/n = ȳ.   (2.15)

Thus the projection is

    ŷ = (ȳ, ȳ, . . . , ȳ)′.   (2.16)

Extend this example to M = span{x} for any fixed nonzero vector x ∈ Rn. Because it is an element of M, ŷ = cx for some c, and for y − ŷ to be orthogonal to M, it must be orthogonal to x, that is,

    x′(y − cx) = 0.   (2.17)

Solve for c:

    x′y = c x′x  =⇒  c = x′y / x′x,   (2.18)

so that

    ŷ = (x′y / x′x) x.   (2.19)
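A minimal numerical sketch of (2.18)–(2.19), with made-up vectors (with x = 1n this reproduces (2.16)):

    import numpy as np

    y = np.array([2.0, 1.0, 4.0])
    x = np.array([1.0, 1.0, 1.0])          # with x = 1_n, the projection is the mean

    c = (x @ y) / (x @ x)                  # c = x'y / x'x, as in (2.18)
    y_hat = c * x                          # projection (2.19)
    print(y_hat)                           # [2.333..., 2.333..., 2.333...]

    # y - y_hat is orthogonal to x (Definition 8):
    print(np.isclose(x @ (y - y_hat), 0))  # True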

Next, consider the one-way ANOVA model (1.22),

    (y11, y12, y21, y22, y31, y32)′ =
        [ 1 0 0
          1 0 0
          0 1 0
          0 1 0
          0 0 1
          0 0 1 ] (β1, β2, β3)′ + e,   (2.20)


so that

    M = span{ (1, 1, 0, 0, 0, 0)′, (0, 0, 1, 1, 0, 0)′, (0, 0, 0, 0, 1, 1)′ }.   (2.21)

Now ŷ = (a, a, b, b, c, c)′ for some a, b, c. For y − ŷ to be orthogonal to M, it is enough that it be orthogonal to the spanning vectors of M.

Proposition 2 If M = span{x1, x2, . . . , xp}, then a ⊥ M if and only if a ⊥ xi for i = 1, . . . , p.

Proof. If a ⊥ M, then a ⊥ xi for each i because each xi is in M. So suppose a ⊥ xi for i = 1, . . . , p, and take any x ∈ M. By definition of span, x = c1x1 + · · · + cpxp, so that

    x′a = (c1x1 + · · · + cpxp)′a = c1x1′a + · · · + cpxp′a = 0,   (2.22)

because each xi′a = 0. □

Writing down the equations resulting from (y − (a, a, b, b, c, c)′)′x = 0 for x being each of the spanning vectors in (2.21) yields

y11 − a + y12 − a = 0

y21 − b + y22 − b = 0

y31 − c + y32 − c = 0. (2.23)

It is easy to solve for a, b, c:

    a = (y11 + y12)/2 ≡ ȳ1· ;   b = (y21 + y22)/2 ≡ ȳ2· ;   c = (y31 + y32)/2 ≡ ȳ3· .   (2.24)

These equations introduce the “dot” notation: When a variable has multiple subscripts, then replacing a subscript with a “·”, and placing a bar over the variable, denotes the average of the variable over that subscript.

2.1.2 The projection matrix

Rather than figuring out the projection for every y, one can find a matrix M that gives the projection.

Definition 9 For a vector space M, the matrix M such that ŷ = My for any y ∈ Rn is called the projection matrix.


The definition presumes that such an M exists and is unique for any M, which is true. In Proposition 4, we will construct the matrix. We will first reprise the examples in Section 2.1.1, exhibiting the projection matrix.

For M = span{1n}, the projection (2.16) replaces the elements of y with their average, so the projection matrix must satisfy

    My = (ȳ, ȳ, . . . , ȳ)′.   (2.25)

The only matrix that will accomplish that feat is

    M = [ 1/n  1/n  · · ·  1/n
          1/n  1/n  · · ·  1/n
           ⋮    ⋮           ⋮
          1/n  1/n  · · ·  1/n ] = (1/n) 1n 1n′.   (2.26)

Notice that 1n1n′ is the n × n matrix of all 1’s. Next, consider M = span{x} (x ≠ 0n). We have from (2.19), with a little rewriting,

    ŷ = (xx′ / x′x) y.   (2.27)

There is the matrix, i.e.,

    M = xx′ / x′x.   (2.28)

Note that if x = 1n, then (2.28) is the same as (2.26), because 1n′1n = n. For the M in (2.21), we have that the projection (2.24) replaces each element of y with the average of its group. In this case, each group has just 2 elements, so that

    M = [ 1/2  1/2   0    0    0    0
          1/2  1/2   0    0    0    0
           0    0   1/2  1/2   0    0
           0    0   1/2  1/2   0    0
           0    0    0    0   1/2  1/2
           0    0    0    0   1/2  1/2 ].   (2.29)

Next are some important properties of projection matrices. We need the n × n identity matrix, which is denoted In, and its columns inj:

    In = [ 1  0  · · ·  0
           0  1  · · ·  0
           ⋮  ⋮          ⋮
           0  0  · · ·  1 ] = (in1, in2, . . . , inn).   (2.30)


Proposition 3 The projection matrix M for vector space M has the following properties:

1. It exists;

2. It is unique;

3. It is symmetric, M = M′;

4. It is idempotent, MM = M;

5. For x ∈ M, Mx = x;

6. For x ⊥ M, Mx = 0n;

7. In −M is the projection matrix for M⊥.

Proof. Let înj be the unique projection of inj onto M, j = 1, . . . , n, and

    În = (în1, în2, . . . , înn).   (2.31)

This În is a projection matrix: Take y ∈ Rn and let ŷ = Îny. By definition of projection (Def. 8), each înj ∈ M, hence

    ŷ = Îny = y1în1 + y2în2 + · · · + ynînn ∈ M.   (2.32)

Also, each inj − înj is orthogonal to M, hence

    y − ŷ = (In − În)y = y1(in1 − în1) + y2(in2 − în2) + · · · + yn(inn − înn) ⊥ M.   (2.33)

Thus ŷ is the projection of y onto M, hence În is a projection matrix, proving item 1.

For item 2, suppose M is a projection matrix. Then Minj = înj for each j, that is, MIn = În, or M = În. Thus the projection matrix is unique.

For item 3, we know that the columns of În are orthogonal to the columns of In − În, hence

    (In − În)′În = 0, which =⇒ În = În′În.   (2.34)

The matrix În′În is symmetric, so În, i.e., M, is symmetric.

Skipping to item 5, if x ∈ M, then clearly the closest point to x in M is x itself, that is, x̂ = x, hence Mx = x. Then item 4 follows, because for any x, Mx = x̂ ∈ M, hence MMx = x̂ = Mx, and the uniqueness of the projection matrix shows that MM = M.

Item 6 is easy: If x ⊥ M, then x − 0n ⊥ M, and 0n ∈ M, hence x̂ = 0n. Item 7 follows because if x̂ is the projection of x onto M, x − x̂ is the projection onto M⊥ (since x − x̂ ∈ M⊥ and x − (x − x̂) ∈ (M⊥)⊥ = M). □

Those are all useful properties, but we still would like to know how to construct M. The next proposition gives a linear equation to solve to obtain the projection matrix. It is a generalization of the procedure we used in these examples.


Proposition 4 Suppose M = C(X), where X = (x1, . . . , xp). If {x1, . . . , xp} is a basis for M, then

    M = X(X′X)−1X′.   (2.35)

The proposition uses that X′X is invertible if the columns of X are linearly independent. We will show that later. We do note that even if the columns are not linearly independent, (X′X)−1 can be replaced by any generalized inverse, which we will mention later as well.

Proof of proposition. For any given x, let x̂ be its projection onto M, so that x̂ = Xb for some vector b. Because x − x̂ ⊥ M, x − x̂ ⊥ xj for each j, so that X′(x − x̂) = 0p, hence

    X′(x − Xb) = 0p, which =⇒ X′x = X′Xb =⇒ b = (X′X)−1X′x,   (2.36)

and

    x̂ = X(X′X)−1X′x.   (2.37)

Thus (2.35) holds. □

Compare (2.28) for p = 1 to (2.35). Also note that it is easy to see that this M is symmetric and idempotent. Even though the basis is not unique for a given vector space, the projection matrix is unique, hence any basis will yield the same X(X′X)−1X′.

It is interesting that any symmetric idempotent matrix M is a projection matrix for some vector space, that vector space being

    M = {Mx | x ∈ Rn}.   (2.38)
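Proposition 4 is easy to verify numerically. A sketch (made-up observations, numpy as the illustration tool) using the one-way ANOVA design of (2.20)–(2.21) with two observations per group:

    import numpy as np

    X = np.kron(np.eye(3), np.ones((2, 1)))        # columns are the spanning vectors in (2.21)
    M = X @ np.linalg.inv(X.T @ X) @ X.T           # projection matrix (2.35)

    print(np.allclose(M, M.T))                     # symmetric
    print(np.allclose(M @ M, M))                   # idempotent
    y = np.array([6.0, 0.0, 2.0, 8.0, 11.0, 13.0]) # made-up observations
    print(M @ y)                                   # each element replaced by its group mean, as in (2.29)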

2.2 Estimating coefficients

2.2.1 Coefficients

Return to the linear model (2.1),

y = Xβ + e, where E[e] = 0n. (2.39)

In this section we consider estimating β, or linear functions of β. In some sense, this task is less fundamental than estimating µ = E[Y], since the meaning of any βj depends not only on its corresponding column in X, but also on what other columns happen to be in X. For example, consider these five equivalent models for µ in the one-way ANOVA (1.22):

    µ = X1β1 = [ 1 0 0
                 1 0 0
                 0 1 0
                 0 1 0
                 0 0 1
                 0 0 1 ] (µ1, µ2, µ3)′

      = X2β2 = [ 1 1 0 0
                 1 1 0 0
                 1 0 1 0
                 1 0 1 0
                 1 0 0 1
                 1 0 0 1 ] (µ, α1, α2, α3)′

      = X3β3 = [ 1 1 0
                 1 1 0
                 1 0 1
                 1 0 1
                 1 0 0
                 1 0 0 ] (µ, α1, α2)′

      = X4β4 = [ 1  1  0
                 1  1  0
                 1  0  1
                 1  0  1
                 1 −1 −1
                 1 −1 −1 ] (µ, α1, α2)′

      = X5β5 = [ 1  1  1
                 1  1  1
                 1  1 −1
                 1  1 −1
                 1 −2  0
                 1 −2  0 ] (µ, γ1, γ2)′.   (2.40)

The µ1, µ2, µ3 in β1 are the means of the three groups, i.e., µj = E[Yij]. Comparing to β2, we see that µj = µ + αj, but we still do not have a good interpretation for µ and the α’s. For example, if µ = 0, then αj = µj, the mean of the jth group. But if µ = µ̄ ≡ (µ1 + µ2 + µ3)/3, the overall average, then αj = µj − µ̄, the “effect” of group j. Thus the presence of the vector of 1’s in X2 changes the meaning of the coefficients of the other vectors.

Now in β2, one may make the restriction that α3 = 0. Then one has the third model, and µ = µ3, α1 = µ1 − µ3, α2 = µ2 − µ3. Alternatively, a common restriction is that α1 + α2 + α3 = 0, so that the second formulation becomes the fourth:

    µ = X2β2 = [ 1 1 0 0
                 1 1 0 0
                 1 0 1 0
                 1 0 1 0
                 1 0 0 1
                 1 0 0 1 ] (µ, α1, α2, −α1 − α2)′ = [ 1  1  0
                                                      1  1  0
                                                      1  0  1
                                                      1  0  1
                                                      1 −1 −1
                                                      1 −1 −1 ] (µ, α1, α2)′.   (2.41)

Now µ = µ̄ and αj = µj − µ̄, the effect.


The final expression has

µ1 = µ + γ1 + γ2, µ2 = µ + γ1 − γ2, µ3 = µ − 2γ1, (2.42)

from which can be derived

    µ = µ̄,   γ1 = (1/3)((1/2)(µ1 + µ2) − µ3),   γ2 = (1/2)(µ1 − µ2).   (2.43)

Then, e.g., for the leprosy example (1.19), µ is the overall bacterial level, γ1 contrasts the average of the two drugs with the placebo, and γ2 contrasts the two drugs.

2.2.2 Least squares estimation of the coefficients

We know that the least squares estimate of the mean µ = E[Y] is µ̂ = ŷ, the projection of y onto M = C(X) in (2.39). It exists and is unique, because the projection is. A least squares estimate of β is one that yields the projection.

Definition 10 In the model y = Xβ + e, a least squares estimate β̂ of β is any vector for which

    ŷ = Xβ̂,   (2.44)

where ŷ is the projection of y onto C(X).

A least squares estimate of a linear combination λ′β, where λ ∈ Rp, is λ′β̂ for any least squares estimate β̂ of β.

A least squares estimate of β always exists, but it may not be unique. The condition for uniqueness is direct.

Proposition 5 The least squares estimate of β is unique if and only if the columns of X are linearly independent.

The proposition follows from Lemma 1, because if the columns of X are linearly independent, they form a basis for C(X), hence there is a unique set of β̂j’s that will solve

    ŷ = β̂1x1 + · · · + β̂pxp.   (2.45)

And if the columns are not linearly independent, there are many sets of coefficients that will yield the ŷ.

If the columns of X are linearly independent, then X′X is invertible. In that case, as in the proof of Proposition 4, equation (2.36), we have that

    ŷ = Xb for b = (X′X)−1X′y,   (2.46)

which means that the unique least squares estimate of β is

    β̂ = (X′X)−1X′y.   (2.47)
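A numerical sketch of (2.47), and of the non-uniqueness when X′X is singular (made-up data; the pseudo-inverse below simply picks one of the many least squares solutions, it is not part of the notes):

    import numpy as np

    X1 = np.kron(np.eye(3), np.ones((2, 1)))         # one-way ANOVA design, two obs per group
    y = np.array([6.0, 0.0, 2.0, 8.0, 11.0, 13.0])   # made-up data
    beta_hat = np.linalg.inv(X1.T @ X1) @ X1.T @ y   # group means, as in (2.24)
    print(beta_hat)                                  # [3., 5., 12.]

    # With the overparametrized X2 = (1_6, X1) of (2.50), X'X is singular and there are
    # many least squares estimates; the pseudo-inverse returns one of them.
    X2 = np.hstack([np.ones((6, 1)), X1])
    beta_any = np.linalg.pinv(X2) @ y
    print(np.allclose(X2 @ beta_any, X1 @ beta_hat)) # both reproduce the same y_hat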


For an example, go back to the one-way ANOVA model (2.41). We know from (2.24) that

    ŷ = (ȳ1·, ȳ1·, ȳ2·, ȳ2·, ȳ3·, ȳ3·)′.   (2.48)

Depending on the X matrix used, there may or may not be a unique estimate of the β. For example, for X1β1, the unique estimates of the group means are the sample group means: µ̂i = ȳi·. The third, fourth, and fifth formulations also have unique estimates. E.g., for the fifth, as in (2.43),

    µ̂ = ȳ··,   γ̂1 = (1/3)((1/2)(ȳ1· + ȳ2·) − ȳ3·),   γ̂2 = (1/2)(ȳ1· − ȳ2·).   (2.49)

Consider the second expression,

    (ȳ1·, ȳ1·, ȳ2·, ȳ2·, ȳ3·, ȳ3·)′ = X2β̂2 = [ 1 1 0 0
                                                 1 1 0 0
                                                 1 0 1 0
                                                 1 0 1 0
                                                 1 0 0 1
                                                 1 0 0 1 ] (µ̂, α̂1, α̂2, α̂3)′.   (2.50)

The columns of X2 are not linearly independent, so there are many possible estimates of the parameters, e.g.,

    (µ̂, α̂1, α̂2, α̂3)′ :  (0, ȳ1·, ȳ2·, ȳ3·)′,   (ȳ··, ȳ1· − ȳ··, ȳ2· − ȳ··, ȳ3· − ȳ··)′,
                           (ȳ3·, ȳ1· − ȳ3·, ȳ2· − ȳ3·, 0)′,   and   (83.72, ȳ1· − 83.72, ȳ2· − 83.72, ȳ3· − 83.72)′.   (2.51)

It is interesting that even if β does not have a unique estimate, some linear combinations of it do. In the example above, (2.51), µ and α1 do not have unique estimates, but µ + α1 = (1, 1, 0, 0)β does, i.e., ȳ1·. That is, all least squares estimates β̂ of β have the same (1, 1, 0, 0)β̂. Any contrast among the αi’s also has a unique estimate, e.g., α1 − α2 has unique estimate ȳ1· − ȳ2·, and α1 + α2 − 2α3 has unique estimate ȳ1· + ȳ2· − 2ȳ3·.

This discussion leads to the notion of estimability, which basically means that there is a unique least squares estimate. Here is the formal definition.

Definition 11 A linear combination λ′β is estimable if it has an unbiased linear estimator, that is, there exists an n × 1 vector a such that

E[a′Y ] = λ′β for all β ∈ Rp. (2.52)


With the above example, µ + α1 has λ = (1, 1, 0, 0)′, and taking a = (1/2, 1/2, 0, 0, 0, 0)′, so that a′y = ȳ1·, we have that

    E[a′Y] = (1/2)(E[Y11] + E[Y12]) = (1/2)(µ + α1 + µ + α1) = µ + α1   (2.53)

no matter what µ and the αi’s are. That is not the unique estimator. Note that Y11 alone works, also.

On the other hand, consider µ = λ′β with λ = (1, 0, 0, 0)′. Can we find a so that

    E[a′Y] = E[(a11, a12, a21, a22, a31, a32)Y]
           = (a11 + a12)(µ + α1) + (a21 + a22)(µ + α2) + (a31 + a32)(µ + α3)
           ≡ µ?   (2.54)

For that to occur, we need

a11 + a12 + a21 + a22 + a31 + a32 = 1, a11 + a12 = 0, a21 + a22 = 0, a31 + a32 = 0. (2.55)

Those equations are impossible to solve, since the last three imply that the sum of all the aij’s is 0, not 1. Thus µ is not estimable.

The next proposition systematizes how to check for estimability.

Proposition 6 In the model y = Xβ + e, with E[e] = 0n, λ′β is estimable if and only if there exists an n × 1 vector a such that

a′X = λ′. (2.56)

Proof. By Definition 11, since E[Y] = Xβ in (2.52), λ′β is estimable if and only if there exists a such that a′Xβ = λ′β for all β ∈ Rp. But that equality is equivalent to a′X = λ′. □

Note that the condition (2.56) means that λ is a linear combination of the rows of X, i.e., λ ∈ C(X′). (Or we could introduce the notation R(X) to denote the span of the rows.)

To see how that works in the example (2.50), look at

    C(X′) = span{ (1, 1, 0, 0)′, (1, 0, 1, 0)′, (1, 0, 0, 1)′ }.   (2.57)

(The rows of X have some duplicate vectors.) Then µ + α1 has λ = (1, 1, 0, 0)′, which is clearly in C(X′), since it is one of the basis vectors. (Similarly for µ + α2 and µ + α3.) The contrast α1 − 2α2 + α3 has λ = (0, 1, −2, 1)′. That vector is also in C(X′), since λ = (1, 1, 0, 0)′ − 2(1, 0, 1, 0)′ + (1, 0, 0, 1)′.


On the other hand, consider estimating µ, which has λ = (1, 0, 0, 0)′. To see that this vector is not in C(X′), we have to show that there are no a, b, c so that

    (1, 0, 0, 0)′ = a(1, 1, 0, 0)′ + b(1, 0, 1, 0)′ + c(1, 0, 0, 1)′ = (a + b + c, a, b, c)′.   (2.58)

Clearly equation (2.58) requires that a = b = c = 0, so a + b + c cannot equal 1. Thus µ is not estimable. Also, the individual αi’s are not estimable.
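The check "does λ lie in C(X′)?" can be done numerically. A sketch, assuming the overparametrized one-way ANOVA design of (2.50) with two observations per group (the helper `estimable` is hypothetical, not from the notes):

    import numpy as np

    X = np.hstack([np.ones((6, 1)), np.kron(np.eye(3), np.ones((2, 1)))])  # columns: 1, groups

    def estimable(lam, X, tol=1e-10):
        # Is there an a with X'a = lam?  Equivalent to lam being in the row space of X.
        a, res, *_ = np.linalg.lstsq(X.T, lam, rcond=None)
        return np.linalg.norm(X.T @ a - lam) < tol

    print(estimable(np.array([1.0, 1.0, 0.0, 0.0]), X))   # mu + alpha_1: True
    print(estimable(np.array([0.0, 1.0, -1.0, 0.0]), X))  # alpha_1 - alpha_2: True
    print(estimable(np.array([1.0, 0.0, 0.0, 0.0]), X))   # mu alone: False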

If λ′β is estimable, then there is a unique least squares estimate of it.

Proposition 7 If λ′β is estimable, the unique least squares estimate is λ′β̂ for any least squares estimate β̂ of β.

Of course, if there is a unique least squares estimate β̂ of β, then the unique least squares estimate of any λ′β is λ′β̂.

Proof of proposition. If λ′β is estimable, then by Proposition 6 there exists an a such that a′X = λ′. If β̂1 and β̂2 are least squares estimates of β, then by Definition 10, µ̂ = Xβ̂1 = Xβ̂2, hence a′Xβ̂1 = a′Xβ̂2, which implies that λ′β̂1 = λ′β̂2. Thus the least squares estimate of λ′β is unique. □

Later, we show that the least squares estimators are optimal under certain conditions.

If one can find an unbiased estimate of λ′β, then the least squares estimate uses the same linear combination, but of the ŷ.

Proposition 8 If a′y is an unbiased estimator of λ′β, then a′ŷ is the least squares estimate of λ′β.

Proof. As in (2.56), a′X = λ′. By (2.44), any least squares estimate β̂ of β satisfies ŷ = Xβ̂. Thus a′ŷ = a′Xβ̂ = λ′β̂, which is the least squares estimate of λ′β. □

For example, consider the one-way ANOVA model (2.41). The µ + α1 is estimable, and, e.g., y11 is an unbiased estimate, which has a = (1, 0, 0, 0, 0, 0)′. So by the proposition, a′ŷ is the least squares estimate. From (2.48), we know that ŷij = ȳi·, hence a′ŷ = ȳ1· is the least squares estimate of µ + α1. Note that there are other a’s, e.g., y12 is also an unbiased estimate, and has a∗ = (0, 1, 0, 0, 0, 0)′. Although this a∗ is different from a, a∗′ŷ = ȳ1·, too. That has to be true, since by Proposition 7 the least squares estimate is unique, but we also see that starting with any unbiased a′y, we can find the least squares estimate by replacing y with ŷ.


2.2.3 Example: Leprosy

Below are data on leprosy patients (from Snedecor and Cochran, Statistical Methods). There were 30 patients, randomly allocated to three groups of 10. The first group received drug A, the second drug D, and the third group received a placebo. Each person had the bacterial count taken before and after receiving the treatment.

Drug A Drug D Placebo

Before After Before After Before After

11 6 6 0 16 13

8 0 6 2 13 10

5 2 7 3 11 18

14 8 8 1 9 5

19 11 18 18 21 23

6 4 8 4 16 12

10 13 19 14 12 5

6 1 8 9 12 16

11 8 5 1 7 1

3 0 15 9 12 20

First, consider the one-way ANOVA, with the “after” measurements as the y’s, ignoring the “before” measurements. The model is

    y = (y11, . . . , y1,10, y21, . . . , y2,10, y31, . . . , y3,10)′
      = [ 1 1 0 0        (ten rows of 1 1 0 0, for drug A)
          ⋮
          1 0 1 0        (ten rows of 1 0 1 0, for drug D)
          ⋮
          1 0 0 1        (ten rows of 1 0 0 1, for the placebo)
          ⋮         ] (µ, α1, α2, α3)′ + e.   (2.59)

The sample means are ȳ1· = 5.3, ȳ2· = 6.1, ȳ3· = 12.3. Suppose we are interested in the two contrasts α1 − α2, comparing the two drugs, and (α1 + α2)/2 − α3, comparing the placebo to the average of the two drugs. The least squares estimates are found by taking the same contrasts of the sample means:

    α̂1 − α̂2 = 5.3 − 6.1 = −0.8,   (α̂1 + α̂2)/2 − α̂3 = (5.3 + 6.1)/2 − 12.3 = −6.6.   (2.60)
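These contrast estimates are just linear combinations of the group means of the "after" column of the table above; a minimal sketch (data copied from the table, numpy used only for illustration):

    import numpy as np

    after = np.array([[ 6,  0,  2,  8, 11,  4, 13,  1,  8,  0],   # Drug A
                      [ 0,  2,  3,  1, 18,  4, 14,  9,  1,  9],   # Drug D
                      [13, 10, 18,  5, 23, 12,  5, 16,  1, 20]],  # Placebo
                     dtype=float)

    ybar = after.mean(axis=1)                               # group means: 5.3, 6.1, 12.3
    print(round(ybar[0] - ybar[1], 3))                      # alpha_1 - alpha_2: -0.8
    print(round((ybar[0] + ybar[1]) / 2 - ybar[2], 3))      # (alpha_1+alpha_2)/2 - alpha_3: -6.6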


There doesn’t appear to be much difference between the two drugs, but their average seems better than the placebo. Is it significantly different? That question will be addressed in the next chapter. We need standard errors for the estimates.

What about the “before” measurements? The sample means of the before measurements are 9.3, 10, and 12.9, respectively. Thus, by chance, the placebo group happened to get people who were slightly worse off already, so it might be important to make some adjustments. A simple one would be to take the y’s as after − before, which is very reasonable in this case. Instead, we will look at the analysis of covariance model (1.38), which adds in the before measurements (the covariates) as zij’s:

    y = [ 1 1 0 0 z11
          ⋮
          1 1 0 0 z1,10
          1 0 1 0 z21
          ⋮
          1 0 1 0 z2,10
          1 0 0 1 z31
          ⋮
          1 0 0 1 z3,10 ] (µ, α1, α2, α3, γ)′ + e
      = [ 1 1 0 0 11
          1 1 0 0  8
          ⋮
          1 1 0 0  3
          1 0 1 0  6
          1 0 1 0  6
          ⋮
          1 0 1 0 15
          1 0 0 1 16
          1 0 0 1 13
          ⋮
          1 0 0 1 12 ] (µ, α1, α2, α3, γ)′ + e.   (2.61)

We want to estimate the same contrasts. Start with α1 − α2. Is it estimable? We need

to show that there is an a such that a′X = (0, 1, −1, 0, 0). We can do it with just three rows of X, the first two and the eleventh. That is, we want to find a, b, c so that

a(1, 1, 0, 0, 11) + b(1, 1, 0, 0, 8) + c(1, 0, 1, 0, 6) = (0, 1,−1, 0, 0), (2.62)

or

a + b + c = 0

a + b = 1

c = −1

11a + 8b + 6c = 0 (2.63)

The second and third equations imply the first, and the third is that c = −1; hence, using b = 1 − a, the fourth equation yields a = −2/3, hence b = 5/3. Thus

    −(2/3)y11 + (5/3)y12 − y21 is an unbiased estimate of α1 − α2.   (2.64)


The least squares estimate replaces the yij’s with their hats, the ŷij’s. We know in principle the projection ŷ from Question 4 of HW #2 (where now we have 10 instead of 2 observations in each group). That is, the projection vector has elements of the form

    ŷ1j = a + d z1j ,   ŷ2j = b + d z2j ,   ŷ3j = c + d z3j .   (2.65)

The constants are

    a = ȳ1· − d z̄1· ,   b = ȳ2· − d z̄2· ,   c = ȳ3· − d z̄3· ,   (2.66)

hence

    ŷij = ȳi· + d (zij − z̄i·).   (2.67)

The d is

    d = Σ³ᵢ₌₁ Σ¹⁰ⱼ₌₁ (yij − ȳi·) zij  /  Σ³ᵢ₌₁ Σ¹⁰ⱼ₌₁ (zij − z̄i·) zij .   (2.68)

Plugging in the data, we obtain d = 585.4/593 = 0.987. Back to estimating α1 − α2: substitute the ŷij’s of (2.67) for the yij’s in (2.64) to get the least squares estimate

    −(2/3)ŷ11 + (5/3)ŷ12 − ŷ21
        = −(2/3)(ȳ1· + d (z11 − z̄1·)) + (5/3)(ȳ1· + d (z12 − z̄1·)) − (ȳ2· + d (z21 − z̄2·))
        = (ȳ1· − d z̄1·) − (ȳ2· − d z̄2·) + d (−(2/3)z11 + (5/3)z12 − z21)
        = (ȳ1· − d z̄1·) − (ȳ2· − d z̄2·)
        = 5.3 − 0.987(9.3) − (6.1 − 0.987(10))
        = −0.109.   (2.69)

The third line comes from the second line since −(2/3)z11 + (5/3)z12 − z21 = −(2/3)11 + (5/3)8 − 6 = 0. Notice that the least squares estimate is the same as that without covariates, but using adjusted (ȳi· − d z̄i·)’s instead of plain ȳi·’s.

The unadjusted estimate (not using the covariate) was −0.8, so the adjusted estimate is even smaller.

A similar procedure will show that the least squares estimate of (α1 + α2)/2 − α3 is

    ((ȳ1· − d z̄1·) + (ȳ2· − d z̄2·))/2 − (ȳ3· − d z̄3·) = −3.392.   (2.70)

This value is somewhat less (in absolute value) than the unadjusted estimate −6.6.
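The adjusted (analysis of covariance) computations (2.66)–(2.70) can be reproduced from the table; a sketch using the "before" counts as the covariate z (data copied from the table, rounding mine):

    import numpy as np

    after  = np.array([[ 6,  0,  2,  8, 11,  4, 13,  1,  8,  0],
                       [ 0,  2,  3,  1, 18,  4, 14,  9,  1,  9],
                       [13, 10, 18,  5, 23, 12,  5, 16,  1, 20]], dtype=float)
    before = np.array([[11,  8,  5, 14, 19,  6, 10,  6, 11,  3],
                       [ 6,  6,  7,  8, 18,  8, 19,  8,  5, 15],
                       [16, 13, 11,  9, 21, 16, 12, 12,  7, 12]], dtype=float)

    ybar = after.mean(axis=1, keepdims=True)
    zbar = before.mean(axis=1, keepdims=True)
    d = np.sum((after - ybar) * before) / np.sum((before - zbar) * before)   # (2.68)
    adj = (ybar - d * zbar).ravel()                   # adjusted means, ybar_i - d * zbar_i
    print(round(d, 3))                                # 0.987
    print(round(adj[0] - adj[1], 3))                  # alpha_1 - alpha_2: about -0.109
    print(round((adj[0] + adj[1]) / 2 - adj[2], 3))   # about -3.392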


Chapter 3

Variances and Covariances

The previous chapter focussed on finding unbiased estimators for µ, β, and λ′β. In this chapter we tackle standard errors and variances, and in particular look for estimators with small variance. The variance of a random variable Z is Var(Z) = E[(Z − µZ)²], where µZ = E[Z]. [Note: These are defined when the expectations exist.] With two variables, Y1 and Y2, there is the covariance:

Cov(Y1, Y2) = E[(Y1 − µ1)(Y2 − µ2)], where µ1 = E(Y1) and µ2 = E(Y2). (3.1)

The covariance of a variable with itself is the variance.

3.1 Covariance matrices for affine transformations

Defining the mean of a vector or of a matrix is straightforward: it is just the vector or matrix of means. That is, as in (2.2), for the vector Y = (Y1, . . . , Yn)′,

    E(Y) = (E(Y1), E(Y2), . . . , E(Yn))′,   (3.2)

and for an n × p matrix W,

    E[W] = E[ W11  W12  · · ·  W1p        [ E(W11)  E(W12)  · · ·  E(W1p)
              W21  W22  · · ·  W2p    =     E(W21)  E(W22)  · · ·  E(W2p)
               ⋮    ⋮           ⋮            ⋮        ⋮             ⋮
              Wn1  Wn2  · · ·  Wnp ]        E(Wn1)  E(Wn2)  · · ·  E(Wnp) ].   (3.3)

Turning to variances, an n × 1 vector Y = (Y1, . . . , Yn)′ has n variances (the Var(Yi)’s), and several covariances Cov(Yi, Yj). These are all conveniently arranged in the covariance matrix, defined for a vector Y to be the n × n matrix Cov(Y) whose ijth element is Cov(Yi, Yj). It is often denoted Σ:



    Cov(Y) = Σ = [ Var(Y1)       Cov(Y1, Y2)   · · ·   Cov(Y1, Yn)
                   Cov(Y2, Y1)   Var(Y2)       · · ·   Cov(Y2, Yn)
                    ⋮              ⋮                     ⋮
                   Cov(Yn, Y1)   Cov(Yn, Y2)   · · ·   Var(Yn) ].   (3.4)

The diagonals are the variances, and the matrix is symmetric because Cov(Yi, Yj) = Cov(Yj, Yi). Analogous to the definition of variance, an equivalent definition of the covariance matrix is

    Cov(Y) = E[(Y − µ)(Y − µ)′], where µ = E(Y).   (3.5)

The variances and covariances of linear combinations are often needed in this course, e.g., the least squares estimates are linear combinations of the yi’s. Fortunately, the means and (co)variances of linear combinations are easy to obtain from those of the originals. With just one variable Z, we know that for any a and b,

E[a + bZ] = a + b E[Z] and Var[a + bZ] = b² Var[Z]. (3.6)

Note that the constant a does not affect the variation. Turn to an n × 1 vector, Z, and consider the affine transformation

W = a + BZ (3.7)

for some m × 1 vector a and m × n matrix B, so that W is m × 1. (A linear transformation would be BZ. The word “affine” pops up because of the additional constant vector a.) Expected values are linear, so

E[W ] = E[a + B Z] = a + B E[Z]. (3.8)

For the covariance, start by noting that

W − E[W ] = (a + B Z) − (a + B E[Z]) = B(Z − E[Z]), (3.9)

so that

Cov[W ] = Cov[a + BZ] = E[(W − E[W ])(W − E[W ])′]

= E[B(Z − E[Z])(Z − E[Z ])′B′]

= B E[(Z − E[Z])(Z − E[Z])′] B′

= B Cov[Z] B′. (3.10)

Compare these formulas to the univariate ones, (3.6).
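Formulas (3.8) and (3.10) are easy to check by simulation. A sketch with a made-up mean and covariance (the numbers and simulation approach are illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
    Z = rng.multivariate_normal([1.0, -1.0], Sigma, size=200_000)

    a = np.array([3.0, 0.0, 1.0])
    B = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]])
    W = a + Z @ B.T                                   # each row is a + B z

    print(np.round(W.mean(axis=0), 2))                # approx a + B E[Z] = [2, -1, 3]
    print(np.round(np.cov(W, rowvar=False), 2))       # approx B Sigma B'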


3.2 Covariances of estimates

Whether estimating µ, β, or λ′β, it is important to also estimate the corresponding standard errors. (The standard error of an estimator θ̂ is its standard deviation, se(θ̂) = √Var[θ̂].) So far we have assumed that E[e] = 0n, but nothing about the variances or covariances of the residuals. For most (if not all) of the course we will assume that the residuals are independent and have equal variances. Later we will also add the assumption of normality. As with all assumptions, one should perform diagnostic checks.

If two variables are independent, then their covariance is 0. Thus under the assumption that e1, . . . , en are independent and have the same variance σe² = Var[ei], the linear model becomes

y = Xβ + e, where E[e] = 0n, Cov[e] = σe²In. (3.11)

With this assumption, it is easy to write down the desired covariances. First, for the least squares estimate µ̂ of µ, which is the projection ŷ = My, where M is the projection matrix for C(X),

    Cov[µ̂] = Cov[My]
            = Cov[Me]          because Xβ is a constant
            = M Cov[e] M′      by (3.10)
            = M σe²In M′       by (3.11)
            = σe² M            because MM′ = M.   (3.12)

Now if λ′β is estimable, then, as in (2.52), there exists an a such that E[a′Y] = λ′β. The variance of this estimator is then

    Var[a′Y] = a′ Cov(Y) a = a′ σe²In a = σe²‖a‖².   (3.13)

This a need not be unique, but there is only one a that yields the least squares estimate. Below we find the variance of that estimator in general, but first do it in the case that X′X is invertible. In that case, we know from (2.47) that the least squares estimate of β is

β̂ = (X′X)−1X′y, (3.14)

hence

    Cov(β̂) = (X′X)−1X′[σe²In]X(X′X)−1 = σe²(X′X)−1.   (3.15)

Then

    Var(λ′β̂) = σe²λ′(X′X)−1λ.   (3.16)

Consider the one-way ANOVA model again, with three groups and 2 observations per group: yij = µ + αi + eij , i = 1, 2, 3, j = 1, 2. Let λ′β = (α1 + α2)/2 − α3, so that λ′ = (0, 1/2, 1/2, −1). This linear combination is estimable, e.g., (y11 + y21)/2 − y31 is an unbiased estimate, which has a′ = (1/2, 0, 1/2, 0, −1, 0). Thus

    Var[(Y11 + Y21)/2 − Y31] = σe²‖a‖² = σe²((1/2)² + 0² + (1/2)² + 0² + (−1)² + 0²) = (3/2)σe².   (3.17)


This estimate is not the least squares estimate, but we know from Proposition 8 that if a′y is unbiased, then a′ŷ is the least squares estimate. The projection is ŷij = ȳi·, so the least squares estimate is (ȳ1· + ȳ2·)/2 − ȳ3·, and

    Var[(ȳ1· + ȳ2·)/2 − ȳ3·] = (1/4)Var[ȳ1·] + (1/4)Var[ȳ2·] + Var[ȳ3·] = (1/4 + 1/4 + 1)(1/2)σe² = (3/4)σe²,   (3.18)

because Var[ȳi·] = σe²/2. Note that this variance is half that of the other estimator (3.17). The a vector for the least squares estimate can be seen to be a∗ = (1/4, 1/4, 1/4, 1/4, −1/2, −1/2)′, which indeed has ‖a∗‖² = 4(1/4)² + 2(1/2)² = 3/4. Now compare the two a’s:

    a = (1/2, 0, 1/2, 0, −1, 0)′,   a∗ = (1/4, 1/4, 1/4, 1/4, −1/2, −1/2)′.   (3.19)

What relationship is there? Notice that a∗ is in C(X), but a is not. In fact, a∗ is the projection of a onto C(X). That is not just a coincidence, as shown in the next result.

Proposition 9 If λ′β is estimable, and a′y is an unbiased estimate, then the least squares estimate is â′y, where â is the projection of a onto C(X).

Proof. By Proposition 8, a′ŷ is the least squares estimate, but ŷ = My, where M is the projection matrix for C(X), hence

    a′ŷ = a′My = (Ma)′y = â′y. □   (3.20)

Because for an estimable function the least squares estimate is unique, â must be unique; that is, any a that yields an unbiased estimate has the same projection. Also, we noted that the least squares estimate had a lower variance than the alternative in the above example. This feature is also general.

Definition 12 A linear estimator of a parameter θ is the best linear unbiased estimator, or BLUE, if it is unbiased, and any other unbiased linear estimator has a variance at least as large.

Theorem 1 (Gauss-Markov) In the model (3.11) with σe² > 0, if λ′β is estimable, then the least squares estimate is the unique BLUE.

Proof. For any a ∈ Rn,

    ‖a‖² = ‖â + (a − â)‖² = ‖â‖² + ‖a − â‖²,   (3.21)


because â ⊥ (a − â), hence

    ‖a‖² > ‖â‖² unless a = â.   (3.22)

Thus if a′y is an unbiased estimator of λ′β that is not the least squares estimate, then a ≠ â, â′y is the least squares estimate by Proposition 9, and

    Var(a′y) = σe²‖a‖² > σe²‖â‖² = Var(â′y).   (3.23)

That is, any unbiased linear estimate has larger variance than the least squares estimate. □

Thus we have established that the least squares estimate is best in terms of variance. The next section develops estimates of the variance.

3.3 Estimating the variance

In the model (3.11), σe² is typically a parameter to be estimated. Because the residuals have mean zero, E(ei) = 0, we have that σe² = Var(ei) = E(ei²), so that

    E[ (Σ_{i=1}^n ei²) / n ] = σe².   (3.24)

Unfortunately, we do not observe the actual ei’s, because e = y − Xβ, and β is not observed. Thus we have to estimate e, which we can do by plugging in the (or a) least squares estimate of β:

ê = y − Xβ̂ = y − ŷ = (In − M)y, (3.25)

where M is the projection matrix for C(X). Note that ê is the projection of y onto C(X)⊥. See Proposition 3. Because E[y] = E[ŷ] = Xβ, E[ê] = 0n; hence, as for the ei’s, Var[êi] = E[êi²]; but unlike the ei’s, the êi’s do not typically have variance σe². Rather, the variance of êi is σe² times the ith diagonal of (In − M). Then

    E[Σ_{i=1}^n êi²] = Σ_{i=1}^n Var[êi] = σe² × [sum of diagonals of (In − M)] = σe² trace(In − M).   (3.26)

Thus an unbiased estimator of σe² is

    σ̂e² = (Σ_{i=1}^n êi²) / trace(In − M) = ‖ê‖² / trace(In − M).   (3.27)

It is easy enough to calculate the trace of a matrix, but the trace of a projection matrix is actually the rank of the corresponding vector space.

Proposition 10 The rank of a vector space M is trace(M), where M is the projection matrix for M.


Proof. Suppose rank(M) = p, and let {x1, . . . , xp} be a basis for M. Then with X = (x1, . . . , xp) (so that M = C(X)), the projection matrix is M = X(X′X)−1X′. The X′X is invertible because the columns of X are linearly independent. But then using the fact that trace(AB) = trace(BA),

    trace(M) = trace(X(X′X)−1X′) = trace((X′X)−1(X′X)) = trace(Ip) = p,   (3.28)

since X′X is p × p. □

Thus we have from (3.27) that

    σ̂e² = ‖ê‖² / (n − p),   p = rank(C(X)).   (3.29)

Now everything is set to estimate the standard error of estimates.
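A sketch of (3.25)–(3.29) in code, for the small one-way ANOVA fit with made-up observations used earlier (illustration only):

    import numpy as np

    X = np.kron(np.eye(3), np.ones((2, 1)))          # basis for C(X), so p = 3
    y = np.array([6.0, 0.0, 2.0, 8.0, 11.0, 13.0])

    M = X @ np.linalg.inv(X.T @ X) @ X.T             # projection matrix
    e_hat = (np.eye(6) - M) @ y                      # residuals: the projection onto C(X)-perp
    n, p = X.shape
    sigma2_hat = (e_hat @ e_hat) / (n - p)           # unbiased estimate (3.29)
    print(sigma2_hat)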

3.4 Example: Leprosy

3.4.1 Without covariate

Continue the example from Section 2.2.3. Start with the model without the covariate, (2.59), and consider the first contrast α1 − α2, which compares the two drugs. From (2.60), the least squares estimate is

    α̂1 − α̂2 = ȳ1· − ȳ2· = −0.8.   (3.30)

This estimate is a′y with a = (1/10, . . . , 1/10, −1/10, . . . , −1/10, 0, . . . , 0)′, where each value is repeated ten times. Then

    Var[α̂1 − α̂2] = σe²‖a‖² = σe²(10(1/10)² + 10(1/10)²) = (1/5)σe².   (3.31)

To estimate σe², we need ‖ê‖² and p, the rank of C(X). The columns of X in (2.59) are not linearly independent, because the first, 1n, is the sum of the other three. Thus we can eliminate the first, and the remaining are linearly independent, so p = 3. The projection is ŷ = (ȳ1·, . . . , ȳ1·, ȳ2·, . . . , ȳ2·, ȳ3·, . . . , ȳ3·)′, so that

    σ̂e² = ‖ê‖²/(n − p) = ‖y − ŷ‖²/(n − p) = Σ³ᵢ₌₁ Σ¹⁰ⱼ₌₁ (yij − ȳi·)² / (n − p) = 995.1/(30 − 3) = 36.856.   (3.32)

Now the estimate of the standard error of the estimate α̂1 − α̂2 is √((1/5)σ̂e²) = √(36.856/5) = 2.715. Thus, with the estimate given in (3.30), we have an approximate 95% confidence interval for α1 − α2 being

    (α̂1 − α̂2 ± 2 se) = (−0.8 ± 2(2.715)) = (−6.23, 4.63).   (3.33)


This interval is fairly wide, containing 0, which suggests that there is no evidence of a difference between the two drugs. Equivalently, we could look at the approximate z-statistic,

    z = (α̂1 − α̂2)/se = −0.8/2.715 = −0.295,   (3.34)

and note that it is well less than 2 in absolute value. (Later, we will refine these inferences by replacing the “2” with a t-value, at least when assuming normality of the residuals.)

For the contrast comparing the average of the two drugs to the control, (α1 + α2)/2 − α3, we again have from (2.60) that

    (α̂1 + α̂2)/2 − α̂3 = (ȳ1· + ȳ2·)/2 − ȳ3· = (5.3 + 6.1)/2 − 12.3 = −6.6.   (3.35)

The a for this estimate is a = (1/20, . . . , 1/20, −1/10, . . . , −1/10)′, where there are 20 1/20’s and 10 −1/10’s. Thus ‖a‖² = 20/20² + 10/10² = 0.15, and se = √(36.856 × 0.15) = 2.35, and the approximate confidence interval is

    ((α̂1 + α̂2)/2 − α̂3 ± 2 se) = (−6.6 ± 2(2.35)) = (−11.3, −1.9).   (3.36)

This interval is entirely below 0, which suggests that the drugs are effective relative to the placebo. (Or look at z = −6.6/2.35 = −2.81.)

3.4.2 With covariate

The hope is that by using the before measurements, the parameters can be estimated more accurately. In this section we use model (2.61), y_{ij} = µ + α_i + γ z_{ij} + e_{ij}. Because the patients were randomly allocated to the three groups, and the z_{ij}’s were measured before treatment, the µ and α_i’s have the same interpretation in both models (with and without covariates). However, the σ²_e is not the same.

The projections now are, from (2.67), ŷ_{ij} = ȳ_{i·} + d(z_{ij} − z̄_{i·}), where d = γ̂ = 0.987. The dimension of X is now p = 4, because, after removing the 1_n vector, the remaining four are linearly independent. (They would not be linearly independent if the z_{ij}’s were the same within each group, but clearly they are not.) Then

σ̂²_e = ∑_{i=1}^3 ∑_{j=1}^{10} (y_{ij} − ȳ_{i·} − 0.987(z_{ij} − z̄_{i·}))² / (30 − 4) = 16.046. (3.37)

Note how much smaller this estimate is than the 36.856 in (3.32).

If we were to follow the procedure in the previous section, we would need to find the a’s for the estimates, then their lengths, in order to find the standard errors. Instead, we will go the other route, using (3.16) to find the variances: σ²_e λ′(X′X)⁻¹λ. In order to proceed, we need X′X to be invertible, which at present is not true. We need to place a restriction on the parameters so that the matrix is invertible, but also in such a way that the meaning of the contrasts is the same. One way is to simply set µ = 0, so that

y = Xβ + e =
[ 1 0 0 11 ]
[ 1 0 0  8 ]
[ ⋮ ⋮ ⋮  ⋮ ]
[ 1 0 0  3 ]
[ 0 1 0  6 ]
[ 0 1 0  6 ]
[ ⋮ ⋮ ⋮  ⋮ ]
[ 0 1 0 15 ]
[ 0 0 1 16 ]
[ 0 0 1 13 ]
[ ⋮ ⋮ ⋮  ⋮ ]
[ 0 0 1 12 ]
 (α_1, α_2, α_3, γ)′ + e. (3.38)

Now

X′X =
[ 10    0    0    93 ]
[  0   10    0   100 ]
[  0    0   10   129 ]
[ 93  100  129  4122 ]
 and (X′X)⁻¹ =
[  0.2459   0.1568   0.2023  −0.0157 ]
[  0.1568   0.2686   0.2175  −0.0169 ]
[  0.2023   0.2175   0.3806  −0.0218 ]
[ −0.0157  −0.0169  −0.0218   0.0017 ]. (3.39)

For α_1 − α_2, we have from (2.69) the estimate −0.109. For this β, λ = (1, −1, 0, 0)′, hence

Var(α̂_1 − α̂_2) = σ²_e λ′(X′X)⁻¹λ = σ²_e (0.2459 − 2 × 0.1568 + 0.2686) = σ²_e (0.201). (3.40)

Using the estimate in (3.37), we have that

se(α̂_1 − α̂_2) = √(16.046 × 0.201) = 1.796, (3.41)

hence

z = −0.109/1.796 = −0.061, (3.42)

which is again quite small, showing no evidence of a difference between drugs.

For the drug versus control contrast, from (2.70), we have the estimate (α̂_1 + α̂_2)/2 − α̂_3 = −3.392. Now λ = (1/2, 1/2, −1, 0)′, hence

se = √(16.046 × (1/2, 1/2, −1, 0)(X′X)⁻¹(1/2, 1/2, −1, 0)′) = √(16.046 × 0.1678) = 1.641. (3.43)

Compare this se to that without the covariate, 2.35. It is substantially smaller, showing that the covariate does help improve accuracy in this example.


Now

z = −3.392/1.641 = −2.07. (3.44)

This is marginally significant, suggesting there may be a drug effect. However, it is a bit smaller (in absolute value) than the z = −2.81 calculated without covariates. Thus the covariate is also important in adjusting for the fact that the controls had somewhat less healthy patients initially.
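The standard-error machinery of this section is straightforward to code. Here is a minimal Python sketch of the λ′(X′X)⁻¹λ route; the before scores z and after scores y are placeholder arrays (the raw leprosy measurements are not reproduced here), so the printed numbers will not match the text, but the steps do.

import numpy as np

# Placeholder data: z = "before" scores, y = "after" scores, 10 per group (NOT the real data)
rng = np.random.default_rng(1)
z = rng.integers(1, 20, size=30).astype(float)
y = 0.9 * z + np.repeat([-3.9, -3.8, -0.4], 10) + rng.normal(scale=4, size=30)

# Design matrix as in (3.38): group indicators (mu set to 0) plus the covariate z
G = np.repeat(np.eye(3), 10, axis=0)
X = np.column_stack([G, z])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = (resid @ resid) / (len(y) - X.shape[1])   # n - p = 30 - 4

lam = np.array([0.5, 0.5, -1.0, 0.0])                  # (alpha1+alpha2)/2 - alpha3
est = lam @ beta_hat
se = np.sqrt(sigma2_hat * (lam @ XtX_inv @ lam))
print(est, se, est / se)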


Chapter 4

Distributions: Normal, χ2, and t

The previous chapter presented the basic estimates of parameters in the linear models. In this chapter we add the assumption of normality to the residuals, which allows us to provide more formal confidence intervals and hypothesis tests. The central distribution is the multivariate normal, from which the χ², t, and F are derived.

4.1 Multivariate Normal

The standard normal distribution for random variable Z is the familiar bell-shaped curve. The density is

f(z) = (1/√(2π)) e^{−z²/2}. (4.1)

The mean is 0 and variance is 1. The more general normal distribution, with arbitrary mean µ and variance σ², is written X ∼ N(µ, σ²), and has density

f(x; µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} (4.2)

when σ² > 0. If σ² = 0, then X equals µ with probability 1, that is, X is essentially a constant.

The normal works nicely with affine transformations. That is,

X ∼ N(µ, σ²) =⇒ a + bX ∼ N(a + bµ, b²σ²). (4.3)

We already know that E[a + bX] = a + bµ and Var[a + bX] = b²σ², but the added property in (4.3) is that if X is normal, so is a + bX. It is not hard to show using the change-of-variable formula.

We will assume that the residuals e_i are independent N(0, σ²_e) random variables, which will imply that the y_i’s are independent normals as well. We also need the distributions of the vectors such as ŷ, ê and β̂. It will turn out that under the assumptions, the individual components of these vectors are normal, but they are typically not independent (because their covariance matrices are not diagonal). Thus we need a distribution for the entire vector. This distribution is the multivariate normal, which we now define as an affine transformation of independent standard normals.

Definition 13 An n × 1 vector W has a multivariate normal distribution if for some n × 1 vector a and n × q matrix B,

W = a + BZ, (4.4)

where Z = (Z_1, . . . , Z_q)′, the Z_i’s being independent standard normal random variables.

If (4.4) holds, then W is said to be multivariate normal with mean µ and covariance Σ, where µ = a and Σ = BB′, written

W ∼ N_n(µ, Σ). (4.5)

The elements of the vector Z in the definition all have E[Z_i] = 0 and Var[Z_i] = 1, and they are independent, hence E[Z] = 0_q and Cov[Z] = I_q. Thus, as in (4.3), that µ = E[W] = E[a + BZ] = a and Σ = Cov[W] = Cov[a + BZ] = BB′ follows from (3.8) and (3.10). The added fillip is the multivariate normality. Note that by taking a = 0_q and B = I_q, we have that

Z ∼ N_q(0_q, I_q). (4.6)

The definition presumes that the distribution is well-defined. That is, two different B’s could yield the same Σ, so how can one be sure the distributions are the same? For example, suppose n = 2, and consider the two matrices

B_1 = [ 1 0 1 ; 0 1 1 ]  and  B_2 = [ √(3/2)  1/√2 ; 0  √2 ]. (4.7)

Certainly

B_1B_1′ = B_2B_2′ = [ 2 1 ; 1 2 ] = Σ, (4.8)

but is it clear that

a + B_1(Z_1, Z_2, Z_3)′  and  a + B_2(Z_1, Z_2)′ (4.9)

have the same distribution? Not only are they different linear combinations, but they are linear combinations of different numbers of standard normals. So it certainly is not obvious, but they do have the same distribution. This result depends on the normality of the Z_i’s. It can be proved using moment generating functions. Similar results do not hold for other Z_i’s, e.g., Cauchy or exponential.

The next question is, “What µ’s and Σ’s are valid?” Any µ ∈ R^n is possible, since a is arbitrary. But the possible matrices Σ are restricted. For one, BB′ is symmetric for any B, so Σ must be symmetric, but we already knew that because all covariance matrices are symmetric. We also need Σ to be nonnegative definite, which we deal with next.
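A quick way to convince yourself of the claim about (4.7)–(4.9) is to simulate. The following Python sketch (a toy check, not a proof) draws many samples of a + B_1Z and a + B_2Z and compares their sample covariance matrices; both should be close to Σ = [ 2 1 ; 1 2 ]. The vector a and the simulation size are arbitrary choices.

import numpy as np

rng = np.random.default_rng(2)
a = np.array([1.0, -2.0])
B1 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0]])
B2 = np.array([[np.sqrt(1.5), 1/np.sqrt(2)],
               [0.0,          np.sqrt(2)]])

n_sim = 200_000
W1 = a + rng.standard_normal((n_sim, 3)) @ B1.T   # a + B1 Z, Z ~ N_3(0, I)
W2 = a + rng.standard_normal((n_sim, 2)) @ B2.T   # a + B2 Z, Z ~ N_2(0, I)

# Both sample covariances should be near [[2, 1], [1, 2]]
print(np.cov(W1, rowvar=False).round(2))
print(np.cov(W2, rowvar=False).round(2))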


Definition 14 A symmetric n × n matrix A is nonnegative definite if

x′Ax ≥ 0 for all x ∈ R^n. (4.10)

The matrix is positive definite if

x′Ax > 0 for all x ∈ R^n, x ≠ 0_n. (4.11)

All covariance matrices are nonnegative definite. To see this fact, suppose Cov(Y) = Σ. Then for any vector x (of the right dimension),

x′Σx = Var(x′Y) ≥ 0, (4.12)

because variances are always nonnegative. Not all covariances are positive definite, though. For example, we know that for our models, Cov(ŷ) = σ²_e M, where M is the projection matrix onto M. Now for any x, because M is symmetric and idempotent,

x′Mx = x′M′Mx = (Mx)′(Mx) = ‖Mx‖². (4.13)

Certainly ‖Mx‖² ≥ 0, but is it always strictly positive? No. If x ⊥ M, then Mx = 0_n. Thus if there are any vectors besides 0_n that are orthogonal to M, then M is not positive definite. There always are such vectors, unless M = R^n, in which case M = I_n.

If Cov(Y) = Σ is not positive definite, then there is a linear combination of the Y_i’s, x′Y, that has variance 0. That is, x′Y is essentially a constant.

The nonnegative definiteness of covariance matrices implies the covariance inequality that follows, which in turn implies that the correlation between any two random variables is in the range [−1, 1].

Lemma 2 Cauchy-Schwarz Inequality. For any two random variables Y_1 and Y_2 with finite variances,

Cov(Y_1, Y_2) ≤ √(Var(Y_1) Var(Y_2)). (4.14)

Thus, if the variances are positive,

−1 ≤ Corr(Y_1, Y_2) ≤ 1,  where  Corr(Y_1, Y_2) = Cov(Y_1, Y_2)/√(Var(Y_1) Var(Y_2)) (4.15)

is the correlation between Y_1 and Y_2.

Proof. Let Σ = Cov((Y_1, Y_2)′). Because Σ is nonnegative definite, x′Σx ≥ 0 for any x. Two such x’s are (σ_22, −σ_12)′ and (−σ_12, σ_11)′, which yield the inequalities

σ_22(σ_11σ_22 − σ_12²) ≥ 0  and  σ_11(σ_11σ_22 − σ_12²) ≥ 0, (4.16)


respectively. If either σ_11 or σ_22 is positive, then at least one of the equations shows that σ_12² ≤ σ_11σ_22, which implies (4.14). If σ_11 = σ_22 = 0, then it is easy to see that σ_12 = 0, e.g., by looking at (1, 1)Σ(1, 1)′ = 2σ_12 ≥ 0 and (1, −1)Σ(1, −1)′ = −2σ_12 ≥ 0, which imply that σ_12 = 0. 2

Back to matrices BB′. All such matrices are nonnegative definite, because x′(BB′)x = (B′x)′B′x = ‖B′x‖² ≥ 0. Thus the Σ in Definition 13 must be nonnegative definite. But again, all covariance matrices are nonnegative definite. Thus the question is, “Are all nonnegative definite symmetric matrices equal to BB′ for some B?” The answer is, “Yes.” There are many possibilities, but the next subsection shows that there always exists a lower-triangular matrix L with Σ = LL′. Note that if L works, so does LΓ for any n × n orthogonal matrix Γ.

4.1.1 Cholesky decomposition

There are thousands of matrix decompositions. One that exhibits a B as above is the Cholesky decomposition, for which the matrix B is lower triangular.

Definition 15 An n × n matrix L is lower triangular if lij = 0 for i < j:

L =
[ l_11   0     0    · · ·   0   ]
[ l_21  l_22   0    · · ·   0   ]
[ l_31  l_32  l_33  · · ·   0   ]
[  ⋮     ⋮     ⋮     ⋱      ⋮   ]
[ l_n1  l_n2  l_n3  · · ·  l_nn ]. (4.17)

Some properties:

1. The product of two lower triangular matrices is also lower triangular.

2. The lower triangular matrix L is invertible if and only if the diagonals are nonzero, l_ii ≠ 0. If it exists, the inverse is also lower triangular, and the diagonals are 1/l_ii.

The main property we need is the following.

Proposition 11 If Σ is symmetric and nonnegative definite, then there exists a lower triangular matrix with diagonals l_ii ≥ 0 such that

Σ = LL′. (4.18)

The L is unique if Σ is positive definite. In addition, Σ is positive definite if and only if the L in (4.18) has all diagonals l_ii > 0.


Proof. We will use induction on n. The first step is to prove it works for n = 1. In that case Σ = σ² and L = l, so the equation (4.18) is σ² = l², which is solved by taking l = +√σ². This l is nonnegative, and positive if and only if σ² > 0.

Now assume the decomposition works for any (n − 1) × (n − 1) symmetric nonnegative definite matrix, and write the n × n matrix Σ as

Σ = [ σ_11  σ_12′ ; σ_12  Σ_22 ], (4.19)

where Σ_22 is (n − 1) × (n − 1), and σ_12 is (n − 1) × 1. Partition the lower-triangular matrix L similarly, that is,

L = [ l_11  0′_{n−1} ; l_12  L_22 ], (4.20)

where L_22 is an (n − 1) × (n − 1) lower-triangular matrix, and l_12 is (n − 1) × 1. We want to find such an L that satisfies (4.18), which translates to the equations

σ_11 = l_11²,
σ_12 = l_11 l_12,
Σ_22 = L_22L_22′ + l_12 l_12′. (4.21)

It is easy to see that l_11 = +√σ_11. To solve for l_12, we have to know whether σ_11, hence l_11, is positive. So there are two cases.

• σ_11 > 0: Then l_11 > 0, and using the second equation in (4.21), the unique solution is l_12 = (1/l_11)σ_12 = (1/√σ_11)σ_12.

• σ_11 = 0: By the covariance inequality in Lemma 2, the σ_11 = 0 implies that the covariances between the first variable and the others are all 0, that is, σ_12 = 0_{n−1}. Thus in this case we can take l_11 = 0 and l_12 = 0_{n−1} in the second equation of (4.21), although any l_12 will work.

Now to solve for L_22. If σ_11 = 0, then as above, σ_12 = 0_{n−1}, so that by the induction hypothesis we have that there does exist a lower-triangular L_22 with Σ_22 = L_22L_22′. If σ_11 > 0, then the third line in (4.21) becomes

Σ_22 − (1/σ_11) σ_12σ_12′ = L_22L_22′. (4.22)

By the induction hypothesis, that equation can be solved if the left-hand side is nonnegative definite. For any m × n matrix B, BΣB′ is also symmetric and nonnegative definite. (Why?) Consider the (n − 1) × n matrix

B = ( −(1/σ_11) σ_12 , I_{n−1} ). (4.23)

Multiplying out shows that

BΣB′ = Σ_22 − (1/σ_11) σ_12σ_12′. (4.24)

Therefore (4.22) can be solved with a lower-triangular L_22, which by induction proves that any nonnegative definite matrix Σ can be written as in (4.18) with a lower-triangular L.

There are a couple of other parts to this proposition. We won’t give all the details, but suppose Σ is positive definite. Then σ_11 > 0, and the above proof shows that l_11 is positive and l_12 is uniquely determined. Also, the left-hand side of (4.22) will be positive definite, so that by induction the diagonals l_22, . . . , l_nn will also be positive, and the off-diagonals of L_22 will be unique. 2
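The proof is constructive, and translates almost line for line into code. Here is a minimal Python sketch of that recursion (written for clarity rather than numerical robustness; for real work one would call numpy.linalg.cholesky, which assumes positive definiteness). The tolerance tol is an arbitrary cutoff for treating σ_11 as zero.

import numpy as np

def cholesky_lower(S, tol=1e-12):
    """Return lower-triangular L with S = L L', for symmetric nonnegative definite S.
    Follows the induction in the proof: peel off the first row/column, recurse on the rest."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    L = np.zeros_like(S)
    s11 = S[0, 0]
    L[0, 0] = np.sqrt(max(s11, 0.0))                 # l_11 = +sqrt(sigma_11)
    if n == 1:
        return L
    s12 = S[1:, 0]
    if s11 > tol:
        L[1:, 0] = s12 / L[0, 0]                     # l_12 = sigma_12 / sqrt(sigma_11)
        S22 = S[1:, 1:] - np.outer(s12, s12) / s11   # left-hand side of (4.22)
    else:
        L[1:, 0] = 0.0                               # sigma_11 = 0 forces sigma_12 = 0
        S22 = S[1:, 1:]
    L[1:, 1:] = cholesky_lower(S22, tol)             # induction step
    return L

Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
L = cholesky_lower(Sigma)
print(L)            # compare with np.linalg.cholesky(Sigma)
print(L @ L.T)      # recovers Sigma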

4.2 Some properties of the multivariate normal

The multivariate normal has many properties useful for analyzing linear models. Three of them follow.

Proposition 12 1. Affine transformations. If W ∼ N_n(µ, Σ), c is m × 1 and D is m × n, then

c + DW ∼ N_m(c + Dµ, DΣD′). (4.25)

2. Marginals. Suppose W ∼ N_n(µ, Σ) is partitioned as

W = (W_1′, W_2′)′, (4.26)

where W_1 is n_1 × 1 and W_2 is n_2 × 1, and the parameters are similarly partitioned:

µ = (µ_1′, µ_2′)′  and  Σ = [ Σ_11  Σ_12 ; Σ_21  Σ_22 ], (4.27)

where µ_i is n_i × 1 and Σ_ij is n_i × n_j. Then

W_1 ∼ N_{n_1}(µ_1, Σ_11), (4.28)

and similarly for W_2. In particular, the individual components have W_i ∼ N(µ_i, σ_ii), where σ_ii is the ith diagonal of Σ.

3. Independence. Partitioning W as in part 2, W_1 and W_2 are independent if and only if Σ_12 = 0. In particular, W_i and W_j are independent if and only if σ_ij = 0.

Part 1 follows from the Definition 13 of multivariate normality. That is, if W ∼ N_n(µ, Σ), then W = µ + BZ where BB′ = Σ and Z is a vector of independent standard normals. Thus

c + DW = (c + Dµ) + (DB)Z, (4.29)

which by definition is N_m(c + Dµ, (DB)(DB)′), and (DB)(DB)′ = DBB′D′ = DΣD′. Part 2 follows from part 1 by taking c = 0_{n_1} and D = (I_{n_1}, 0), where the 0 is n_1 × n_2.

To see part 3, let B_1 and B_2 be matrices so that B_1B_1′ = Σ_11 and B_2B_2′ = Σ_22. Consider

W* = (W*_1′, W*_2′)′ = (µ_1′, µ_2′)′ + [ B_1  0 ; 0  B_2 ] (Z_1′, Z_2′)′, (4.30)

where the Z = (Z_1′, Z_2′)′ is a vector of independent standard normals. Now W*_1 = µ_1 + B_1Z_1 and W*_2 = µ_2 + B_2Z_2 are independent because Z_1 and Z_2 are. Also, W* ∼ N_n(µ, Σ), where

Σ = [ B_1  0 ; 0  B_2 ] [ B_1  0 ; 0  B_2 ]′ = [ B_1B_1′  0 ; 0  B_2B_2′ ] = [ Σ_11  0 ; 0  Σ_22 ]. (4.31)

That is, W* has the same distribution as W when Σ_12 = 0, hence W_1 and W_2 (as W*_1 and W*_2) are independent.

4.3 Distribution of linear estimates

We now add the normality assumption to the residuals, which allows development of more distributional results. As for all assumptions, in practice these are wrong, hence one should check to see if they are at least reasonable. The assumption is that the e_i’s are independent N(0, σ²_e), which means that

y = Xβ + e,  e ∼ N_n(0_n, σ²_e I_n), (4.32)

or, equivalently,

y ∼ N_n(Xβ, σ²_e I_n). (4.33)

The distribution of linear estimates then follows easily from previous work. Thus if λ′β is estimable, and a′y is an unbiased estimate, then because a′y is an affine transformation of a multivariate normal, part 1 of Proposition 12 shows that

a′y ∼ N(λ′β, σ²_e ‖a‖²). (4.34)

More generally, if X′X is invertible, then β̂ = (X′X)⁻¹X′y is also an affine transformation of y, hence

β̂ ∼ N_p(β, σ²_e (X′X)⁻¹). (4.35)

Projections are linear transformations as well, ŷ = My and ê = (I_n − M)y, hence multivariate normal. An important result is that two projections on orthogonal spaces, such as ŷ and ê, are independent. To show this result, consider the 2n × 1 vector that strings out the two projections, (ŷ′, ê′)′, which is a linear transformation of y:

( ŷ ; ê ) = [ M ; (I_n − M) ] y. (4.36)


Then

Cov(( ŷ ; ê )) = [ M ; (I_n − M) ] σ²_e I_n [ M ; (I_n − M) ]′
 = σ²_e [ MM′   M(I_n − M)′ ; (I_n − M)M′   (I_n − M)(I_n − M)′ ]
 = σ²_e [ M   0 ; 0   I_n − M ], (4.37)

because M is idempotent and symmetric (so, e.g., M(I_n − M)′ = M − MM = M − M = 0). Thus the covariance between ŷ and ê is 0, which means that ŷ and ê are independent by part 3 of Proposition 12. (If the residuals are not normal, then these projections will not be independent in general, but just uncorrelated.)

We will need the next fact for confidence intervals and confidence regions.

Proposition 13 Under model (4.32), if X′X is invertible, β̂ and ê are independent.

The proof is easy once you realize that β̂ is a function of ŷ, which follows either by recalling that β̂ is found by satisfying ŷ = Xβ̂, or by using the formula β̂ = (X′X)⁻¹X′y, and noting that X′M = X′, hence (X′X)⁻¹X′y = (X′X)⁻¹X′My = (X′X)⁻¹X′ŷ, or just writing it out:

(X′X)⁻¹X′ŷ = (X′X)⁻¹X′(X(X′X)⁻¹X′)y = (X′X)⁻¹X′y = β̂. (4.38)

4.4 Chi-squares

Under the normality assumption, if a′y is an unbiased estimate of λ′β,

(a′y − λ′β)/(σ_e ‖a‖) ∼ N(0, 1). (4.39)

To derive an exact confidence interval for λ′β, start with

P[ −z_{α/2} < (a′y − λ′β)/(σ_e ‖a‖) < z_{α/2} ] = 1 − α, (4.40)

where z_{α/2} is the upper (α/2)th cutoff point for the N(0, 1), i.e., P[−z_{α/2} < N(0, 1) < z_{α/2}] = 1 − α. Then rewriting the inequalities in (4.40) so that λ′β is in the center shows that an exact 100 × (1 − α)% confidence interval for λ′β is

a′y ± z_{α/2} σ_e ‖a‖. (4.41)

Unfortunately, the σ_e is still unknown, so we must estimate it, which then destroys the exact normality in (4.39). It turns out that Student’s t is the correct way to adjust for this estimation, but first we need to obtain the distribution of σ̂²_e. Which brings us to the χ² (chi-squared) distribution.


Definition 16 If Z ∼ N_ν(0_ν, I_ν), then ‖Z‖² is distributed as a chi-squared random variable with ν degrees of freedom, written

‖Z‖² ∼ χ²_ν. (4.42)

A more familiar but equivalent way to write the definition is Z_1² + · · · + Z_ν² ∼ χ²_ν, where the Z_i’s are independent N(0, 1)’s. Because E[Z_i] = 0, E[Z_i²] = Var(Z_i) = 1, so that E[χ²_ν] = ν.

The chi-squared distribution is commonly used for the distributions of quadratic forms of normals, where a quadratic form is (y − c)′D⁻¹(y − c) for some vector c and symmetric matrix D. E.g., the ‖Z‖² in (4.42) is a quadratic form with c = 0_ν and D = I_ν. Not all quadratic forms are chi-squared, by any means, but two popular ones are given next. The first is useful for simultaneous confidence intervals on λ′β’s, and the second for the squared norms of projections, such as ‖ê‖².

Proposition 14 1. If Y ∼ N_n(µ, Σ), and Σ is invertible, then

(Y − µ)′Σ⁻¹(Y − µ) ∼ χ²_n. (4.43)

2. If W ∼ N_n(0_n, M), where M is (symmetric and) idempotent, then

‖W‖² ∼ χ²_{trace(M)}. (4.44)

Proof. 1. Use the Cholesky decomposition, Proposition 11, to find lower triangular L such that Σ = LL′. Because Σ is positive definite, the diagonals of L are positive, hence L⁻¹ exists. By part 1 of Proposition 12,

W = L⁻¹Y − L⁻¹µ ∼ N_n(L⁻¹µ − L⁻¹µ, L⁻¹LL′(L⁻¹)′) = N_n(0_n, I_n), (4.45)

so that

‖W‖² = W′W = (Y − µ)′(L⁻¹)′L⁻¹(Y − µ) = (Y − µ)′Σ⁻¹(Y − µ), (4.46)

because (L⁻¹)′L⁻¹ = (LL′)⁻¹ = Σ⁻¹. But by Definition 16 of the chi-square, ‖W‖² ∼ χ²_n, hence (4.43) holds.

2. If M is symmetric and idempotent, then it is a projection matrix for some vector space M (= C(M)). Suppose the rank of M is p, and let x_1, . . . , x_p be an orthonormal basis, which is an orthogonal basis where each vector has length 1, ‖x_i‖ = 1. (There always is one: take any basis, and use Gram–Schmidt to obtain an orthogonal basis. Then divide each vector by its length.) Then

X ≡ (x_1, . . . , x_p) is n × p, and X′X = I_p, (4.47)

because x_i′x_j = 0 if i ≠ j and x_i′x_i = 1, hence

M = X(X′X)⁻¹X′ = XX′. (4.48)


Now

Z = X′W ∼ N_p(0_p, X′MX) = N_p(0_p, X′XX′X) = N_p(0_p, I_p) =⇒ ‖Z‖² ∼ χ²_p. (4.49)

Equation (4.48), and the fact that M is idempotent, shows that

‖Z‖² = ‖X′W‖² = W′XX′W = W′MW = (MW)′(MW). (4.50)

Finally, MW ∼ N_n(0_n, M), so MW and W have the same distribution, and (4.44) holds, because p = trace(M). 2

4.4.1 Distribution of ‖ê‖²

In the model (4.32), we have that ê ∼ N_n(0_n, σ²_e(I_n − M)), hence (1/σ_e)ê ∼ N_n(0_n, I_n − M). Thus by (4.44), since trace(I_n − M) = n − p,

(1/σ²_e) ‖ê‖² ∼ χ²_{n−p}, (4.51)

hence

‖ê‖² ∼ σ²_e χ²_{n−p}, (4.52)

and

σ̂²_e = ‖ê‖²/(n − p) ∼ (σ²_e/(n − p)) χ²_{n−p}. (4.53)
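A quick Monte Carlo check of (4.51)–(4.53) is below: a minimal Python sketch with a made-up design matrix, just to see that ‖ê‖²/σ²_e behaves like a χ² on n − p degrees of freedom (mean n − p, variance 2(n − p)). All numbers here are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
n, p, sigma_e = 30, 4, 2.0
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(p - 1)])
M = X @ np.linalg.inv(X.T @ X) @ X.T
beta = rng.normal(size=p)

stats = []
for _ in range(20_000):
    y = X @ beta + rng.normal(scale=sigma_e, size=n)
    ehat = y - M @ y
    stats.append((ehat @ ehat) / sigma_e**2)     # should be chi^2_{n-p}

stats = np.array(stats)
print(stats.mean(), stats.var())   # approximately n-p = 26 and 2(n-p) = 52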

4.5 Exact confidence intervals: Student’s t

We now obtain the distribution of (4.39) with σ²_e replaced by its estimate. First, we need to define Student’s t.

Definition 17 If Z ∼ N(0, 1) and U ∼ χ²_ν, and Z and U are independent, then

T ≡ Z/√(U/ν) (4.54)

has a Student’s t distribution on ν degrees of freedom, written

T ∼ t_ν. (4.55)

Proposition 15 Under the model (4.32) with σ²_e > 0, if λ′β is estimable and a′y is the least squares estimate, then

(a′y − λ′β)/(σ̂_e ‖a‖) ∼ t_{n−p}. (4.56)


Now is a good time to remind you that if X′X is invertible, then a′y = λ′β̂ and ‖a‖² = λ′(X′X)⁻¹λ.

Proof. We know that a′ŷ = a′y ∼ N(λ′β, σ²_e ‖a‖²), hence as in (4.39),

Z ≡ (a′y − λ′β)/(σ_e ‖a‖) ∼ N(0, 1). (4.57)

From (4.51),

U ≡ (1/σ²_e) ‖ê‖² ∼ χ²_{n−p}. (4.58)

Furthermore, ŷ and ê are independent by (4.37), hence the Z in (4.57) and U in (4.58) are independent. Plugging them into the formula for T in (4.54) yields

T = [(a′y − λ′β)/(σ_e ‖a‖)] / √[(1/σ²_e)(1/(n − p))‖ê‖²] = (a′y − λ′β)/(σ̂_e ‖a‖). (4.59)

This statistic is that in (4.56), hence by definition (4.55), it is t_ν with ν = n − p. 2

To obtain a confidence interval for λ′β, proceed as in (4.40) and (4.41), but use the Student’s t instead of the Normal, that is, an exact 100 × (1 − α)% confidence interval is

a′y ± t_{ν,α/2} σ̂_e ‖a‖, (4.60)

where the t_{ν,α/2} is found in a t table so that

P[−t_{ν,α/2} < t_ν < t_{ν,α/2}] = 1 − α. (4.61)

Example. Consider the contrast (α_1 + α_2)/2 − α_3 from the Leprosy example, using the covariate, in Section 3.4.2. The least squares estimate is −3.392, and se = √(σ̂²_e λ′(X′X)⁻¹λ) = 1.641. With the covariate, p = 4, and n = 30, hence ν = n − p = 26. Finding a t-table, t_{26,0.025} = 2.056, hence the 95% confidence interval is

(−3.392 ± 2.056 × 1.641) = (−6.77, −0.02). (4.62)

This interval just barely misses 0, so the effectiveness of the drugs is marginally significant.

Note. This confidence interval is exact if the assumptions are exact. Because we do not really believe that e or y are multivariate normal, in reality even the t interval is approximate. But in general, if the data are not too skewed and do not have large outliers, the approximation is fairly good.
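The t-interval in (4.60) is one line of code once the estimate, its standard error, and the degrees of freedom are in hand. A minimal sketch using the numbers quoted above (scipy is assumed to be available):

from scipy.stats import t

est, se, nu = -3.392, 1.641, 26               # estimate, standard error, n - p
tcut = t.ppf(1 - 0.05/2, nu)                  # t_{26, 0.025}, about 2.056
print(tcut, (est - tcut*se, est + tcut*se))   # 95% interval, roughly (-6.77, -0.02)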


Chapter 5

Nested Models

5.1 Introduction

The previous chapters introduced inference on single parameters or linear combinations λ′β. Analysis of variance is often concerned with combined effects, e.g., the treatment effect in the leprosy example (1.26), or the sun/shade effect, fruit effect, or interaction effect in the two-way ANOVA example (1.23). Such effects are often not representable using one parameter, but rather by several parameters, or more generally, by nested vector spaces.

For example, consider the one-way ANOVA model y_{ij} = µ + α_i + e_{ij}, i = 1, 2, 3, j = 1, 2:

(y_11, y_12, y_21, y_22, y_31, y_32)′ = Xβ + e =
[ 1 1 0 0 ]
[ 1 1 0 0 ]
[ 1 0 1 0 ]
[ 1 0 1 0 ]
[ 1 0 0 1 ]
[ 1 0 0 1 ]
 (µ, α_1, α_2, α_3)′ + e. (5.1)

We now know how to assess single contrasts, e.g., α_1 − α_2 or (α_1 + α_2)/2 − α_3, but one may wish to determine whether there is any difference among the groups at all. If there is no difference, then the six observations are as from one large group, in which case the model would be

y = 1_6 µ + e. (5.2)

Letting M_A = C(X) be the vector space for model (5.1) and M_0 = span{1_6} be that for model (5.2), we have that M_0 ⊂ M_A. Such spaces are said to be nested. Note that model (5.2) can be obtained from model (5.1) by setting some parameters to zero, α_1 = α_2 = α_3 = 0. It is not necessary that that be the case, e.g., we could have represented (5.1) without the 1_6 vector,

(y_11, y_12, y_21, y_22, y_31, y_32)′ = X*β* + e =
[ 1 0 0 ]
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 1 0 ]
[ 0 0 1 ]
[ 0 0 1 ]
 (β_1, β_2, β_3)′ + e, (5.3)

in which case we still have M_0 ⊂ M_A = C(X*), but setting any of the β_i’s to zero would not yield M_0. We could do it by setting β_1 = β_2 = β_3, though.

Using the hypothesis testing formulation, we are interested in testing the smaller model as the null hypothesis, and the larger model as the alternative. Thus with µ = E[y], we are testing

H0 : µ ∈ M0 versus HA : µ ∈ MA. (5.4)

(Formally, we should not let the two hypotheses overlap, so that HA should be µ ∈ MA −M0.)

The ANOVA approach to comparing two nested models is to consider the squared lengths of the projections onto M_0 and M_A, the idea being that the length of a projection represents the variation in the data y that is “explained” by the vector space. The basic decomposition of the squared length of y based on vector space M_0 is

‖y‖² = ‖ŷ_0‖² + ‖y − ŷ_0‖², (5.5)

where ŷ_0 is the projection of y onto M_0. (Recall that this decomposition is due to ŷ_0 and y − ŷ_0 being orthogonal. It is the Pythagorean Theorem.) The equation in (5.5) is expressed as

Total variation = Variation due to M_0 + Variation unexplained by M_0. (5.6)

A similar decomposition using the projection onto M_A, ŷ_A, is

‖y‖² = ‖ŷ_A‖² + ‖y − ŷ_A‖²;
Total variation = Variation due to M_A + Variation unexplained by M_A. (5.7)

The explanatory power of the alternative model M_A over the null model M_0 can be measured in a number of ways, e.g., by comparing the variation due to the two models, or by comparing the variation unexplained by the two models. The most common measures start with the variation unexplained by the null model, and look at how much of that is explained by the alternative. That is, we subtract the variation due to the null model from the equations (5.5) and (5.7):

‖y‖² − ‖ŷ_0‖² = ‖y − ŷ_0‖²  and
‖y‖² − ‖ŷ_0‖² = ‖ŷ_A‖² − ‖ŷ_0‖² + ‖y − ŷ_A‖², (5.8)


yielding

Variation unexplained by M_0 = Variation explained by M_A but not by M_0
 + Variation unexplained by M_A. (5.9)

The larger the “Variation explained by M_A but not by M_0”, and the smaller the “Variation unexplained by M_A”, the more evidence there is that the more complicated model M_A is better than the simpler model M_0. These quantities need to be normalized somehow. One popular way is to take the ratio

R² ≡ (Variation explained by M_A but not by M_0)/(Variation unexplained by M_0) = (‖ŷ_A‖² − ‖ŷ_0‖²)/‖y − ŷ_0‖². (5.10)

This quantity is sometimes called the coefficient of determination or the square of the multiple correlation coefficient. Usually it is called R-squared.

The squaredness suggests that R² must be nonnegative, and the “correlation” in the term suggests it must be no larger than 1. Both suggestions are true. The next section looks more closely at these sums of squares.

5.1.1 Note on calculating sums of squares

The sums of squares as in (5.5) for a generic model y = Xβ + e can be obtained by finding the ŷ explicitly, then squaring the elements and summing. When X′X is invertible, there are more efficient ways, although they may not be as stable numerically. That is, once one has β̂ and X′X calculated, it is simple to use

‖ŷ‖² = ŷ′ŷ = (Xβ̂)′Xβ̂ = β̂′X′Xβ̂. (5.11)

Then

‖y − ŷ‖² = ‖y‖² − β̂′X′Xβ̂. (5.12)

These formulas are especially useful if p, the dimension of the β vector, is small relative to n.

As a special case, suppose X = 1_n. Then β̂ = ȳ and X′X = n, so that (5.12) is

‖y − ŷ‖² = ‖y‖² − ȳ(n)ȳ, (5.13)

i.e.,

∑_{i=1}^n (y_i − ȳ)² = ∑_{i=1}^n y_i² − nȳ², (5.14)

the familiar “machine formula” used for calculating the sample standard deviation.

These days, typical statistical programs use efficient and accurate routines for calculating linear model quantities, so that the efficiency of formula (5.12) is of minor importance to us. Conceptually, it comes in handy, though.
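As a small check of (5.11) and (5.12), the following Python sketch compares the direct computation of ‖y − ŷ‖² with the shortcut ‖y‖² − β̂′X′Xβ̂ on made-up data (numpy assumed available).

import numpy as np

rng = np.random.default_rng(4)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([3.0, 1.5]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta_hat

direct   = np.sum((y - yhat)**2)                     # ||y - yhat||^2
shortcut = y @ y - beta_hat @ (X.T @ X) @ beta_hat   # (5.12)
print(direct, shortcut)                              # the two agree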


5.2 Orthogonal complements

We are interested in the part of ŷ_A that is not part of M_0, e.g., the difference in projections ŷ_A − ŷ_0. It turns out that this is also a projection of y onto some subspace.

Definition 18 Suppose that M_0 ⊂ M_A. Then the orthogonal complement of M_0 relative to M_A, denoted M_{A·0}, is the set of vectors in M_A that are orthogonal to M_0, that is,

M_{A·0} = {x ∈ M_A | x ⊥ M_0} = M_A ∩ M_0⊥. (5.15)

It is not hard to prove that M_{A·0} is a vector space. In fact, any intersection of vector spaces is also a vector space. The projection onto M_{A·0} is just the difference of the projections on the two spaces.

Proposition 16 If M_0 ⊂ M_A, then the projection of y onto M_{A·0} is ŷ_A − ŷ_0, hence the projection matrix M_{A·0} is M_A − M_0, where ŷ_0 and M_0 (ŷ_A and M_A) are the projection onto M_0 (M_A) and corresponding projection matrix.

Proof. To see that ŷ_A − ŷ_0 ∈ M_{A·0}, note first that ŷ_A − ŷ_0 ∈ M_A because both individual projections are in M_A. Second, to see that ŷ_A − ŷ_0 ∈ M_0⊥, note that ŷ_A − ŷ_0 = (y − ŷ_0) − (y − ŷ_A). Then by the definition of projection, (y − ŷ_0) ∈ M_0⊥ and (y − ŷ_A) ∈ M_A⊥. Because M_0 ⊂ M_A, M_A⊥ ⊂ M_0⊥ (Why?), hence (y − ŷ_A) ∈ M_0⊥, so that ŷ_A − ŷ_0 ∈ M_0⊥.

Next, we need to show y − (ŷ_A − ŷ_0) ∈ M_{A·0}⊥. Write y − (ŷ_A − ŷ_0) = (y − ŷ_A) + ŷ_0. (y − ŷ_A) ∈ M_A⊥ ⊂ M_{A·0}⊥, because M_{A·0} ⊂ M_A. Also, if x ∈ M_{A·0}, then x ⊥ M_0, hence x ⊥ ŷ_0, which means that ŷ_0 ∈ M_{A·0}⊥. Thus y − (ŷ_A − ŷ_0) ∈ M_{A·0}⊥. 2

Next, we need to show y − (yA− y

0) ∈ M⊥

A·0. Write y − (yA− y

0) = (y − y

A) − y

0.

(y − yA) ∈ M⊥

A ⊂ M⊥A·0, because MA·0 ⊂ MA. Also, if x ∈ MA·0, then x ⊥ M0, hence

x ⊥ y0, which means that y

0∈ M⊥

A·0. Thus y − (yA− y

0) ∈ M⊥

A·0. 2

Write

y − y0

= (yA− y

0) + (y − y

A). (5.16)

Because yA− y

0∈ MA·0 and y − y

A⊥ MA, they are orthogonal, hence

‖y − y0‖2 = ‖y

A− y

0‖2 + ‖y − y

A‖2. (5.17)

Thus by (5.8), ‖yA‖2 − ‖y

0‖2 = ‖y

A− y

0‖2, and by (5.10),

R2 =‖y

A− y

0‖2

‖y − y0‖2

, (5.18)

Now (5.17) also shows that ‖yA− y

0‖2 ≤ ‖y − y

0‖2, hence indeed we have that 0 ≤ R2 ≤ 1.

Page 57: Anova

5.2. ORTHOGONAL COMPLEMENTS 57

5.2.1 Example

Consider the Leprosy example, with the covariate, so that the model is as in (2.61),

y11

y12...

y1,10

−y21

y22...

y2,10

−y31

y32...

y3,10

= XAβA

+ e =

1 1 0 0 z11

1 1 0 0 z12...

......

......

1 1 0 0 z1,10

− − − − −1 0 1 0 z21

1 0 1 0 z22...

......

......

1 0 1 0 z2,10

− − − − −1 0 0 1 z31

1 0 0 1 z32...

......

......

1 0 0 1 z3,10

µα1

α2

α3

γ

+ e (5.19)

The large model is then M_A = C(X_A). Consider the smaller model to be that without treatment effect. It can be obtained by setting α_1 = α_2 = α_3 = 0 (or equal to any constant), so that M_0 = span{1_30, z}. From (2.67) we know that ŷ_A has elements

ŷ_{A,ij} = ȳ_{i·} + 0.987(z_{ij} − z̄_{i·}). (5.20)

Notice that model M_0 is just a simple linear regression model, y_{ij} = α + βz_{ij} + e_{ij}, so we know how to estimate the coefficients. They turn out to be α̂ = −3.886 and β̂ = 1.098, so

ŷ_{0,ij} = −3.886 + 1.098 z_{ij}. (5.21)

To find the decompositions, we first calculate ‖y‖² = 3161. For model M_A, we would like to use formula (5.12), but need X′X invertible, which we do by dropping the 1_30 vector from X_A, so that we have the model (3.38). From (2.65) and (2.68), we can calculate the β̂* (without the µ) to be

β̂* = (−3.881, −3.772, −0.435, 0.987)′. (5.22)

The X*′X* matrix is given in (3.39), hence

‖ŷ_A‖² = β̂*′ X*′X* β̂* = (−3.881, −3.772, −0.435, 0.987)
[ 10    0    0    93 ]
[  0   10    0   100 ]
[  0    0   10   129 ]
[ 93  100  129  4122 ]
 (−3.881, −3.772, −0.435, 0.987)′ = 2743.80. (5.23)


Then ‖y − ŷ_A‖² is the difference, 3161 − 2743.80:

‖y‖² = ‖ŷ_A‖² + ‖y − ŷ_A‖²;
3161 = 2743.80 + 417.20. (5.24)

The decomposition for M_0 is similar. We can find X_0′X_0, where X_0 = (1_30, z), from X*′X* by adding the first three diagonals, and adding the first three elements of the last column:

X_0′X_0 = [ 30  322 ; 322  4122 ]. (5.25)

Then

‖ŷ_0‖² = (−3.886, 1.098) [ 30  322 ; 322  4122 ] (−3.886, 1.098)′ = 2675.24, (5.26)

hence

‖y‖² = ‖ŷ_0‖² + ‖y − ŷ_0‖²;
3161 = 2675.24 + 485.76. (5.27)

The decomposition of interest then follows easily by subtraction:

‖y − ŷ_0‖² = ‖ŷ_A − ŷ_0‖² + ‖y − ŷ_A‖²;
485.76 = 68.56 + 417.20. (5.28)

The R² is then

R² = ‖ŷ_A − ŷ_0‖² / ‖y − ŷ_0‖² = 68.56/485.76 = 0.141. (5.29)

That says that about 14% of the variation that the before measurements fail to explain is explained by the treatments. It is fairly small, which means there is still a lot of variation in the data not explained by the difference between the treatments. It may be that there are other variables that would be relevant, such as age, sex, weight, etc., or that bacterial counts are inherently variable.
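The whole decomposition is easy to automate: fit the two nested models, take the fitted vectors, and form the R² of (5.18). Below is a minimal Python sketch; the (z, y) arrays are placeholders rather than the actual leprosy scores, so the value printed will not be 0.141, but the computation is the same.

import numpy as np

# Placeholder data with three groups of 10 (NOT the actual leprosy data)
rng = np.random.default_rng(5)
z = rng.integers(1, 20, size=30).astype(float)
y = 1.0 * z + np.repeat([-4.0, -3.5, 0.0], 10) + rng.normal(scale=4, size=30)

G = np.repeat(np.eye(3), 10, axis=0)
XA = np.column_stack([G, z])                 # alternative model: groups + covariate
X0 = np.column_stack([np.ones(30), z])       # null model: intercept + covariate only

def fitted(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return X @ b

yA, y0 = fitted(XA, y), fitted(X0, y)
R2 = np.sum((yA - y0)**2) / np.sum((y - y0)**2)   # (5.18)
print(R2)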

5.3 Mean squares

Another popular ratio is motivated by looking at the expected values of the sums of squares under the two models. We will concentrate on the two difference vectors that decompose y − ŷ_0 in (5.17), ŷ_A − ŷ_0 and y − ŷ_A. There are two models under consideration, the null M_0 and the alternative M_A. Thus there are two possible distributions for y:

M_0 : y ∼ N_n(µ_0, σ²_e I_n) for some µ_0 ∈ M_0;
M_A : y ∼ N_n(µ_A, σ²_e I_n) for some µ_A ∈ M_A. (5.30)

The two difference vectors are independent under either model by the next proposition, because they are the projections on two orthogonal spaces, M_{A·0} and M_A⊥.


Proposition 17 Suppose y ∼ N_n(µ, σ²_e I_n), and M_1 and M_2 are orthogonal vector spaces. Then ŷ_1 and ŷ_2, the projections of y onto M_1 and M_2, respectively, are independent.

Proof. Letting M_1 and M_2 be the respective projection matrices, we have that M_1M_2 = 0, because for any x, M_2x ∈ M_2 so is orthogonal to M_1, hence M_1M_2x = 0. Likewise, M_2M_1 = 0. Thus

Cov(( ŷ_1 ; ŷ_2 )) = Cov([ M_1 ; M_2 ] y) = σ²_e [ M_1 ; M_2 ][ M_1 ; M_2 ]′ = σ²_e [ M_1  0 ; 0  M_2 ]. (5.31)

The projections are thus independent because they are uncorrelated and multivariate normal. 2

The means of the vectors may depend on the model. Below are the distributions under the two models:

                M_0                                            M_A
ŷ_A − ŷ_0    N_n((M_A − M_0)µ_0, σ²_e(M_A − M_0))    N_n((M_A − M_0)µ_A, σ²_e(M_A − M_0))
y − ŷ_A      N_n((I_n − M_A)µ_0, σ²_e(I_n − M_A))    N_n((I_n − M_A)µ_A, σ²_e(I_n − M_A))
 (5.32)

Three of those means are zero: Because µ_0 is in M_0, it is also in M_A, hence M_0µ_0 = µ_0 and M_Aµ_0 = µ_0. Thus (M_A − M_0)µ_0 = 0_n and (I_n − M_A)µ_0 = 0_n. Also, M_Aµ_A = µ_A, so that (I_n − M_A)µ_A = 0_n. Thus all but the upper right have zero mean:

                M_0                               M_A
ŷ_A − ŷ_0    N_n(0_n, σ²_e(M_A − M_0))    N_n((M_A − M_0)µ_A, σ²_e(M_A − M_0))
y − ŷ_A      N_n(0_n, σ²_e(I_n − M_A))    N_n(0_n, σ²_e(I_n − M_A))
 (5.33)

Notice that no matter which model is true, y − ŷ_A has the same distribution, with mean 0_n. On the other hand, ŷ_A − ŷ_0 has zero mean if the null model is true, but (potentially) nonzero mean if the alternative model is true. Thus this vector contains the information to help decide whether the null hypothesis is true.

We are actually interested in the sums of squares, so consider the expected values of the squared lengths of those vectors. For any random variable W_i, E[W_i²] = Var[W_i] + E[W_i]² (Why?), hence for the vector W,

E[‖W‖²] = ∑_{i=1}^n E[W_i²] = ∑_{i=1}^n Var[W_i] + ∑_{i=1}^n E[W_i]² = trace(Cov(W)) + E[W]′E[W]. (5.34)


For the covariance matrices in the table (5.33), we know trace(M_A) = p_A and trace(M_0) = p_0, where p_A and p_0 are the ranks of the respective vector spaces. Thus we have

Expected sum of squares               M_0                M_A
ESS_{A·0} ≡ E[‖ŷ_A − ŷ_0‖²]    σ²_e(p_A − p_0)    σ²_e(p_A − p_0) + µ_A′(M_A − M_0)µ_A
ESSE ≡ E[‖y − ŷ_A‖²]           σ²_e(n − p_A)      σ²_e(n − p_A)
 (5.35)

The “ESS” means expected sum of squares, and the “ESSE” means expected sum of squares of errors.

Now for the question: If the null hypothesis is not true, then E[‖ŷ_A − ŷ_0‖²] will be relatively large. How large is large? One approach is the R² idea from the previous section. Another is to notice that if the null hypothesis is true, how large E[‖ŷ_A − ŷ_0‖²] is depends on σ²_e (and p_A − p_0). Thus we could try comparing those. The key is to look at expected mean squares, which are obtained from table (5.35) by dividing by the degrees of freedom:

Expected mean squares                            M_0     M_A
EMS_{A·0} ≡ E[‖ŷ_A − ŷ_0‖²]/(p_A − p_0)    σ²_e    σ²_e + µ_A′(M_A − M_0)µ_A/(p_A − p_0)
EMSE ≡ E[‖y − ŷ_A‖²]/(n − p_A)             σ²_e    σ²_e
 (5.36)

The EMSE means “expected mean square error”. One further step simplifies even more: Take the ratio of the expected mean squares:

Ratio of expected mean squares     M_0     M_A
EMS_{A·0}/EMSE                     1       1 + µ_A′(M_A − M_0)µ_A / (σ²_e (p_A − p_0))
 (5.37)

Now we have (sort of) answered the question: How large is large? Larger than 1. That is, this ratio of expected mean squares is 1 if the null hypothesis is true, and larger than 1 if the null hypothesis is not true. How much larger is semi-complicated to say, but it depends on µ_A and σ²_e.

That is fine, but we need to estimate this ratio. We will use the analogous ratio of mean squares, that is, just remove the “E[ · ]”’s. This ratio is called the F ratio, named after R. A. Fisher:

Definition 19 Given the above set up, the F ratio is

F = MS_{A·0}/MSE = [‖ŷ_A − ŷ_0‖²/(p_A − p_0)] / [‖y − ŷ_A‖²/(n − p_A)]. (5.38)

Notice that the EMSE is actually σ²_e for the model M_A.

The larger F, the more evidence we have for rejecting the null hypothesis in favor of the alternative. The next section will deal with exactly how large is large, again. But first we continue with the example in Section 5.2.1. From (5.28) we obtain the sums of squares. The n = 30, p_0 = 2 and p_A = 4, hence

MS_{A·0} = 68.56/(4 − 2) = 34.28,  MSE = 417.20/(30 − 4) = 16.05,  F = 34.28/16.05 = 2.14. (5.39)


That F is not much larger than 1, so there does not seem to be a very significant treatment effect. The next section shows how to calculate the significance level.

For both measures R² and F, the larger, the more one favors the alternative model. These measures are in fact equivalent in the sense of being monotone functions of each other:

R² = ‖ŷ_A − ŷ_0‖² / ‖y − ŷ_0‖² = ‖ŷ_A − ŷ_0‖² / (‖ŷ_A − ŷ_0‖² + ‖y − ŷ_A‖²), (5.40)

so that

R²/(1 − R²) = ‖ŷ_A − ŷ_0‖² / ‖y − ŷ_A‖², (5.41)

and

[(n − p_A)/(p_A − p_0)] · R²/(1 − R²) = [‖ŷ_A − ŷ_0‖²/(p_A − p_0)] / [‖y − ŷ_A‖²/(n − p_A)] = F. (5.42)

5.4 The F distribution

Consider the F statistic in (5.42). We know that the numerator and denominator are independent, by (5.31). Furthermore, under the null model M_0 in (5.33), from Proposition 14, Equation (4.44), we have as in Section 4.4.1 that

‖ŷ_A − ŷ_0‖² ∼ σ²_e χ²_{p_A − p_0}  and  ‖y − ŷ_A‖² ∼ σ²_e χ²_{n − p_A}, (5.43)

so that the distribution of F can be given as

F ∼ [σ²_e χ²_{p_A − p_0}/(p_A − p_0)] / [σ²_e χ²_{n − p_A}/(n − p_A)] = [χ²_{p_A − p_0}/(p_A − p_0)] / [χ²_{n − p_A}/(n − p_A)], (5.44)

where the χ²’s are independent. In fact, that is the definition of the F distribution.

Definition 20 If U_1 ∼ χ²_{ν_1} and U_2 ∼ χ²_{ν_2}, and U_1 and U_2 are independent, then

F ≡ (U_1/ν_1)/(U_2/ν_2) (5.45)

has an F distribution with degrees of freedom ν_1 and ν_2. It is written

F ∼ F_{ν_1,ν_2}. (5.46)

Then, according to the definition, when M_0 is the true model,

F = MS_{A·0}/MSE = [‖ŷ_A − ŷ_0‖²/(p_A − p_0)] / [‖y − ŷ_A‖²/(n − p_A)] ∼ F_{p_A − p_0, n − p_A}. (5.47)


Note. When M_A is true, then the numerator and denominator are still independent, and the denominator is still χ²_{n − p_A}/(n − p_A), but the SS_{A·0} is no longer χ². In fact, it is noncentral chi-squared, and the F is noncentral F. We will not deal with these distributions, except to say that they are “larger” than their regular (“central”) cousins.

We can now formally test the hypotheses

H_0 : µ ∈ M_0 versus H_A : µ ∈ M_A, (5.48)

based on y ∼ N_n(µ, σ²_e I_n). (We assume σ²_e > 0. Otherwise, y = µ, so it is easy to test the hypotheses with no error.) For level α, reject the null hypothesis when

F > F_{ν_1,ν_2,α},  ν_1 = p_A − p_0, ν_2 = n − p_A, (5.49)

where F is in (5.47), and F_{ν_1,ν_2,α} is the upper α cutoff point of the F distribution:

P[F_{ν_1,ν_2} > F_{ν_1,ν_2,α}] = α. (5.50)

There are tables of these cutoff points, and most statistical software will produce them.

Example. Continuing the leprosy example, from (5.39) we have that F = 2.14. Also, p_A − p_0 = 2 and n − p_A = 26, hence we reject the null hypothesis (M_0) of no treatment effect at the α = 0.05 level if

F > F_{2,26,0.05} = 3.369. (5.51)

Because 2.14 is less than 3.369, we cannot reject the null hypothesis, which means there is not enough evidence to say that there is a treatment effect.

Does this conclusion contradict that from the confidence interval in (4.62) for (α_1 + α_2)/2 − α_3, which shows a significant difference between the average of the two drugs and the placebo? Yes and no. The F test is a less focussed test, in that it is looking for any difference among the three treatments. Thus it is combining inferences for the drug versus placebo contrast, which is barely significant on its own, and the drug A versus drug D contrast, which is very insignificant. The combining drowns out the first contrast, so that overall there does not appear to be anything significant. More on this phenomenon when we get to simultaneous inference.
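For reference, the cutoff in (5.51) and the p-value of the observed F are one call each in scipy; a minimal sketch using the numbers from the example:

from scipy.stats import f

F_obs, df1, df2 = 2.14, 2, 26
cutoff = f.ppf(0.95, df1, df2)     # F_{2,26,0.05}, about 3.37
p_value = f.sf(F_obs, df1, df2)    # P[F_{2,26} > 2.14]
print(cutoff, p_value)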

5.5 The ANOVA table

The ANOVA table is based on a systematic method for arranging the important quantities in comparing two nested models, taking off from the decomposition of the sums of squares and degrees of freedom. That is, write

‖y − ŷ_0‖² = ‖ŷ_A − ŷ_0‖² + ‖y − ŷ_A‖²
n − p_0 = (p_A − p_0) + (n − p_A) (5.52)

in table form. The more generic form is

Source      Sum of squares       Degrees of freedom     Mean square          F
M_{A·0}     ‖(M_A − M_0)y‖²      trace(M_A − M_0)       SS_{A·0}/df_{A·0}    MS_{A·0}/MSE
M_A⊥        ‖(I_n − M_A)y‖²      trace(I_n − M_A)       SSE/dfE              —
M_0⊥        ‖(I_n − M_0)y‖²      trace(I_n − M_0)       —                    —

The “source” refers to a vector space, the “sum of squares” is the squared length of the projection of y on that space, the “degrees of freedom” is the rank of the space, and the “mean square” is the sum of squares divided by the rank. To be a valid ANOVA table, the first two rows of the sum of squares and degrees of freedom columns add to the third row.

Writing the table more explicitly, we have

Source      Sum of squares      Degrees of freedom     Mean square           F
M_{A·0}     ‖ŷ_A − ŷ_0‖²        p_A − p_0              SS_{A·0}/df_{A·0}     MS_{A·0}/MSE
M_A⊥        ‖y − ŷ_A‖²          n − p_A                σ̂²_e = SSE/dfE        —
M_0⊥        ‖y − ŷ_0‖²          n − p_0                —                     —

It is also common to add the R²,

R² = ‖ŷ_A − ŷ_0‖² / ‖y − ŷ_0‖². (5.53)

In practice, more evocative names are given to the sources. Typically, the M_A⊥ space is called “Error” and the bottom space M_0⊥ is called “Total.” In a one-way ANOVA, the M_{A·0} space may be called “Group effect”, or may refer to the actual groups. For example, the ANOVA table for the leprosy example would be

Source       Sum of squares    Degrees of freedom    Mean square    F
Treatment    68.56             2                     34.28          2.14
Error        417.20            26                    16.05          —
Total        485.76            28                    —              —

R² = 0.141

For another example, consider simple linear regression y_i = α + βx_i + e_i, and let M_A = C(X) and M_0 = span{1_n}, so that we are testing whether β = 0, that is, whether the x_i’s are related to the y_i’s. Now (from Question 6 of Homework #5, for example),

‖y − ŷ_0‖² = ∑_{i=1}^n (y_i − ȳ)², (5.54)

‖ŷ_A − ŷ_0‖² = β̂² ∑_{i=1}^n (x_i − x̄)² = (∑_{i=1}^n (y_i − ȳ)(x_i − x̄))² / ∑_{i=1}^n (x_i − x̄)², (5.55)

p_0 = 1 and p_A = 2. Data on 132 male athletes, with x = height and y = weight, has α̂ = 657.30 and β̂ = −5.003 (which may seem strange, being negative). The ANOVA table is


Source        Sum of squares    Degrees of freedom    Mean square    F
Regression    63925             1                     63925          11.65
Error         713076            130                   5485           —
Total         777001            131                   —              —

R² = 0.082

The F_{1,130,0.05} = 3.91, so that the β is very significant. On the other hand, R² is quite small, suggesting there is substantial variation in the data. It is partly because this model does not take into account the important factor that there are actually two sports represented in the data.


Chapter 6

One-way ANOVA

This chapter will look more closely at the one-way ANOVA model. The model has g groups, and N_i observations in group i, so that there are n = N_1 + · · · + N_g observations overall. Formally, the model is

yij = µ + αi + eij, i = 1, . . . , g; j = 1, . . . , Ni, (6.1)

where the eij ’s are independent N(0, σ2e)’s. Written in matrix form, we have

y = Xβ + e =
[ 1_{N_1}  1_{N_1}  0_{N_1}  · · ·  0_{N_1} ]
[ 1_{N_2}  0_{N_2}  1_{N_2}  · · ·  0_{N_2} ]
[    ⋮        ⋮        ⋮      ⋱       ⋮     ]
[ 1_{N_g}  0_{N_g}  0_{N_g}  · · ·  1_{N_g} ]
 (µ, α_1, α_2, …, α_g)′ + e,  e ∼ N_n(0_n, σ²_e I_n). (6.2)

The ANOVA is called balanced if there is the same number of observations in each group, that is, N_i = N, so that n = Ng. Otherwise, it is unbalanced. The balanced case is somewhat easier to analyze than the unbalanced one, but the difference is more evident in higher-way ANOVA’s. See the next chapter.

The next section gives the ANOVA table. Section 6.2 shows how to further decompose the group sum of squares into components based on orthogonal contrasts. Section 6.3 looks at “effects” and gives constraints on the parameters to make them estimable. Later, in Chapter 8, we deal with the thorny problem of multiple comparisons: A single confidence interval may have a 5% chance of missing the parameter, but with many confidence intervals, each at 95%, the chance that at least one misses its parameter can be quite high. E.g., with 100 95% confidence intervals, you’d expect about 5 to miss. How can you adjust so that the chance is 95% that they are all ok?

6.1 The ANOVA table

From all the work so far, it is easy to find the ANOVA table for testing whether there are any group effects, that is, testing whether the group means are equal. Here, M_A = C(X) for the X in (6.2), and M_0 = span{1_n}. The ranks of these spaces are, respectively, p_A = g and p_0 = 1. The projections have

ŷ_{A,ij} = ȳ_{i·}  and  ŷ_{0,ij} = ȳ, (6.3)

where

ȳ_{i·} = (1/N_i) ∑_{j=1}^{N_i} y_{ij} (6.4)

is the sample mean of the observations in group i. Then the sums of squares are immediate:

SS_{A·0} = ‖ŷ_A − ŷ_0‖² = ∑_{i=1}^g ∑_{j=1}^{N_i} (ȳ_{i·} − ȳ)² = ∑_{i=1}^g N_i(ȳ_{i·} − ȳ)²

SSE = ‖y − ŷ_A‖² = ∑_{i=1}^g ∑_{j=1}^{N_i} (y_{ij} − ȳ_{i·})²

SST = ‖y − ŷ_0‖² = ∑_{i=1}^g ∑_{j=1}^{N_i} (y_{ij} − ȳ)². (6.5)

(The SST means “sum of squares total”.) Often, the SS_{A·0} is called the between sum of squares because it measures the differences between the group means and the overall mean, and the SSE is called the within sum of squares, because it adds up the sums of squares of the deviations from each group’s mean. The table is then

Source     Sum of squares                               Degrees of freedom    Mean square    F
Between    ∑_{i=1}^g N_i(ȳ_{i·} − ȳ)²                   g − 1                 MSB            MSB/MSW
Within     ∑_{i=1}^g ∑_{j=1}^{N_i} (y_{ij} − ȳ_{i·})²   n − g                 MSW            —
Total      ∑_{i=1}^g ∑_{j=1}^{N_i} (y_{ij} − ȳ)²        n − 1                 —              —

R² = ∑_{i=1}^g N_i(ȳ_{i·} − ȳ)² / ∑_{i=1}^g ∑_{j=1}^{N_i} (y_{ij} − ȳ)².
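The table above is simple to build directly from grouped data. Here is a minimal Python sketch computing the between and within sums of squares of (6.5) and the F statistic; the three groups below are made-up placeholders, and the design happens to be unbalanced just to show that the formulas do not require equal group sizes.

import numpy as np

# Hypothetical data: three groups with unequal sizes
groups = [np.array([5., 7., 6., 4.]),
          np.array([8., 9., 7.]),
          np.array([12., 10., 11., 13., 9.])]

n = sum(len(gr) for gr in groups)
g = len(groups)
grand_mean = np.concatenate(groups).mean()

ssb = sum(len(gr) * (gr.mean() - grand_mean)**2 for gr in groups)   # between, (6.5)
ssw = sum(((gr - gr.mean())**2).sum() for gr in groups)             # within
msb, msw = ssb / (g - 1), ssw / (n - g)
print(ssb, ssw, msb / msw)    # F statistic on (g-1, n-g) degrees of freedom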

6.2 Decomposing the between sum of squares

When there are more than two groups, the between sum of squares measures a combination of all possible differences among the groups. It is usually informative to be more specific about differences. One approach is to further decompose the between sum of squares using orthogonal contrasts. A contrast of a vector of parameters is a linear combination in which the coefficients sum to zero. For example, in the leprosy data we looked at contrasts of the group means (µ_1, µ_2, µ_3)′, or, equivalently, the (α_1, α_2, α_3)′:

γ_1 = (1/2)(α_1 + α_2) − α_3 = (1/2, 1/2, −1)(α_1, α_2, α_3)′, and
γ_2 = α_1 − α_2 = (1, −1, 0)(α_1, α_2, α_3)′. (6.6)


These two parameters are enough to describe any differences between the α_i’s, which is to say that they can be used as parameters in place of the α_i’s in the ANOVA model:

y = X*β* + e =
[ 1_{10}    (1/3)1_{10}     (1/2)1_{10} ]
[ 1_{10}    (1/3)1_{10}    −(1/2)1_{10} ]
[ 1_{10}   −(2/3)1_{10}        0_{10}   ]
 (µ, γ_1, γ_2)′ + e. (6.7)

[The model comes from the model (5.1) by solving for the α_i’s in terms of the γ_i’s in the equations (6.6) plus α_1 + α_2 + α_3 = 0.] This model (6.7) is indeed the regular one-way ANOVA model (5.1), that is, M_A = C(X*) = C(X). Note also that the columns of X* are orthogonal, which in particular means that

M_{A·0} = span{ ((1/3)1_{10}′, (1/3)1_{10}′, −(2/3)1_{10}′)′, ((1/2)1_{10}′, −(1/2)1_{10}′, 0_{10}′)′ }. (6.8)

More generally, suppose that there are orthogonal and nonzero vectors x_1, . . . , x_q such that

M_{A·0} = span{x_1, . . . , x_q}. (6.9)

(So that in the one-way ANOVA case, q = g − 1.) The projection onto M_{A·0} can be decomposed into the projections on the one-dimensional spaces M_1, . . . , M_q, where

M_k = span{x_k}. (6.10)

That is, if ŷ_{A·0} is the projection of y onto M_{A·0}, and ŷ_k is the projection onto M_k, k = 1, . . . , q, then

ŷ_{A·0} = ŷ_1 + · · · + ŷ_q, (6.11)

and, because the M_k’s are mutually orthogonal,

‖ŷ_{A·0}‖² = ‖ŷ_1‖² + · · · + ‖ŷ_q‖². (6.12)

Proof of (6.11). We will show that z ≡ ŷ_1 + · · · + ŷ_q is the projection onto M_{A·0}. First, z ∈ M_{A·0}, since it is a linear combination of the x_k’s, which are all in M_{A·0}. Next, consider y − z. To see that it is orthogonal to M_{A·0}, it is enough to show that it is orthogonal to each x_k, by (6.9). To that end,

(y − z)′x_k = −ŷ_1′x_k − · · · − ŷ_{k−1}′x_k + (y − ŷ_k)′x_k − ŷ_{k+1}′x_k − · · · − ŷ_q′x_k. (6.13)

But ŷ_j′x_k = 0 if j ≠ k, because the x_j’s are orthogonal, and (y − ŷ_k)′x_k = 0, because ŷ_k is the projection of y onto M_k. Thus (y − z)′x_k = 0, which means z = ŷ_{A·0}. 2

We know that the projection matrix for M_k is

M_k = x_k(x_k′x_k)⁻¹x_k′ = x_kx_k′/‖x_k‖², (6.14)


hence

‖ŷ_k‖² = ‖M_ky‖² = y′M_ky = (y′x_k)²/‖x_k‖². (6.15)

The decomposition (6.12) then leads to an expanded ANOVA table, inserting a row for each M_k:

Source       Sum of squares                               Degrees of freedom    Mean square    F
Between
  M_1        (y′x_1)²/‖x_1‖²                              1                     MS_1           MS_1/MSW
  ⋮          ⋮                                            ⋮                     ⋮              ⋮
  M_{g−1}    (y′x_{g−1})²/‖x_{g−1}‖²                      1                     MS_{g−1}       MS_{g−1}/MSW
Within       ∑_{i=1}^g ∑_{j=1}^{N_i} (y_{ij} − ȳ_{i·})²   n − g                 MSW            —
Total        ∑_{i=1}^g ∑_{j=1}^{N_i} (y_{ij} − ȳ)²        n − 1                 —              —

Because the rank(M_k) = 1 for these vector spaces, the degrees of freedom are all 1, and the SS_k = MS_k. Proposition 17 shows that the ŷ_k’s are independent (because M_kM_j = 0 if k ≠ j), and

ŷ_k ∼ N_n(M_k(µ_A − µ_0), σ²_e M_k). (6.16)

Thus under the null hypothesis, since trace(M_k) = 1,

‖ŷ_1‖², . . . , ‖ŷ_{g−1}‖² are independent σ²_e χ²_1’s. (6.17)

Thus the F_k = MS_k/MSW are indeed F_{1,n−g}’s, although they are not quite independent due to the common denominator MSW.

These sums of squares can also be written in terms of the estimates of parameters in the generalization of the model (6.7),

y = X*β* + e = (1_n, x_1, . . . , x_q) (µ, γ_1, …, γ_q)′ + e. (6.18)

By orthogonality of the columns of X*, we have that

β̂* = (X*′X*)⁻¹X*′y = (ȳ, x_1′y/‖x_1‖², …, x_q′y/‖x_q‖²)′. (6.19)

Then with γ̂_k = x_k′y/‖x_k‖²,

‖ŷ_k‖² = (y′x_k)²/‖x_k‖² = ‖x_k‖² γ̂_k². (6.20)


6.2.1 Example

Return to the one-way ANOVA model without covariate, (5.1), for the leprosy data. The table below has the basic statistics:

Group i       ȳ_{i·}    ∑_{j=1}^{10} (y_{ij} − ȳ_{i·})²
1. Drug A     5.3       194.1
2. Drug D     6.1       340.9
3. Placebo    12.3      460.1
 (6.21)

Then the SSW = 194.1 + 340.9 + 460.1 = 995.1, and with ȳ = 7.9,

SSB = ∑_{i=1}^3 ∑_{j=1}^{10} (ȳ_{i·} − ȳ)² = 10[(5.3 − 7.9)² + (6.1 − 7.9)² + (12.3 − 7.9)²] = 293.6. (6.22)

Hence the regular ANOVA table, with g = 3, is

Source     Sum of squares    Degrees of freedom    Mean square    F
Between    293.6             2                     146.8          3.98
Within     995.1             27                    36.86          —
Total      1288.7            29                    —              —

R² = 0.2278.

The F_{2,27,0.05} = 3.354, so that there is a significant group effect. Recall from Section 5.5 that when using the covariate, the group differences are not significant.

Now we decompose the SSB, using the x_1 and x_2 as in (6.8). From (6.6),

γ̂_1 = (α̂_1 + α̂_2)/2 − α̂_3 = (5.3 + 6.1)/2 − 12.3 = −6.6, and
γ̂_2 = α̂_1 − α̂_2 = 5.3 − 6.1 = −0.8. (6.23)

Thus

SS_1 = ‖x_1‖² γ̂_1² = (10(1/3)² + 10(1/3)² + 10(−2/3)²)(−6.6)² = 290.4, and
SS_2 = ‖x_2‖² γ̂_2² = (10(1/2)² + 10(−1/2)²)(−0.8)² = 3.2. (6.24)

Note that indeed, SSB = 293.6 = SS_1 + SS_2 = 290.4 + 3.2. The expanded ANOVA table is then

Source                Sum of squares    Degrees of freedom    Mean square    F
Between
  Drug vs Placebo     290.4             1                     290.4          7.88
  Drug A vs Drug D    3.2               1                     3.2            0.09
Within                995.1             27                    36.86          —
Total                 1288.7            29                    —              —

Now F_{1,27,0.05} = 4.21, so that the Drug vs Placebo contrast is quite significant, but the contrast between the two drugs is completely nonsignificant.
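The contrast sums of squares above can be reproduced from the group means alone. The sketch below uses the fact that, in a balanced design, SS_k = N(c′ȳ)²/‖c‖² for a contrast with coefficient vector c; this is an equivalent form of (6.20), since the corresponding column x_k of X* is just a rescaling of c repeated within each group. The group means are the ones quoted in (6.21); everything else is standard numpy.

import numpy as np

ybar = np.array([5.3, 6.1, 12.3])   # group means (Drug A, Drug D, Placebo)
N = 10                              # balanced: 10 observations per group

# Contrast coefficient vectors from (6.25)
contrasts = {"Drug vs Placebo": np.array([0.5, 0.5, -1.0]),
             "Drug A vs Drug D": np.array([1.0, -1.0, 0.0])}

for name, c in contrasts.items():
    gamma_hat = c @ ybar                     # contrast estimate
    ss = N * gamma_hat**2 / (c @ c)          # equals ||x_k||^2 * gamma_hat^2 in (6.20)
    print(name, gamma_hat, ss)               # -6.6, 290.4 and -0.8, 3.2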


6.2.2 Limitations of the decomposition

Every subspace has an orthogonal basis, many essentially different such bases if the rank is more than one. Unfortunately, the resulting subspaces M_k need not be ones of interest. In the leprosy example, the two subspaces nicely corresponded to two interesting contrasts. Such nice results will occur in the balanced one-way ANOVA if one is interested in a set of orthogonal contrasts of the α_i’s. A contrast of α = (α_1, . . . , α_g)′ is a linear combination c′α, where c is any nonzero g × 1 vector c such that c_1 + · · · + c_g = 0. Two contrasts c_1′α and c_2′α are orthogonal if their vectors c_1 and c_2 are orthogonal. For example, the γ_i’s in (6.6) are orthogonal contrasts of α with

c_1 = (1/2, 1/2, −1)′ and c_2 = (1, −1, 0)′. (6.25)

It is easy to see that these two vectors are orthogonal, and their components sum to 0.

If the contrasts of interest are not orthogonal, then their respective sums of squares do not sum to the between sum of squares. E.g., we may be interested in comparing each drug to the placebo, so that the contrast vectors are (1, 0, −1)′ and (0, 1, −1)′. Although these are contrasts, the two vectors are not orthogonal. Worse, if the model is unbalanced, even orthogonal contrasts will not translate back to an orthogonal basis for M_{A·0}. Also, the model with covariates will not allow the decomposition, even with a balanced design.

On the positive side, balanced higher-way ANOVA models do allow nice decompositions.

6.3 Effects

We know that the general one-way ANOVA model (6.1), (6.2), is often parametrized in such a way that the parameters are not estimable. That usually will not present a problem, because interest is in contrasts of the α_i’s, which are estimable, or (equivalently) testing whether there are any differences among the groups. Alternatively, one can place constraints on the parameters so that they are estimable. E.g., one could set µ = 0, or α_g = 0, or α_1 + · · · + α_g = 0. One method for setting constraints is to define effects for the groups. Let µ_i be the population mean of group i, i.e., µ_i = µ + α_i. The idea is to have a benchmark value µ*, being some weighted average of the µ_i’s. Then the effect of a group is the amount that group’s mean exceeds µ*, that is, the group i effect is defined to be

α_i = µ_i − µ*,  µ* = ∑_{i=1}^g w_iµ_i, (6.26)

where the w_i’s are nonnegative and sum to 1. Backing up, we see that (6.26) implies the constraint that µ = µ*, or equivalently, that

∑_{i=1}^g w_iα_i = 0. (6.27)


One choice for µ* is the unweighted average of the µ_i’s, that is, w_i = 1/g, so that µ* = ∑µ_i/g. The resulting constraint (6.27) is that the unweighted average of the α_i’s is 0, which is the same as saying the sum of the α_i’s is 0. The least squares estimate of µ_i is ȳ_{i·}, so that this constraint leads to the estimates

µ̂ = ∑_{i=1}^g ȳ_{i·}/g  and  α̂_i = ȳ_{i·} − µ̂. (6.28)

In the balanced case, (6.28) becomes

µ̂ = ∑_{i=1}^g (∑_{j=1}^N y_{ij}/N)/g = ∑_{i=1}^g ∑_{j=1}^N y_{ij}/n = ȳ_{··}  and  α̂_i = ȳ_{i·} − ȳ_{··}. (6.29)

That is, µ̂ is the straight average of the y_{ij}’s. In this case, the SSB in (6.5) can be easily written as a function of the estimated effects,

SSB = ∑_{i=1}^g ∑_{j=1}^N (ȳ_{i·} − ȳ_{··})² = N ∑_{i=1}^g α̂_i². (6.30)

In the unbalanced case, (6.29) is not true, hence neither is (6.30). An alternative is to weight the µ_i’s by the numbers of observations in the groups, i.e., w_i = N_i/n. Then

µ = ∑_{i=1}^g N_iµ_i/n  (hence ∑_{i=1}^g N_iα_i = 0), (6.31)

and

µ̂ = ∑_{i=1}^g N_iȳ_{i·}/n = ∑_{i=1}^g ∑_{j=1}^{N_i} y_{ij}/n = ȳ_{··},  α̂_i = ȳ_{i·} − ȳ_{··}, (6.32)

and (6.30) does hold. One objection to using the weighting (6.31) is that groups that happen to have more observations are more strongly represented in the benchmark, hence their effect would seem smaller. E.g., in an extreme case with g = 2, suppose N_1 = 99 and N_2 = 1. Then α_2 = −99α_1, so that the effect of the second group will always look much larger than that for group 1. Using the unweighted mean (6.28), α_2 = −α_1, so the groups are treated equally.

Another approach is to weight the groups according to their representation in the population, so that µ would be the population mean of y. This approach assumes there is an existing population, e.g., the groups are men and women, or based on age, etc. It would not be relevant in the leprosy example, since the population is not taking Drug A, Drug D, and a placebo.


Chapter 7

Two-way ANOVA

Now observations are categorized according to two variables, e.g., amount of sun/shade and type of fruit for the data in (1.23). The table contains the leaf area/dry weight for the citrus trees:

              Orange    Grapefruit    Mandarin
Sun           112       90            123
Half-shade    86        73            89
Shade         80        62            81
 (7.1)

(From Table 11.2.1 in Statistical Methods by Snedecor and Cochran.) Writing it in terms of the y_{ij}:

              Orange    Grapefruit    Mandarin
Sun           y_11      y_12          y_13
Half-shade    y_21      y_22          y_23
Shade         y_31      y_32          y_33
 (7.2)

This is a 3 × 3 layout: three rows and three columns. More generally, one has r rows and ccolumns, for an “r × c layout,” and there could be more than one observation in each cell.For example, here is a 2 × 3 layout with varying numbers of observations per cell:

         Column 1            Column 2             Column 3
Row 1    y111, y112, y113    y121, y122           y131, y132, y133, y134
Row 2    y211                y221, y222, y223     y231, y232, y233
                                                                         (7.3)

The yijk is the kth observation in the ith row and jth column. The most general model sets µij = E[yijk], the population mean of the ijth cell, so that

yijk = µij + eijk, i = 1, . . . , r, j = 1, . . . , c, k = 1, . . . , Nij . (7.4)

The Nij is the number of observations in cell ij. E.g., in (7.3), N11 = 3, N12 = 2, N13 = 4, etc. As in one-way ANOVA, the design is called balanced if the Nij are all equal to, say, N. E.g., (7.2) is balanced, with N = 1 (hence we do not bother with the third subscript), but (7.3) is unbalanced.

Questions to ask include:


• Are there row effects? That is, are the means equal for the r rows?

• Are there column effects?

• Is there interaction? That is, are the row effects different for different columns, or vice versa?

More detailed questions will also be important, such as, "If there are differences, what are they?"

The two basic models are additive, such as in (1.27), and nonadditive, as in (1.28). Additive means that there is no interaction between rows and columns, that is, the difference between the means in any two rows is the same for each column. In the 3 × 3 case, that requirement is

µ11 − µ21 = µ12 − µ22 = µ13 − µ23;

µ11 − µ31 = µ12 − µ32 = µ13 − µ33;

µ21 − µ31 = µ22 − µ32 = µ23 − µ33; (7.5)

and in general,

µij − µi′j = µij′ − µi′j′ for all i, i′, j, j′, (7.6)

i.e.,

µij − µi′j − µij′ + µi′j′ = 0 for all i, i′, j, j′. (7.7)

The no interaction requirement (7.7) is equivalent to the restriction that the group means satisfy

µij = µ + αi + βj       (7.8)

for some µ, αi's and βj's. The model (7.8) is called additive because the row effect and column effect are added. The nonadditive model allows interaction, and is often called the saturated model, because it makes no restrictions on the µij's. The interaction terms are γij, so that the means have the form

µij = µ + αi + βj + γij.       (7.9)

As is, the formulation (7.9) is extremely overparametrized. E.g., one could get rid of µ and the αi's and βj's, and still have the same model. Typically one places restrictions on the parameters so that the parameters are effects as in Section 6.3. At least in the balanced case, the common definitions are

µ = µ·· ,

αi = µi· − µ·· ,

βj = µ·j − µ·· ,

γij = µij − µi· − µ·j + µ·· . (7.10)


One can check that these definitions imply the constraints

∑_{i=1}^r αi = 0,   ∑_{j=1}^c βj = 0,   ∑_{i=1}^r γij = 0 for each j,   ∑_{j=1}^c γij = 0 for each i.       (7.11)

In the unbalanced case, one may wish to use weighted means.
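As a quick illustration of the definitions (7.10), here is an R sketch that computes the effects from a table of cell means. It uses the citrus cell means in (7.1), where each cell has a single observation, so yij· = yij.

# Effects (7.10) computed from the cell means of (7.1).
mu.cells <- matrix(c(112, 90, 123,
                      86, 73,  89,
                      80, 62,  81), nrow = 3, byrow = TRUE,
                   dimnames = list(c("Sun", "Half-shade", "Shade"),
                                   c("Orange", "Grapefruit", "Mandarin")))
mu    <- mean(mu.cells)                            # grand mean
alpha <- rowMeans(mu.cells) - mu                   # row effects
beta  <- colMeans(mu.cells) - mu                   # column effects
gamma <- mu.cells - outer(alpha, beta, "+") - mu   # interaction effects
# The constraints (7.11): each of these sums is (numerically) zero.
round(c(sum(alpha), sum(beta), sum(gamma[1, ]), sum(gamma[, 1])), 10)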

There are five basic models to consider, four of which are special cases of the saturated model. They are

Saturated:                              µij = µ + αi + βj + γij ;
Additive:                               µij = µ + αi + βj ;
Just row effect (no column effect):     µij = µ + αi ;
Just column effect (no row effect):     µij = µ + βj ;
No row or column effect:                µij = µ .
                                                        (7.12)

Notice that if the interaction terms are in the model, then both the row and column effects are, too. The reason is that "no row effects" means that the row variable does not affect the y, so that there cannot be any interactions. If there are, then the effect of the rows is different for the different columns, meaning it cannot be uniformly 0. Thus µij = µ + βj + γij is not a sensible model. (Of course, there may be rare situations when such a model makes sense. I've never seen one.)

The next section looks at the subspaces corresponding to these models, without worrying too much about constraints, and at fitting and testing the models.

7.1 The subspaces and projections

We start by writing the saturated model as y = Xβ + e. We write out the vectors and matrices for the 2 × 3 layout below. See (1.36) for another example. We put the observations into the vector y by going across the rows in (7.3). The matrices are then


Stacking the cells in the order (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), the design matrix is

      Grand mean    Rows               Columns                     Interactions
X = [   1N11       1N11    0        1N11    0       0         1N11   0      0      0      0      0
        1N12       1N12    0         0     1N12     0          0    1N12    0      0      0      0
        1N13       1N13    0         0      0      1N13        0     0     1N13    0      0      0
        1N21        0     1N21      1N21    0       0          0     0      0     1N21    0      0
        1N22        0     1N22       0     1N22     0          0     0      0      0     1N22    0
        1N23        0     1N23       0      0      1N23        0     0      0      0      0     1N23 ]

  = ( 1n,  x(R)1, x(R)2,  x(C)1, x(C)2, x(C)3,  x(I)11, x(I)12, x(I)13, x(I)21, x(I)22, x(I)23 ),

y = (y111, . . . , y11N11, y121, . . . , y12N12, y131, . . . , y13N13, y211, . . . , y21N21, y221, . . . , y22N22, y231, . . . , y23N23)′,

β = (µ, α1, α2, β1, β2, β3, γ11, γ12, γ13, γ21, γ22, γ23)′.       (7.13)

The x(R)i is the vector indicating observations in row i, the x(C)j indicates column j, and the x(I)ij indicates the ijth cell. The subspaces for the various models are next:


Saturated:                MR×C = C(X) ;
Additive:                 MR+C = span{1n, x(R)1, . . . , x(R)r, x(C)1, . . . , x(C)c} ;
Just row effect:          MR = span{1n, x(R)1, . . . , x(R)r} ;
Just column effect:       MC = span{1n, x(C)1, . . . , x(C)c} ;
No row or column effect:  M∅ = span{1n} .
                                                                  (7.14)

The R and C notations are self-explanatory. The R × C indicates that the row, column and interaction effects are in the model, whereas R + C means just the row and column effects, i.e., no interactions.

The projections and ranks for these models, except the additive model, are easy to obtain using ideas from one-way ANOVA. The saturated model is really a large one-way ANOVA with G = r × c groups, so the projection has elements yij· and the rank is r × c, at least if all Nij > 0. The rank in general is the number of cells with Nij > 0. The model with just row effects is a one-way ANOVA with the rows as groups, hence the projection has elements yi·· and rank r. Similarly for the model with just column effects. The model without any row or column effects is also familiar, y∅ = y···1n, with rank 1. Here is a summary:

Model                      Subspace    Projection yijk           Rank
Saturated                  MR×C        yij·                      r × c
Additive                   MR+C        yi·· + y·j· − y··· *      r + c − 1
Just row effect            MR          yi··                      r
Just column effect         MC          y·j·                      c
No row or column effect    M∅          y···                      1
* Only for the balanced case
                                                                  (7.15)

The results for the additive model need some explanation. From (7.14) we see that MR+C is the span of r + c + 1 vectors. It is easy to see that 1n = x(R)1 + · · · + x(R)r, so that we can eliminate 1n from the set of spanning vectors. The remaining r + c vectors are still not linearly independent, because

(x(R)1 + · · · + x(R)r) − (x(C)1 + · · · + x(C)c) = 1n − 1n = 0n.       (7.16)

Let us drop one of those, say x(C)c, and suppose that

z ≡ a1x(R)1 + · · · + arx(R)r + ar+1x(C)1 + · · · + ar+c−1x(C)c−1 = 0n.       (7.17)

In the 2 × 3 case of (7.14),

z = ( (a1 + a3)1N11′, (a1 + a4)1N12′, a1 1N13′, (a2 + a3)1N21′, (a2 + a4)1N22′, a2 1N23′ )′.       (7.18)


We now assume that Nij > 0 for all cells, i.e., there are no empty cells. (If there are, the rank can be a bit more challenging to figure out.) Consider the elements in the cth column, i.e., the zick's. Because that column's vector is missing, only the coefficient multiplying the appropriate row vector shows up, so that zick = ai. Thus if z = 0n, it must be that a1 = · · · = ar = 0. But then from (7.17) it is easy to see that ar+1 = · · · = ar+c−1 = 0 as well. Thus those vectors are linearly independent, showing that the rank of MR+C is indeed r + c − 1.

In the unbalanced case, the projection for the additive model has no simple form. One way to obtain the projection matrix is to use X∗ = (x(R)1, . . . , x(R)r, x(C)1, . . . , x(C)c−1), so that MR+C = X∗(X∗′X∗)−1X∗′.

Turn to the balanced case, so that all Nij = N. The table (7.15) has that the projection is

yR+C = −y···1n + y1··x(R)1 + · · · + yr··x(R)r + y·1·x(C)1 + · · · + y·c·x(C)c.       (7.19)

That vector is clearly in MR+C, being a linear combination of the spanning vectors. We also need that y − yR+C ⊥ MR+C. Start by looking at x(R)i:

(y − yR+C)′x(R)i = y′x(R)i + y···1n′x(R)i − y1··x(R)1′x(R)i − · · · − yr··x(R)r′x(R)i − y·1·x(C)1′x(R)i − · · · − y·c·x(C)c′x(R)i.       (7.20)

By inspecting (7.14) with Nij = N, one can see that

y′x(R)i = ∑_{j=1}^c ∑_{k=1}^N yijk = yi··Nc,
1n′x(R)i = Nc,
x(R)i′x(R)i = Nc,
x(R)i′x(R)j = 0 if i ≠ j,
x(C)j′x(R)i = N,       (7.21)

hence

(y − yR+C)′x(R)i = yi··Nc + y···Nc − yi··Nc − y·1·N − · · · − y·c·N
  = y···Nc − (y·1· + · · · + y·c·)N
  = (∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N yijk)Nc/(Ncr) − (∑_{i=1}^r ∑_{k=1}^N yi1k/(Nr) + · · · + ∑_{i=1}^r ∑_{k=1}^N yick/(Nr))N
  = ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N yijk/r − (∑_{i=1}^r ∑_{k=1}^N yi1k + · · · + ∑_{i=1}^r ∑_{k=1}^N yick)/r
  = 0.       (7.22)

Likewise, (y − yR+C)′x(C)j = 0 for each j. Thus y − yR+C ⊥ MR+C, showing that (7.19) is indeed the projection.
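Here is a short R sketch of these projection facts on a small made-up layout; the data are random numbers, and the only point is to check that the X∗ construction and, in the balanced case, the yi·· + y·j· − y··· formula agree with the least squares fits. (model.matrix builds a different but equivalent basis for MR+C, so the projection is the same.)

# Unbalanced 2 x 3 layout with the Nij of (7.3).
row <- factor(c(1,1,1, 1,1, 1,1,1,1, 2, 2,2,2, 2,2,2))
col <- factor(c(1,1,1, 2,2, 3,3,3,3, 1, 2,2,2, 3,3,3))
y   <- rnorm(length(row))                      # any response values

Xstar <- model.matrix(~ row + col)             # r + c - 1 = 4 linearly independent columns
M.add <- Xstar %*% solve(t(Xstar) %*% Xstar) %*% t(Xstar)   # projection matrix onto M_R+C
all.equal(as.vector(M.add %*% y), as.vector(fitted(lm(y ~ row + col))))   # TRUE

# Balanced 2 x 3 layout with N = 2: the additive fit is yi.. + y.j. - y...
rowb <- factor(rep(1:2, each = 6))
colb <- factor(rep(rep(1:3, each = 2), 2))
yb   <- rnorm(12)
byhand <- ave(yb, rowb) + ave(yb, colb) - mean(yb)
all.equal(as.vector(fitted(lm(yb ~ rowb + colb))), as.vector(byhand))     # TRUE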


7.2 Testing hypotheses

There are many possible hypothesis testing problems based on nested subspaces among those in (7.12), e.g., to test whether there are interactions, we would have

H0 : µ ∈ MR+C (additive model) vs. HA : µ ∈ MR×C (saturated model), (7.23)

or to test whether there are row effects in the additive model,

H0 : µ ∈ MC (just column effect, no row effect) vs. HA : µ ∈ MR+C (additive model).       (7.24)

The picture below shows the various possible nestings, where the arrow points from the smaller to the larger subspace, that is, MR+C → MR×C means that MR+C ⊂ MR×C.

              MR
           ↗      ↘
    M∅                MR+C  →  MR×C
           ↘      ↗
              MC

Notice that MR and MC are not nested, but every other pair is (one way or the other). That does not mean one cannot test MR versus MC, just that the approach we are taking with F tests and ANOVA tables does not work.

Testing for interaction, testing for row effects, and testing for column effects are the main goals. Testing for interaction is unambiguous, but there are several ways to set up testing for row or column effects. Here are some possible testing problems:

• Testing for interaction – The additive versus the saturated model:

H0 : µ ∈ MR+C versus HA : µ ∈ MR×C . (7.25)

• Testing for row effects:

– With the understanding that if there are interactions, there automatically are row effects, one defines total row effects as either nonzero interactions, or if interactions are zero, nonzero row effects. The null hypothesis is then MC, and the alternative is the saturated model:

H0 : µ ∈ MC versus HA : µ ∈ MR×C . (7.26)

– One may have already decided that there are no interactions, but there may be column effects. In that case, the testing problem is like in (7.26), but the alternative is the additive model:

H0 : µ ∈ MC versus HA : µ ∈ MR+C . (7.27)


– It may be that one has already decided there are no column effects, either, so that the problem is really a one-way ANOVA:

H0 : µ ∈ M∅ versus HA : µ ∈ MR. (7.28)

• Testing for column effects, analogous to testing for row effects:

– Testing total column effects

H0 : µ ∈ MR versus HA : µ ∈ MR×C . (7.29)

– Assuming there are no interactions:

H0 : µ ∈ MR versus HA : µ ∈ MR+C . (7.30)

– Assuming there are no row effects:

H0 : µ ∈ M∅ versus HA : µ ∈ MC . (7.31)

• Simultaneously testing for row and column effects:

– Testing total effects

H0 : µ ∈ M∅ versus HA : µ ∈ MR×C . (7.32)

– Assuming there are no interactions:

H0 : µ ∈ M∅ versus HA : µ ∈ MR+C . (7.33)

The ANOVA table for any of these testing problems is straightforward to calculate if all Nij > 0 (see table (7.15)), although if the design is unbalanced and involves the additive model, the yR+C may require a matrix inversion.

7.2.1 Example

The text (Table 7.4) repeats some data from Scheffe, The Analysis of Variance, problem 4.8, about the weights of female rats. The experiment had litters of rats born to one mother but raised by another. The row factor is the genotype of the litter, and the column factor is the genotype of the foster mother. There were n = 61 litters, and the y's are the average weight of the litters in grams at 28 days. The design is unbalanced. The tables below contain the cell means, yij·, and the Nij's, respectively. See the book for the individual data points.


yij·
Foster mother →      A        F        I        J       yi··
Litter ↓
A                  63.680   52.400   54.125   48.960   55.112
F                  52.325   60.640   53.925   45.900   54.667
I                  47.100   64.367   51.600   49.433   52.907
J                  54.350   56.100   54.533   49.060   52.973
y·j·               55.400   58.700   53.362   48.680   y··· = 53.970
                                                                      (7.34)

Nij
Foster mother →      A     F     I     J    Total
Litter ↓
A                    5     3     4     5     17
F                    4     5     4     2     15
I                    3     3     5     3     14
J                    4     3     3     5     15
Total               16    14    16    15     61
                                                   (7.35)

These two tables give enough information to easily find the sums of squares for all but the additive subspace. To find the projection onto the additive subspace MR+C, we need to set up an X matrix with linearly independent columns. There are a number of possibilities. We will place the restrictions on the parameters:

∑_{i=1}^4 αi = 0   and   ∑_{j=1}^4 βj = 0,       (7.36)

which means that α4 = −α1 − α2 − α3 (and likewise for the βj's), giving

X = (1_61, x(R)1 − x(R)4, x(R)2 − x(R)4, x(R)3 − x(R)4, x(C)1 − x(C)4, x(C)2 − x(C)4, x(C)3 − x(C)4),       (7.37)

and β = (µ, α1, α2, α3, β1, β2, β3)′. Note that there are r + c − 1 = 7 columns, which is the correct number. The next table gives the estimates and the projections for each ij:

Additive model y(R+C)ijk
Foster mother →      A        F        I        J        αi
Litter ↓
A                  56.909   60.425   55.077   50.154     1.675
F                  54.884   58.400   53.052   48.129    −0.350
I                  54.255   57.771   52.423   47.501    −0.979
J                  54.888   58.404   53.056   48.133    −0.346
βj                  1.268    4.784   −0.564   −5.487    µ = 53.966
                                                                      (7.38)

The next table gives the sums of squares, and ranks, for the subspaces.


Subspace    SS                       rank
M∅          ‖y∅‖² = 177681.7           1
MR          ‖yR‖² = 177741.8           4
MC          ‖yC‖² = 178453.3           4
MR+C        ‖yR+C‖² = 178516.9         7
MR×C        ‖yR×C‖² = 179341.0        16
Rn          ‖y‖² = 181781.8           61
                                            (7.39)

Interaction. The first test is usually to see if there is any interaction. The next plots illustrate the interactions. The left-hand plot shows the yij·'s versus column number j, with the means for each row connected. If there is exact additivity, then these lines should be parallel. One of the lines looks particularly nonparallel to the others. The right-hand plot has the fits for the additive model, the y(R+C)ij1 = µ + αi + βj's. Because the model enforces additivity, the lines are parallel.

[Two plots of average weight versus column number: left panel, "Raw data," showing the cell means yij· with each row's means connected; right panel, "Additive model," showing the additive fits, which give parallel lines.]

The ANOVA table for testing additivity, MR+C versus MR×C, is now easy. From table (7.39),

‖y − yR×C‖² = 181781.8 − 179341.0 = 2440.8,   and   ‖y − yR+C‖² = 181781.8 − 178516.9 = 3264.9.       (7.40)

The degrees of freedom for those two sums of squares are, respectively, n − rc and n − (r + c − 1), hence the ANOVA table is

Source           Sum of squares   Degrees of freedom   Mean square      F
Interactions          824.1                9               91.567     1.689
Error                2440.8               45               54.240       —
Total                3264.9               54                 —          —


The F9,45,0.05 = 2.096, so that the interactions are not significantly different from 0. Thus, though the plot may suggest interactions, according to this test they are not statistically significant, so that there is not enough evidence to reject additivity.

Note that the degrees of freedom for interaction are (n − (r + c − 1)) − (n − rc) = (r − 1) × (c − 1).

Testing for row effects. Presuming no interactions, we have two choices for testing whether there are row effects: assuming there are no column effects, or allowing column effects. We will do both.

First, let us allow column effects, so that M0 = MC and MA = MR+C. Then ‖y − yR+C‖² = 3264.9 from (7.40), and

‖y − yC‖² = 181781.8 − 178453.3 = 3328.5.       (7.41)

The ANOVA table:

Source    Sum of squares   Degrees of freedom   Mean square      F
Rows             63.6                3              21.200     0.531
Error          3264.9               54              60.461       —
Total          3328.5               57                —          —

The rows are not at all significant.

Now suppose there are no column effects, so that M0 = M∅ and MA = MR. This time,

‖y − yR‖² = 181781.8 − 177741.8 = 4040.0,   ‖y − y∅‖² = 181781.8 − 177681.7 = 4100.1,       (7.42)

and

Source    Sum of squares   Degrees of freedom   Mean square      F
Rows             60.1                3              20.033     0.283
Error          4040.0               57              70.877       —
Total          4100.1               60                —          —

Again, the conclusion is no evidence of a row effect, that is, of an effect of the litters' genotypes. Note that though the conclusion is the same, the calculations are different depending on what one assumes about the column effects.

Testing for column effects. From above, it seems perfectly reasonable to assume no row effects, so we will do just that. Then to test for column effects, we have M0 = M∅ and MA = MC. Equations (7.41) and (7.42) already have the necessary calculations, so the ANOVA table is

Source     Sum of squares   Degrees of freedom   Mean square      F
Columns          771.6                3              257.2      3.629
Error          3328.5               57              70.877       —
Total          4100.1               60                —          —

Now F3,57,0.05 = 2.766, so there does appear to be a significant column effect, that is, the foster mother's genotype does appear to affect the weights of the litters.

Taking everything together, it looks like the best model has just the column effects, yijk = µ + βj + eijk.
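In practice one would let software do this arithmetic. The following R sketch shows how the three tests of this example could be run, assuming the raw data (not listed here) have been entered in a data frame; the data frame name and column names are hypothetical.

# Sketch only: `rats` is assumed to have columns weight, litter (A,F,I,J), foster (A,F,I,J).
fit.sat <- lm(weight ~ litter * foster, data = rats)   # saturated model, M_RxC
fit.add <- lm(weight ~ litter + foster, data = rats)   # additive model,  M_R+C
fit.col <- lm(weight ~ foster, data = rats)            # just column effects, M_C
fit.0   <- lm(weight ~ 1, data = rats)                 # no effects, M_0

anova(fit.add, fit.sat)   # test for interaction, F on 9 and 45 df
anova(fit.col, fit.add)   # row (litter) effects, allowing column effects
anova(fit.0,   fit.col)   # column (foster mother) effects, assuming no row effects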


7.3 The balanced case

The example in the previous section required specifying whether or not there are row effects before knowing how to test for column effects, and vice versa. That is,

‖yR+C − yC‖² ≠ ‖yR − y∅‖²   and   ‖yR+C − yR‖² ≠ ‖yC − y∅‖².       (7.43)

It turns out that in the balanced case, the row sums of squares are the same whether or not the column effects are in the model, so that there are equalities in (7.43). The main reason is the next lemma.

Lemma 3 In the balanced case, MR·∅ ⊥ MC·∅.

Proof. Elements x ∈ MR·∅ are in MR, which means xijk = ai, and are orthogonal to M∅, which means that the elements sum to zero. Because the design is balanced,

∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N xijk = ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N ai = (cN) ∑_{i=1}^r ai,       (7.44)

hence

∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N xijk = 0   =⇒   ∑_{i=1}^r ai = 0.       (7.45)

Similarly, if z ∈ MC·∅, zijk = bj, and

∑_{j=1}^c bj = 0.       (7.46)

Then

x′z = ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N xijkzijk = ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N aibj = N(∑_{i=1}^r ai)(∑_{j=1}^c bj) = 0.       (7.47)

Thus x ⊥ z, i.e., MR·∅ ⊥ MC·∅. □

Now the main result shows that the projection onto the additive-effect space can be decomposed into the projections onto the row- and column-effect spaces.

Proposition 18 In the balanced case,

y(R+C)·∅ = yR·∅ + yC·∅.       (7.48)

"Proof." The proof is basically the same as for the decomposition (6.11) of the between sum of squares. That is, one has to show that yR·∅ + yC·∅ ∈ M(R+C)·∅ and y − (yR·∅ + yC·∅) ⊥ M(R+C)·∅. The proof of these will be left to the reader. (Or more accurately, to the homework-doers.) □


Even in the unbalanced case, y(R+C)·∅ can be written as the sum of something in MR·∅ plus something in MC·∅. It is just that in the unbalanced case, those two somethings might not be the projections on their respective spaces.

Now we can expand on the decomposition (7.48). Clearly,

y − y∅ = (y − yR×C) + (yR×C − yR+C) + (yR+C − y∅),       (7.49)

but then (7.48) yields

y − y∅ = (y − yR×C) + (yR×C − yR+C) + (yR − y∅) + (yC − y∅).       (7.50)

The four terms on the right-hand side are projections onto, respectively, M⊥R×C, M(R×C)·(R+C), MR·∅ and MC·∅. It is always true (in balanced or unbalanced designs) that the first one is orthogonal to the other three, and the second is orthogonal to the last two. In the balanced case, the last two are also orthogonal (Lemma 3), so that the sums of squares add up:

‖y − y∅‖² = ‖y − yR×C‖² + ‖yR×C − yR+C‖² + ‖yR − y∅‖² + ‖yC − y∅‖²;
Total SS  =  Error SS  +  Interaction SS  +  Row SS  +  Column SS.       (7.51)

The balance also allows easy calculation of the sums of squares using the estimated effects (7.10):

‖y − yR×C‖²    = ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N (yijk − yij·)²
‖yR×C − yR+C‖² = ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N (yij· − (yi·· + y·j· − y···))² = N ∑_{i=1}^r ∑_{j=1}^c γij²
‖yR − y∅‖²     = ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N (yi·· − y···)² = Nc ∑_{i=1}^r αi²
‖yC − y∅‖²     = ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N (y·j· − y···)² = Nr ∑_{j=1}^c βj²
                                                                              (7.52)

Another benefit is that the sum of squares for testing row effects is the same whether or not the column effects are included in the model. Starting with the Row SS in the presence of column effects, we have

not the column effects are included in the model. Starting with the Row SS in the presenceof column effects, we have

‖yR+C

− yC‖2 = ‖(y

R+C− y∅) − (y

C− y∅)‖

2 (7.53)

= ‖(yR− y∅ + y

C− y∅) − (y

C− y∅)‖

2 (7.54)

= ‖yR− y∅‖

2, (7.55)

which is the Row SS without the column effects.The decomposition (7.51) leads to the expanded ANOVA table

Source         SS                                              df                MS          F
Rows           Nc ∑_{i=1}^r αi²                                r − 1             SSR/dfR     MSR/MSE
Columns        Nr ∑_{j=1}^c βj²                                c − 1             SSC/dfC     MSC/MSE
Interactions   N ∑_{i=1}^r ∑_{j=1}^c γij²                      (r − 1)(c − 1)    SSI/dfI     MSI/MSE
Error          ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N (yijk − yij·)²    n − rc            SSE/dfE     —
Total          ∑_{i=1}^r ∑_{j=1}^c ∑_{k=1}^N (yijk − y···)²    n − 1             —           —

It is then easy to perform the F tests.


7.3.1 Example

Below are data on 30 hyperactive boys, measuring "out-of-seat behavior." Each boy was given one of two types of therapy, Behavioral or Cognitive, and Ritalin in one of three dosages, Low, Medium, or High. The design is balanced, so that there are N = 5 boys receiving each therapy/dose combination.

Dose →              L                  M                  H
Behavioral    54 56 53 57 55     51 56 53 55 55     53 55 56 52 54
Cognitive     52 50 53 51 54     54 57 58 56 53     58 57 55 61 59
                                                                       (7.56)

[From http://espse.ed.psu.edu/statistics/Chapters/Chapter12/.]

The various means:

              yij·                         yi··
          55.000   54.000   54.000        54.333
          52.000   55.600   58.000        55.200
y·j·      53.500   54.800   56.000        y··· = 54.767
                                                           (7.57)

Plugging those numbers into (7.52) yields

Source               SS      df      MS        F
Therapy (Rows)      5.64      1     5.640    1.627
Dosage (Columns)   31.27      2    15.635    4.508
Interactions       63.26      2    31.630    9.123
Error              83.20     24     3.467      —
Total             183.37     29       —        —

Now F2,24,0.001 = 9.339, so that the interaction F is almost significant at the α = 0.001 level, and certainly at the 0.005 level, meaning there is interaction. Thus there are also row and column effects. The F1,24,0.05 = 4.260, which means it looks as though the row effects are not significant, but it really means the row effect averaged over columns is not significant. Consider the difference between the therapies for each dosage. The estimate of the difference for column j, µ1j − µ2j, is y1j· − y2j·, which has variance (2/5)σ²e, because it is a difference of two independent means of 5 observations. Hence se = √((2/5) × 3.467) = 1.178, and

µ11 − µ21 = 55 − 52 = 3.0,      t = 2.55
µ12 − µ22 = 54 − 55.6 = −1.6,   t = −1.36
µ13 − µ23 = 54 − 58 = −4.0,     t = −3.40
                                             (7.58)

Now we see that for low dosage, cognitive therapy has a significantly lower score than behavioral therapy, while for high dosage, the behavioral therapy has a significantly lower score than cognitive. (The difference is not significant for the medium dosage.) These differences are canceled when averaging, so it looks like therapy has no effect.

Another way to look at the data is to notice that for behavioral therapy, Ritalin dosage seems to have no effect, whereas for cognitive therapy, the scores increase with dosage.
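Since the raw data are listed in (7.56), the ANOVA table can be reproduced directly in R; a sketch:

# Data of (7.56): Behavioral then Cognitive; within each, doses L, M, H.
score   <- c(54,56,53,57,55, 51,56,53,55,55, 53,55,56,52,54,
             52,50,53,51,54, 54,57,58,56,53, 58,57,55,61,59)
therapy <- factor(rep(c("Behavioral", "Cognitive"), each = 15))
dose    <- factor(rep(rep(c("L", "M", "H"), each = 5), 2), levels = c("L", "M", "H"))
summary(aov(score ~ therapy * dose))
# Because the design is balanced, the sums of squares match the table above:
# therapy 5.64, dose 31.27, interaction 63.26, error 83.20.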


7.4 Balance is good

Balanced designs are preferable to unbalanced ones for a number of reasons:

1. Interpretation is easier. In balanced two-way designs, the row and column effects are not confounded. That is, the row effects are the same whether or not the column effects are in the model, and vice versa. In unbalanced designs, there is confounding. E.g., suppose these are the cell means and Nij's for a small ANOVA:

yij·                                       Nij
Fertilizer ↓; Soil →   Good     Bad        Fertilizer ↓; Soil →   Good    Bad
A                     115.3    52.7        A                       20      5
B                     114.0    52.8        B                        5     10

If yijk measures yield of corn, then you would expect a strong column effect, with the first column having better yield than the second. With those data, y·1· = 115.04 and y·2· = 52.77. Now if you ignored the columns, and just tested for row effects, you see that Fertilizer A is better than Fertilizer B: y1·· = 102.78 and y2·· = 73.2. Would that be because A is better, or because A happens to have more observations in the good soil? That question arises because quality of the soil and fertilizer are confounded. We can see that for each type of soil, A and B are about equal. If all Nij = 10, then there would not be such confounding, and the two fertilizers would look essentially equal: y1·· = 84 and y2·· = 83.4.

2. Computations are easier. With a balanced design, no matrix inversions are necessary. Everything is based on the various means.

3. The design is more efficient. By efficiency we mean that the variances of the estimates tend to be low for a given number of observations. That is, consider the Nij's above, where the total n = 40. Then the overall variance of the estimates of the cell means is

Var(y11·) + Var(y12·) + Var(y21·) + Var(y22·) = σ²e(1/20 + 1/5 + 1/5 + 1/10) = (0.55)σ²e.       (7.59)

If, on the other hand, all Nij = 10, which still gives n = 40,

Var(y11·) + Var(y12·) + Var(y21·) + Var(y22·) = σ²e(1/10 + 1/10 + 1/10 + 1/10) = (0.40)σ²e.       (7.60)

That is, the overall variance in the balanced case is 27% smaller than in the unbalanced case, with the same number of observations.

The above benefits are true in general in statistics: balance is good.


Chapter 8

Multiple comparisons

When you perform a hypothesis test, or find a confidence interval, you often try to control the chance of an error. In the testing situation, the Type I error is the chance you reject the null hypothesis when the null hypothesis is true, that is, it is the chance of a false positive. One often wants that chance to be no larger than a specified level α, e.g., 5%. With confidence intervals, you wish the chance that the parameter of interest is covered by the interval to be at least 1 − α, e.g., 95%.

Suppose there are several tests or confidence intervals being considered simultaneously. For example, in the leprosy example, one may want confidence intervals for γ1 = (α1 + α2)/2 − α3 and γ2 = α1 − α2. If each confidence interval has a 95% chance of covering its true value, what is the chance that both cover their true values simultaneously? It is somewhere between 90% and 95%. Thus we can at best guarantee that there is a 90% chance both are correct. The difference between 90% and 95% may not be a big deal, but consider more than two intervals, say 5, or 10, or 20. The chance that ten 95% intervals are all correct is bounded from below by 50%; for twenty, the bound is 0%!

To adjust the intervals so that the chance is 95% (or whatever is desired) that all are correct, one must widen them. Suppose there are J intervals, and αO is the overall error rate, that is, we want the chance that all intervals cover their parameters to be at least 1 − αO. Then instead of using γi ± tν,α/2 se(γi), the intervals would be

γi ± CαOse(γi), (8.1)

where the CαO is some constant that is larger than the tν,α/2. This constant must satisfy

P [γi ∈ (γi ± CαOse(γi)) , i = 1, . . . , J ] ≥ 1 − αO. (8.2)

There are many methods for choosing the CαO, some depending on the type of parameters being considered. We will consider three methods:

• Bonferroni, for J not too large;

• Tukey, for all pairwise comparisons, i.e., all µi − µj ’s;

• Scheffe, for all contrasts of group means (so that J = ∞).


8.1 Bonferroni Bounds

Bonferroni bounds are the simplest and most generally applicable, although they can be quite conservative. The idea is that if each interval individually has a chance of 1 − α of covering its parameter, then the chance that all J simultaneously cover their parameters is at least 1 − Jα. To see this fact, let Ai be the event that the ith interval is ok, and 1 − αi be its individual coverage probability. That is,

Ai is true iff γi ∈ (γi ± tν,αi/2se(γi)); P [Ai] = 1 − αi. (8.3)

Then if 1 − αO is the desired chance all intervals cover their parameters, we need that

P [A1 ∩ A2 ∩ · · · ∩ AJ ] ≥ 1 − αO. (8.4)

Looking at complements instead, we have that P [Aci ] = αi, and wish to have

P[Ac1 ∪ Ac2 ∪ · · · ∪ AcJ] ≤ αO.       (8.5)

The probability of a union is less than or equal to the sum of probabilities, hence

P[Ac1 ∪ Ac2 ∪ · · · ∪ AcJ] ≤ P[Ac1] + P[Ac2] + · · · + P[AcJ] = α1 + · · · + αJ.       (8.6)

So how can we choose the αi’s so that

α1 + · · ·+ αJ ≤ αO? (8.7)

Just take each αi = αO/J. (Of course, there are many other choices that will work, too.) In any case, we have proved the next proposition:

Proposition 19 If each of J confidence intervals has a coverage probability of 1 − αO/J, then the chance that all intervals cover their parameters is at least 1 − αO.

In the leprosy example, with the covariate, we have from (4.62) that the 95% confidence interval for (α1 + α2)/2 − α3 is

(−3.392 ± 2.056 × 1.641) = (−6.77,−0.02), (8.8)

where the t26,0.025 = 2.056. For α1 − α2, it is

(−1.09 ± 2.056 × 1.796) = (−4.78, 2.60). (8.9)

Now if we want simultaneous coverage to be 95%, we would take the individual αi = 0.05/2 = 0.025, so that the t in the intervals would be

t26,0.025/2 = t26,0.0125 = 2.379, (8.10)


which is a bit larger than the 2.056. The intervals are then

(α1 + α2)/2 − α3 :   (−3.392 ± 2.379 × 1.641) = (−7.30, 0.51);
α1 − α2 :            (−1.09 ± 2.379 × 1.796) = (−5.36, 3.18).
                                                                   (8.11)

Notice that these are wider, and in fact now the first one includes zero, suggesting that the first contrast is not significant. Recall the F test for the treatment effects around (5.51). We found the effect not significant, but the Drugs versus Placebo contrast was significant. Part of the disconnect was that the F test takes into account all possible differences. When we adjust for the fact that we are considering two contrasts, neither contrast is significant, agreeing with the F test.
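A small R sketch of the adjustment, using only the estimates and standard errors already given above:

# Bonferroni adjustment for the two leprosy contrasts.
est <- c(-3.392, -1.09)        # estimates of (a1+a2)/2 - a3 and a1 - a2
se  <- c(1.641, 1.796)
J   <- 2; alphaO <- 0.05; nu <- 26

t.single <- qt(1 - 0.05/2, nu)          # 2.056: individual 95% intervals
t.bonf   <- qt(1 - alphaO/(2 * J), nu)  # 2.379: simultaneous 95% intervals
cbind(lower = est - t.bonf * se, upper = est + t.bonf * se)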

The big advantage of the Bonferroni approach is that there are no special assumptions on the relationships between the intervals. They can be any type (t, z, etc.); they can be from the same experiment or different experiments; they can be on any combination of parameters. The main drawback is the conservativeness. That is, because there is a "≥" in (8.4), one is likely to be understating the overall coverage probability, maybe by a great deal if J is large. More accurate bounds in special cases lead to smaller intervals (which is good), without violating the coverage probability bound. The next two sections deal with such cases.

8.1.1 Testing

Testing works the same way. That is, if one has J hypothesis tests and wishes to have an overall Type I error rate of αO, then each individual test is performed at the αO/J level. This result can be shown using (8.6) again, where now Ai is the event that the ith test accepts the null hypothesis (so that Aci is that it rejects), and the probabilities are calculated assuming all null hypotheses are true.

To illustrate on the example above, in order to achieve an overall αO = 0.05, we use individual t-tests of level αO/2 = 0.025. That is, we reject the ith null hypothesis, H0 : γi = 0, if

|γi| / se(γi) ≥ t26,0.025/2 = 2.379.       (8.12)

The two statistics are |−3.392|/1.641 = 2.067 and |−1.09|/1.796 = 0.607, neither of which is significant.

8.2 Tukey’s method for comparing means pairwise

In the balanced one-way ANOVA, one may wish to make all possible pairwise comparisons between the means, that is, test all the null hypotheses H0 : µi = µj, or equivalently, the H0 : αi = αj's. Or, finding the confidence intervals for all the αi − αj's. In this case, we have that

αi − αj = yi· − yj·   and   se(αi − αj) = σ̂e√(2/N).       (8.13)


Assuming that there are g means to compare, we have J = (g choose 2) confidence intervals.

The goal here is to find the C so that

P[µi − µj ∈ (yi· − yj· ± CαO σ̂e√(2/N)) for all 1 ≤ i < j ≤ g] ≥ 1 − αO.       (8.14)

First,

µi − µj ∈ (yi· − yj· ± CαO σ̂e√(2/N))
    ⟺ |(yi· − yj·) − (µi − µj)| ≤ CαO σ̂e√(2/N)
    ⟺ |(yi· − yj·) − (µi − µj)| / (σ̂e/√N) ≤ CαO√2
    ⟺ |(yi· − µi) − (yj· − µj)| / (σ̂e/√N) ≤ CαO√2.       (8.15)

Next,

|(yi· − µi) − (yj· − µj)| / (σ̂e/√N) ≤ CαO√2 for all 1 ≤ i < j ≤ g   ⟺
    T ≡ [max1≤i≤g(yi· − µi) − min1≤i≤g(yi· − µi)] / (σ̂e/√N) ≤ CαO√2.       (8.16)

This equation follows from the fact that all the elements in a set are less than or equal to a given number if and only if the largest is. The distribution of this T is defined next.

Definition 21 Suppose Z1, . . . , Zg are independent N(0, 1)'s, and U ∼ χ²ν, where U is independent of the Zi's. Then

T = [max1≤i≤g{Zi} − min1≤i≤g{Zi}] / √(U/ν)       (8.17)

has the studentized range distribution, with parameters g and ν. It is denoted

T ∼ Tg,ν.       (8.18)

To apply this definition to T in (8.16), set Zi = √N(yi· − µi)/σe, so that they are independent N(0, 1)'s, and set U = νσ̂²e/σ²e, which is independent of the Zi's and distributed χ²ν, where here ν = n − g. Then from (8.16),

T = [max1≤i≤g √N(yi· − µi)/σe − min1≤i≤g √N(yi· − µi)/σe] / √((νσ̂²e/σ²e)/ν)
  = [max1≤i≤g{Zi} − min1≤i≤g{Zi}] / √(U/ν)
  ∼ Tg,ν.       (8.19)


Letting Tg,ν,αO be the upper αO cutoff point of the Tg,ν distribution, the confidence interval for the difference µi − µj is

yi· − yj· ± Tg,ν,αO σ̂e/√N,       (8.20)

and the chance that all intervals cover their differences is 1 − αO.

Note. Be careful about the √2. Comparing (8.14) to (8.20), we see that the constant CαO = Tg,ν,αO/√2, so that the intervals are not ±Tg,ν,αO × se, but ±Tg,ν,αO × se/√2.

Example. Look again at the leprosy data, without the covariates in this case. The g = 3, y1· = 5.3, y2· = 6.1, y3· = 12.3, σ̂²e = 36.86, ν = 27, and N = 10. Then T3,27,0.05 = 3.506. (Tables for the studentized range are not as easy to find as for the usual t's and F's, but there are some online, e.g., http://cse.niaes.affrc.go.jp/miwa/probcalc/s-range/, or you can use the functions ptukey and qtukey in R.)

The intervals are then

(yi· − yj· ± 3.506 × √(36.86/10)) = (yi· − yj· ± 6.731).       (8.21)

The intervals are in the "Tukey" column:

Means       Tukey                Bonferroni
µ1 − µ2     (−7.531, 5.931)      (−7.729, 6.129)
µ1 − µ3     (−13.731, −0.269)    (−13.929, −0.071)
µ2 − µ3     (−12.931, 0.531)     (−13.129, 0.729)

We see that the second interval does not contain 0, so that Drug A can be declared better than the placebo here. The other two intervals contain 0.

For comparison, we also present the Bonferroni intervals, where here J = (3 choose 2) = 3. They use t27,0.05/6 = 2.552, so that

(yi· − yj· ± 2.552 × √(2 × 36.86/10)) = (yi· − yj· ± 6.929).       (8.22)

These intervals are slightly wider than the Tukey intervals, because Bonferroni is conservative whereas Tukey is exact here.
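In R, the Tukey half-width can be sketched with qtukey, using the means and σ̂²e given above (groups 1, 2, 3 are Drug A, Drug D, and the placebo):

# Tukey intervals (8.21) for the leprosy group means.
ybar <- c(grp1 = 5.3, grp2 = 6.1, grp3 = 12.3)
s2e  <- 36.86; nu <- 27; N <- 10
Tcut <- qtukey(0.95, nmeans = 3, df = nu)   # 3.506
half <- Tcut * sqrt(s2e / N)                # 6.731; note there is no sqrt(2) here
outer(ybar, ybar, "-")                       # all pairwise differences yi. - yj.
half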

In the unbalanced case, it is hard to obtain exact intervals, but conservative Tukey-type intervals (which are less conservative than Bonferroni) use (Ni + Nj)/2 in place of N. That is, the intervals are

yi· − yj· ± Tg,ν,αO σ̂e/√((Ni + Nj)/2).       (8.23)


8.3 Scheffe’s method for all contrasts

What looks like a particularly difficult problem is to find a CαO so that the chance that confidence intervals for all possible contrasts contain their parameters is at least 1 − αO. Let α = (α1, . . . , αg)′, so that a contrast of the αi's is c′α, where c = (c1, . . . , cg)′ and c1 + · · · + cg = 0. Note that any contrast of the αi's is the same contrast of the µi's. The estimate of c′α is

c′α = c1y1· + · · · + cgyg·.       (8.24)

We wish to find the constant CαO so that

P [c′α ∈ (c′α ± CαOse(c′α)) for all contrasts c] ≥ 1 − αO. (8.25)

Because the variance of each mean is σ²e/Ni, and the means are independent, the estimated standard error is

se(c′α) = σ̂e√(c1²/N1 + · · · + cg²/Ng).       (8.26)

It turns out to be useful to express the contrast using the vector a for which the estimate c′α equals a′y and the parameter c′α equals a′µ, that is,

a = ( (c1/N1)1N1′, (c2/N2)1N2′, . . . , (cg/Ng)1Ng′ )′,       (8.27)

so that se(c′α) = σ̂e‖a‖. Notice that a ∈ MA, and

1′na = N1(c1/N1) + · · ·+ Ng(cg/Ng) = c1 + · · ·+ cg = 0, (8.28)

so that a ⊥ M∅, which means that a ∈ MA·∅. Note also the reverse, that is, for any a ∈ MA·∅ there is a corresponding c: if a = (a11′N1, . . . , ag1′Ng)′, then c = (N1a1, . . . , Ngag)′, which sums to 0 because a′1n = 0. Thus the probability inequality in (8.25) is the same as

P[a′µ ∈ (a′y ± CαO σ̂e‖a‖) for all a ∈ MA·∅] ≥ 1 − αO.       (8.29)

Similar to the reasoning in (8.15) and (8.16), we have that

a′µ ∈ (a′y ± CαO σ̂e‖a‖) for all a ∈ MA·∅
    ⟺ (a′(y − µ))² / (σ̂²e‖a‖²) ≤ C²αO for all a ∈ MA·∅
    ⟺ max over a ∈ MA·∅ of (a′(y − µ))² / (σ̂²e‖a‖²) ≤ C²αO.       (8.30)

Because a ∈ MA·∅, MA·∅a = a, hence

a′(y − µ) = a′MA·∅(y − µ) = a′(yA·∅ − MA·∅µ). (8.31)


The Cauchy-Schwarz Inequality says that (a′b)² ≤ ‖a‖²‖b‖², hence

(a′(yA·∅ − MA·∅µ))² / ‖a‖² ≤ ‖a‖²‖yA·∅ − MA·∅µ‖² / ‖a‖² = ‖yA·∅ − MA·∅µ‖².       (8.32)

Taking a = yA·∅ − MA·∅µ (which is in MA·∅), the inequality in (8.32) is an equality. Thus,

max over a ∈ MA·∅ of (a′(y − µ))²/‖a‖² = max over a ∈ MA·∅ of (a′(yA·∅ − MA·∅µ))²/‖a‖² = ‖yA·∅ − MA·∅µ‖².       (8.33)

Because yA·∅ − MA·∅µ ∼ Nn(0n, σ²eMA·∅), and trace(MA·∅) = g − 1,

‖yA·∅ − MA·∅µ‖² ∼ σ²eχ²g−1.       (8.34)

Now σ̂²e is independent of yA·∅, and is distributed (σ²e/(n − g))χ²n−g, so that

max over a ∈ MA·∅ of (a′(y − µ))²/(σ̂²e‖a‖²) = σ²eχ²g−1 / ((σ²e/(n − g))χ²n−g)
    = χ²g−1 / (χ²n−g/(n − g))
    = (g − 1) [χ²g−1/(g − 1)] / [χ²n−g/(n − g)]
    ∼ (g − 1)Fg−1,n−g.       (8.35)

Looking at (8.30) and (8.35), we see that

P[a′µ ∈ (a′y ± CαO σ̂e‖a‖) for all a ∈ MA·∅] = P[(g − 1)Fg−1,n−g ≤ C²αO],       (8.36)

which means that

CαO = √((g − 1)Fg−1,n−g,αO).       (8.37)

Example. Consider the four contrasts for the leprosy data: the Drugs vs. Placebo, plus the three pairwise comparisons. For αO = 0.05, F2,27,0.05 = 3.354, so that CαO = √(2 × 3.354) = 2.590. The table has the γ ± 2.590 × se(γ)'s:

Contrast             Estimate    se       Scheffe               Bonferroni
(α1 + α2)/2 − α3       −6.6     2.351    (−12.689, −0.511)     (−12.891, −0.309)
α1 − α2                −0.8     2.715    (−7.832, 6.232)       (−8.065, 6.465)
α1 − α3                −7.0     2.715    (−14.032, 0.032)      (−14.265, 0.265)
α2 − α3                −6.2     2.715    (−13.232, 0.832)      (−13.465, 1.065)


A nice feature of the Scheffe intervals is that we can add any number of contrasts toour list without widening the intervals. With Bonferroni, each additional contrast wouldwiden all of them. Of course, if one is only interested in a few contrasts, Scheffe can be quiteconservative, since it protects against all. Bonferroni is conservative, too, but may be more orless so. The following table gives an idea in this example (with αO = 0.05, g = 3, n−g = 27):

# of contrasts   Scheffe cutoff   Bonferroni cutoff
       1              2.59             2.052
       2              2.59             2.373
       3              2.59             2.552
       4              2.59             2.676
       5              2.59             2.771
      10              2.59             3.057
      20              2.59             3.333
     100              2.59             3.954

Between 3 and 4 the advantage switches from Bonferroni to Scheffe. With only three means, one is not likely to want to look at more than 5 or so contrasts, so either approach is reasonable here.
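The cutoffs in this comparison are easy to reproduce in R:

# Scheffe versus Bonferroni cutoffs (alphaO = 0.05, g = 3, n - g = 27).
J <- c(1, 2, 3, 4, 5, 10, 20, 100)
scheffe <- sqrt((3 - 1) * qf(0.95, 3 - 1, 27))   # 2.59, does not depend on J
bonf    <- qt(1 - 0.05 / (2 * J), 27)
round(cbind(J, scheffe, bonferroni = bonf), 3)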

8.3.1 Generalized Scheffe

The key to the derivation of the Scheffe bounds was the appearance of the vector space MA·∅. Similar bounds can be obtained for any set of linear combinations of the parameters, as long as one can find an associated vector space. The model is the general one, y = Xβ + e, e ∼ Nn(0n, σ²eIn). Suppose one is interested in the set of linear combinations

{λ′β | λ ∈ Λ}, (8.38)

where Λ is any set of p × 1 vectors (presuming β is p × 1). For each λ, let aλ be the vector so that a′λy is the least squares estimate of λ′β. Then define the vector space

MΛ = span{aλ | λ ∈ Λ}. (8.39)

Recall that the least squares estimates have aλ ∈ M (= C(X)), hence MΛ ⊂ M. The Scheffe intervals are then

λ′β ± √(q Fq,ν,αO) se(λ′β),   q = rank(MΛ),   ν = rank(M⊥) = n − p.       (8.40)

For example, in the two-way additive ANOVA model, if one wishes all contrasts on the row effects αi, then q = r − 1 and ν = n − r − c + 1.


For another example, consider the leprosy data with covariate, Section 3.4.2. We are again interested in all contrasts of the αi's, but with the covariate included in the model, the estimates are not based on just the group means. Now ν = 26, and σ̂²e = 16.05. The dimension of MΛ will be g − 1 = 2 again. To see this, note that any contrast vector (c1, c2, c3)′ has associated λ = (0, c1, c2, c3, 0)′, since β = (µ, α1, α2, α3, γ)′, which can be written as a linear combination of two special λ's,

(0, c1, c2, c3, 0)′ = c1(0, 1, 0,−1, 0)′ + c2(0, 0, 1,−1, 0)′, (8.41)

because c3 = −c1 − c2. Then

MΛ = span{a(0,1,0,−1,0)′ , a(0,0,1,−1,0)′}. (8.42)

Those two vectors are linearly independent, so rank(MΛ) = 2. The CαO = √(2F2,26,0.05) = 2.596. Just for a change, instead of confidence intervals, the next table has the t-statistics, where the contrast is significant if the |t| exceeds 2.596.

Contrast             Estimate      se      t = estimate/se
(α1 + α2)/2 − α3      −3.392     1.641        −2.067
α1 − α2               −0.109     1.796        −0.061
α1 − α3               −3.446     1.887        −1.826
α2 − α3               −3.337     1.854        −1.800

Now none of the contrasts is significant. Actually, we could have foreseen this fact, since the F test that the αi's are equal was not significant. For the Scheffe intervals, none are significant if and only if the F test is not significant.

8.4 Conclusion

Which of these methods to use, or whether to use other methods floating around, depends on the situation. Bonferroni is the most widely useful, and although it is conservative, it is reasonable as long as J is not too large. Tukey's is the best if you are comparing means pairwise. Scheffe is more general than Tukey, but is usually conservative because one rarely wants to look at all contrasts. Whether Bonferroni or Scheffe is less conservative depends on how many contrasts one is interested in. Fortunately, it is not too hard to do both, and pick the one with the smallest CαO.

There are other approaches to the multiple comparison problem than trying to control the overall error rate. One that has arisen recently is called the False Discovery Rate (FDR), which may be relevant when testing many (like 100) similar hypotheses. For example, one may be testing many different compounds as carcinogens, or testing many different genes to see if they are associated with some disease. In these cases, trying to control the chance of any false positives may mean one does not see any significant differences. Of course, you can use a fairly hefty αO, like 50%, or just not worry about the overall level, if you are willing to put up with a 5% rate of false positives.

The FDR approach is slightly different, wanting to control the number of false positives among the positives, that is,

FDR = E[ (# rejections when null is true) / (total # of rejections) ].       (8.43)

The idea is that if the FDR ≤ 0.05, then among all your rejections, approximately only 5% were falsely rejected. In the examples, you are willing to put up with false rejections, as long as the rejections are predominantly correct. So, e.g., if you are testing 100 compounds, and 30 are found suspect, then you are fairly confident that only 1 or 2 (30 × 0.05 = 1.5) have been falsely accused. There is some controversy about what you do with the ratio when there are 0 rejections, and people are still trying to find good ways to implement procedures to control FDR, but it is an interesting idea to be aware of.


Chapter 9

Random effects

So far, we have been concerned with fixed effects, even though that term has yet to be uttered. By fixed, we mean the actual levels in the rows or columns are of interest in themselves: the types of fruit trees, the treatments for leprosy, the genotypes of rats. We are interested in the actual Drug A or Drug D, or the Shmouti Orange type of tree. An example where an effect is not of direct interest is in HW #7:

Here are data from an experiment in which 3 people tested four formulations of hot dogs. Each person tested three of each formulation. The people then rated each hot dog from 0 to 14 on its texture, where 0 is soft, 14 is hard. (From course notes of J. W. Sutherland at http://www.me.mtu.edu/∼jwsuther/doe/)

Formulation →        A                B                 C                D
Person 1       7.6  6.5  7.2    11.4  7.6  9.5     6.3  7.9  6.8    2.7  3.1  1.7
Person 2       7.2 13.6 10.7    12.9 12.4 10.7    11.1  6.9  9.0    3.3  1.9  2.3
Person 3       7.0 10.2  8.3    10.2  8.1  8.7     6.8  9.2 11.0    3.7  2.2  3.2
                                                                          (9.1)

The formulations of hot dogs are fixed, because you can imagine being interested in those actual effects. On the other hand, the people effects are probably not of interest in themselves, but rather as representatives of a population. That is, someone in another part of the country might use the same formulations for their hot dogs, but is unlikely to use the same people. For modeling the data, the people would be considered a random sample from a population, so would be considered random effects.

Agricultural experiments often have random effects represented by plots of land. (Or, e.g., litters of pigs.) An experiment comparing three fertilizers may be conducted by splitting the experimental land into ten plots, and applying each fertilizer to a subplot of each plot. This design is a two-way ANOVA, with plots as rows and fertilizers as columns. The fertilizers are fixed effects, because the same fertilizers would be of interest elsewhere. The plots would be random effects, thought of as a sample of all possible farm plots, because the outside world does not really care about those particular plots.

One person's fixed effect might be another's random effect, so if the plots are yours, and you are going to use them over and over, you may very well consider the plots as fixed effects.


In practice, whether to consider an effect fixed or random depends on whether you wish to make inferences about the actual effects, or about the population which they represent.

The next several sections look at some models with random effects: one random effect, two random effects, or one random and one fixed effect.

9.1 One way random effects

Consider a population I of individuals. E.g., suppose I is the student population of a large midwestern university. Think of I as a list of the students' id numbers. Of interest is µ, the average lung capacity of the population. The typical approach would be to take a random sample, of twenty-five, say, from the population, measure the lung capacity of the sampled people, and average the measurements. There is likely to be measurement error, that is, the results for person I would not be exactly that person's lung capacity, but the capacity ± some error. To help better estimate each person's capacity, several, say five, independent measurements are made on each individual. The data would then look like a one-way ANOVA, with g = 25 groups (people) and N = 5 observations per group. Let µI be the true lung capacity of individual I, and yIj, j = 1, . . . , N, be the measurements for that individual, so that the model is

yIj = µI + eIj , j = 1, . . . , N. (9.2)

The "I" is capitalized to suggest that it is random, hence that µI is random. That is, as I runs over the population of id numbers, µI runs over the true lung capacities of the students. We will model the distribution by

µI ∼ N(µ, σ2A), (9.3)

in particular, µ is the average lung capacity of the population. (Because the population is finite, this assumption cannot hold exactly, of course. See Section 9.5.) The eIj's are then the measurement errors, which will be assumed to be independent N(0, σ²e), and independent of the µI's. Turning to effects, let αI = µI − µ, so that, with (9.2) and (9.3), we have the model

yIj = µ + αI + eIj,   αI ∼ N(0, σ²A),   eIj ∼ N(0, σ²e),       (9.4)

with the eIj's and αI's independent.

The data are assumed to arise by taking a sample I1, . . . , Ig from I, then for each of those individuals, taking N measurements, so that the data are

yIij = µ + αIi + eIij,   i = 1, . . . , g;   j = 1, . . . , N.       (9.5)

The double subscript is a little awkward, so we will use the notation

yIij = yij, eIij = eij , and αIi= Ai : yij = µ + Ai + eij . (9.6)

Replacing the α with A is supposed to emphasize its randomness.


Although the model looks like the usual (fixed effect) ANOVA model, the randomness of the Ai's moves the group effects from the mean to the covariance. That is, the yij's all have the same mean µ, but they are not all independent: y11 and y12 are not independent, because they share A1, but y11 and y21 are independent. To find the actual distribution, we write the model in the usual matrix formulation, but break out the 1n part:

y = µ1n + (x1, . . . , xg)(A1, A2, . . . , Ag)′ + e = µ1n + XGA + e.       (9.7)

By assumption on A,

A ∼ Ng(0g, σ2AIg), (9.8)

and is independent of e, hence

E[y] = µ1n,   Cov[y] = Cov[XGA] + Cov[e] = σ²AXGX′G + σ²eIn.       (9.9)

If g = 2 and N = 3, this covariance is explicitly

Cov[y] = σ²A [ 1 0       [ 1 0
               1 0         1 0
               1 0         1 0
               0 1         0 1
               0 1         0 1
               0 1 ]       0 1 ]′   + σ²e I6       (9.10)

       = [ σ²A+σ²e   σ²A       σ²A       0         0         0
           σ²A       σ²A+σ²e   σ²A       0         0         0
           σ²A       σ²A       σ²A+σ²e   0         0         0
           0         0         0         σ²A+σ²e   σ²A       σ²A
           0         0         0         σ²A       σ²A+σ²e   σ²A
           0         0         0         σ²A       σ²A       σ²A+σ²e ].       (9.11)

Thus the yij's all have the same variance, but the covariances depend on whether they are from the same group.

Before proceeding, we introduce Kronecker products, which make the subsequent calculations easier.

9.2 Kronecker products

The relevant matrices for balanced layouts, especially with many factors, can be handled more easily using the Kronecker product notation. Next is the definition.


Definition 22 If A is a p × q matrix and B is an n × m matrix, then the Kronecker product is the (np) × (mq) matrix A ⊗ B given by

A ⊗ B = [ a11B   a12B   · · ·   a1qB
          a21B   a22B   · · ·   a2qB
           ...    ...    . . .   ...
          ap1B   ap2B   · · ·   apqB ].       (9.12)

The XG matrix for the balanced one-way ANOVA can then be written as

XG = Ig ⊗ 1N.       (9.13)

E.g., with g = 3 and N = 2,

I3 ⊗ 12 = (each entry of I3 multiplied by the 2 × 1 vector 12) = [ 1 0 0
                                                                   1 0 0
                                                                   0 1 0
                                                                   0 1 0
                                                                   0 0 1
                                                                   0 0 1 ].       (9.14)

Kronecker products are nice because many operations can be performed componentwise. Here is a collection of some properties.

Proposition 20 Presuming the operations (addition, multiplication, trace, inverse) make sense,

(A ⊗ B)′ = A′ ⊗ B′ (9.15)

(A ⊗B)(C ⊗ D) = (AC) ⊗ (BD) (9.16)

(A ⊗B)−1 = A−1 ⊗ B−1 (9.17)

(A ⊗B) + (A ⊗ D) = A ⊗ (B + D) (9.18)

(A ⊗B) + (C ⊗ B) = (A + C) ⊗ B (9.19)

(A⊗ B) ⊗C = A ⊗ (B⊗ C) (9.20)

trace(A ⊗ B) = trace(A) × trace(B). (9.21)


These properties can be used to show that if X = X1 ⊗ X2, and the columns of these matrices are linearly independent, the projection matrix onto C(X) is

M = X(X′X)−1X′
  = (X1 ⊗ X2)((X1 ⊗ X2)′(X1 ⊗ X2))−1(X1 ⊗ X2)′
  = (X1 ⊗ X2)((X′1X1) ⊗ (X′2X2))−1(X′1 ⊗ X′2)
  = (X1 ⊗ X2)((X′1X1)−1 ⊗ (X′2X2)−1)(X′1 ⊗ X′2)
  = (X1(X′1X1)−1 ⊗ X2(X′2X2)−1)(X′1 ⊗ X′2)
  = (X1(X′1X1)−1X′1) ⊗ (X2(X′2X2)−1X′2).       (9.22)

That is, it decomposes into the Kronecker product of two little projection matrices.
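In R, Kronecker products are available as kronecker() (or the %x% operator), so identities like (9.22) and (9.13) can be checked numerically; a small sketch with g = 3 and N = 2:

# Kronecker products and the projection identity (9.22).
Ig   <- diag(3)
oneN <- matrix(1, 2, 1)
XG   <- Ig %x% oneN                      # same as kronecker(Ig, oneN), cf. (9.13)
proj <- function(X) X %*% solve(t(X) %*% X) %*% t(X)
all.equal(proj(XG), diag(3) %x% proj(oneN))   # TRUE: M_G = I_g (x) J_N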

9.3 One-way random effects: Estimation and testing

As in (9.13), the balanced one-way random effects ANOVA model is

y = µ1n + XGA + e = µ(1g ⊗ 1N) + (Ig ⊗ 1N)A + e.       (9.23)

The fixed effect version would have α in place of A. Writing 1n as a Kronecker product may seem unnecessary here, but it does help to make the notation consistent.

Inferences that would be of interest include estimating µ (and the standard error), σ²A and σ²e. The null hypothesis for testing for group effects in this case would be that all αI's in the population are equal to 0, not just the ones in the sample. That is, we would test whether the variance of the Ai's is zero:

H0 : σ²A = 0   versus   HA : σ²A > 0.       (9.24)

We start by finding the relevant projections and their distributions. By (9.22), the projection matrix for MG is

MG = Ig ⊗ (1N(1′N1N)−11′N) = Ig ⊗ ((1/N)1N1′N) = Ig ⊗ JN,       (9.25)

where we are defining

Jk = (1/k)1k1′k,       (9.26)

the k × k matrix consisting of all (1/k)'s. Note that it is the projection matrix for span{1k}. Similarly, we can write M∅, the projection matrix for span{1n}, as

M∅ = Jg ⊗ JN (9.27)

(which also equals Jn). A summary:

Constant:   1n = 1g ⊗ 1N     M∅ = Jg ⊗ JN
Groups:     XG = Ig ⊗ 1N     MG = Ig ⊗ JN
                                              (9.28)


Then we have

MG·∅ = MG −M∅ = (Ig ⊗ JN) − (Jg ⊗ JN ) = (Ig − Jg) ⊗ JN = Hg ⊗ JN , (9.29)

where now we are defining

Hk = Ik − Jk.       (9.30)

This Hk is the projection matrix for span{1k}⊥, that is, it subtracts the mean from each element of a vector. Note that because they are projections on orthogonal spaces (or by easy calculation),

JkHk = HkJk = 0. (9.31)

Also,

In − MG = (Ig ⊗ IN) − (Ig ⊗ JN) = Ig ⊗ (IN − JN) = Ig ⊗ HN.       (9.32)

The mean and covariance of y in (9.23) can be written

E[y] = 1nµ, (9.33)

and

Cov[y] = σ²A(Ig ⊗ 1N)(Ig ⊗ 1N)′ + σ²eIn
       = σ²A(Ig ⊗ 1N1′N) + σ²eIn
       = Nσ²A(Ig ⊗ JN) + σ²eIn
       = Nσ²A MG + σ²eIn.       (9.34)

We calculate the same projections as for the fixed-effect case, but they have different distributions than before. In particular, the projections onto MG·∅ and M⊥G are

yG·∅ = MG·∅ y   and   y − yG = (In − MG)y.       (9.35)

The means of both projections are 0n, because 1n is orthogonal to the two relevant spaces. The covariances can be found using (9.34):

Cov[yG·∅] = Nσ²A MG·∅MGMG·∅ + σ²e MG·∅ = (Nσ²A + σ²e)MG·∅.       (9.36)

We are using the fact that MG·∅MG = MG·∅, because if M1 ⊂ M2, then M1M2 = M2M1 = M1. Next,

Cov[y − yG] = Nσ²A(In − MG)MG(In − MG) + σ²e(In − MG) = σ²e(In − MG).       (9.37)

Finally, independence follows because

Cov[yG·∅, y − yG] = Nσ²A(In − MG)MGMG·∅ + σ²e(In − MG)MG·∅ = 0 + 0 = 0.       (9.38)

Thus we have that ‖yG·∅‖² and ‖y − yG‖² are independent, and (because trace(MG·∅) = g − 1 and trace(In − MG) = n − g),

‖yG·∅‖² ∼ (Nσ²A + σ²e)χ²g−1   and   ‖y − yG‖² ∼ σ²eχ²n−g.       (9.39)


The expected mean squares are then easy to obtain:

E[‖yG·∅‖²/(g − 1)] = Nσ²A + σ²e   and   E[‖y − yG‖²/(n − g)] = σ²e.       (9.40)

Now we can write out the ANOVA table, where we put in a column for expected mean squares instead of sample mean squares:

Source           Sum of squares   Degrees of freedom   E[Mean square]   F
Groups (MG·∅)    ‖yG·∅‖²          g − 1                Nσ²A + σ²e       MSG·∅/MSE
Error (M⊥G)      ‖y − yG‖²        n − g                σ²e              —
Total (M⊥∅)      ‖y − y∅‖²        n − 1                —                —

Notice that when σ²A = 0, then the two expected mean squares are equal. That is, one can use the F statistic to test H0 : σ²A = 0. Is the F ∼ Fg−1,n−g under the null hypothesis? Yes; just as in the fixed effect case, from (9.39), the numerator and denominator are independent χ²'s divided by their degrees of freedom (the σ²e's cancel).

e ’s cancel).Also, σ2

e = MSE as before, and we can find an unbiased estimate of σ2A:

E[MSG·∅ − MSE] = Nσ2A, hence E

[MSG·∅ − MSE

N

]= σ2

A. (9.41)

Turning to estimating µ, the least squares estimate is µ = y··. Note that µ = (1/n)1′ny, so that

Var[µ] = (1/n)1′n Cov[y] (1/n)1n
       = (1/n²)(Nσ²A 1′nMG1n + σ²e 1′n1n)
       = (1/n²)(N²gσ²A + nσ²e)
       = (1/g)σ²A + (1/n)σ²e,       (9.42)

because n = Ng. [A direct way to calculate the variance is to note that y·· = µ + A· + e··, where A· and e·· are independent means of g and n iid elements, respectively.] To find the estimated standard error, note that MSG·∅ estimates n Var[µ], so that

se(µ) = √(MSG·∅/n).       (9.43)

Example. This is Example 10.13.1 from Snedecor and Cochran, Statistical Methods. Two boars from each of four litters of pigs were studied, and each one's average weight gain (until they became 225 pounds or so) was recorded. So the groups are the litters, g = 4, and N = 2. The data:


Litter →       1             2             3             4
yij →     1.18, 1.11    1.36, 1.65    1.37, 1.40    1.07, 0.90

The ANOVA table:

             SS      df      MS        F
Litters    0.3288     3    0.1096   7.3805
Error      0.0594     4    0.0148     —
Total      0.3882     7      —        —

The F3,4,0.05 = 6.591, which suggests that σ²A > 0, that is, there are differences between the litters. The y·· = 1.255, and se = √(0.1096/8) = 0.1170. Because the variance estimate in the standard error is based on the litter mean square, the degrees of freedom are 3, so the 95% confidence interval is

(y·· ± t3,0.025 se) = (1.255 ± 3.182 × 0.1170) = (0.883, 1.627).       (9.44)

Warning. One might be tempted to treat the 8 observations as independent, so that we have just a regular sample of n = 8, with se = s/√8 = 0.2355/√8 = 0.0833, where s is the sample standard deviation of the observations. This standard error is somewhat less than the correct one of 0.1170. The confidence interval for the mean would be

(y ± tn−1,0.025se) = (1.255 ± 2.365 × 0.0833) = (1.058, 1.452), (9.45)

which is just a bit more than half (53%) the width of the correct one. Thus treating these observations as independent would be cheating.

What is correct is to look at the group means y1·, . . . , y4· as a sample of 4 independentobservations (which they are). Then se = s∗/

√4, where s∗ = 0.2341 is the sample standard

deviation of the means. Indeed, this se = 0.1170, which is correct.

9.4 Two-way random effects ANOVA

Return to the lung capacity example, and imagine that in addition to choosing a number of individuals, there are a number of doctors administering the lung capacity measurements, and that each doctor tests each individual N times. Then for individual I and doctor J (not Julius Erving), yIJk, k = 1, . . . , N, represent the N measurements. We assume there are r individuals chosen as a simple random sample from the population I and c doctors chosen as a simple random sample from the population J, and the individuals and doctors are chosen independently. Letting µIJ be the true lung capacity of individual I measured by doctor J, the model is

yIJk = µIJ + eIJk. (9.46)


The average for person I, averaged over all doctors, is

µI· = EJ [µIJ ], (9.47)

where “EJ” indicates taking expected value with J ∈ J as the random variable. Similarly, the average for doctor J, averaged over all I ∈ I, is

µ·J = EI[µIJ]. (9.48)

Both µI· and µ·J are random. The overall mean µ is the average over all individuals and doctors:

µ = EI,J[µIJ] = EI[µI·] = EJ[µ·J]. (9.49)

The effects are then defined as for the fixed effect model, except that they are random:

Row effect:         αI = µI· − µ
Column effect:      βJ = µ·J − µ
Interaction effect: γIJ = µIJ − µI· − µ·J + µ   (9.50)

From (9.47), (9.48) and (9.49), one can see that

EI [αI ] = 0 = EJ [βJ ] = EI [γIJ ] = EJ [γIJ ], (9.51)

which are the analogs of the usual constraints in the fixed effect model. We denote the population variances of these effects by

VarI[αI] = σ²A, VarJ[βJ] = σ²B, VarI,J[γIJ] = σ²C. (9.52)

Then using (9.51) in (9.46), the model is

yIJk = µ + αI + βJ + γIJ + eIJk, k = 1, . . . , N. (9.53)

The actual data are based on taking simple random samples of the I's and J's, that is, I1, . . . , Ir are the individuals, and J1, . . . , Jc are the doctors, so that the data are

yIiJjk = µ + αIi + βJj + γIiJj + eIiJjk, i = 1, . . . , r; j = 1, . . . , c; k = 1, . . . , N. (9.54)

As in the one-way case, there are too many double subscripts, so we change notation to

yijk = µ + Ai + Bj + Cij + eijk, i = 1, . . . , r; j = 1, . . . , c; k = 1, . . . , N, (9.55)

so that Ai = αIi, Bj = βJj, and Cij = γIiJj.

The effects are then modeled by assuming that they are all independent and normal, which can be written as

A ∼ Nr(0r, σ²A Ir), B ∼ Nc(0c, σ²B Ic), C ∼ Nrc(0rc, σ²C Irc), A, B, and C independent. (9.56)


A justification for the uncorrelatedness of the effects is found in Section 9.5. Finally, the model written in matrix form is

y = µ1n + XRA + XCB + XR×CC + e, (9.57)

where in addition to (9.56), the e ∼ Nn(0n, σ²e In) is independent of the A, B and C. Also, XR is the n × r matrix with vectors indicating the rows, XC is the n × c matrix with vectors indicating the columns, and XR×C is the n × (rc) matrix with vectors indicating the cells.

Now the distribution of y is multivariate normal with mean µ1n and

Cov[y] = σ²A XRX′R + σ²B XCX′C + σ²C XR×CX′R×C + σ²e In. (9.58)

Inferences that are of interest include estimating µ and the variances, and testing for interactions (σ²C = 0) or for row or column effects (σ²A = 0 or σ²B = 0). We will approach the task by finding the usual sums of squares and their expected mean squares. Because we have a balanced situation, the matrices can be written using Kronecker products, which helps in finding the E[MS]'s. The matrices are in the left half:

Constant: 1n = 1r ⊗ 1c ⊗ 1N M∅ = Jr ⊗ Jc ⊗ JN

Rows: XR = Ir ⊗ 1c ⊗ 1N MR = Ir ⊗ Jc ⊗ JN

Columns: XC = 1r ⊗ Ic ⊗ 1N MC = Jr ⊗ Ic ⊗ JN

Rows × Columns: XR×C = Ir ⊗ Ic ⊗ 1N MR×C = Ir ⊗ Ic ⊗ JN .

(9.59)

The projection matrices are then found by finding the component projection matrices, e.g.,

MR = (Ir(I′rIr)⁻¹I′r) ⊗ (1c(1′c1c)⁻¹1′c) ⊗ (1N(1′N1N)⁻¹1′N) = Ir ⊗ Jc ⊗ JN. (9.60)

From these, we can find the projection matrices on the orthogonal subspaces, starting with

MR·∅ = MR − M∅ = (Ir ⊗ Jc ⊗ JN) − (Jr ⊗ Jc ⊗ JN) = (Ir − Jr) ⊗ (Jc ⊗ JN) = Hr ⊗ Jc ⊗ JN, (9.61)

and

MC·∅ = MC − M∅ = (Jr ⊗ Ic ⊗ JN) − (Jr ⊗ Jc ⊗ JN) = Jr ⊗ (Ic − Jc) ⊗ JN = Jr ⊗ Hc ⊗ JN. (9.62)

Note that

MR·∅MC·∅ = (Hr ⊗ Jc ⊗ JN)(Jr ⊗ Hc ⊗ JN) = (HrJr) ⊗ (JcHc) ⊗ JN = 0 ⊗ 0 ⊗ JN = 0, (9.63)

so that MR·∅ ⊥ MC·∅, as we know. We also know that M(R+C)·∅ = MR·∅ + MC·∅, so that

M(R×C)·(R+C) = M(R×C)·∅ − M(R+C)·∅
            = M(R×C)·∅ − MR·∅ − MC·∅
            = Ir ⊗ Ic ⊗ JN − Jr ⊗ Jc ⊗ JN − Hr ⊗ Jc ⊗ JN − Jr ⊗ Hc ⊗ JN
            = Ir ⊗ Ic ⊗ JN − (Jr + Hr) ⊗ Jc ⊗ JN − Jr ⊗ Hc ⊗ JN
            = Ir ⊗ Ic ⊗ JN − Ir ⊗ Jc ⊗ JN − Jr ⊗ Hc ⊗ JN
            = Ir ⊗ Hc ⊗ JN − Jr ⊗ Hc ⊗ JN
            = Hr ⊗ Hc ⊗ JN. (9.64)


Finally,

In − MR×C = Ir ⊗ Ic ⊗ IN − Ir ⊗ Ic ⊗ JN = Ir ⊗ Ic ⊗ HN. (9.65)

We can summarize these orthogonal projections:

Constant M∅ : M∅ = Jr ⊗ Jc ⊗ JN

Row effects MR·∅ : MR·∅ = Hr ⊗ Jc ⊗ JN

Column effects MC·∅ : MC·∅ = Jr ⊗Hc ⊗ JN

Interactions M(R×C)·(R+C) : M(R×C)·(R+C) = Hr ⊗ Hc ⊗ JN

Error M⊥R×C : In − MR×C = Ir ⊗ Ic ⊗ HN .

(9.66)
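These Kronecker forms are easy to check numerically. Here is a small sketch in Python/numpy (the sizes r = 3, c = 2, N = 2 are my own illustrative choice):

import numpy as np
from functools import reduce

def J(k):  # projection onto span{1_k}
    return np.ones((k, k)) / k

def H(k):  # projection onto the orthogonal complement of 1_k
    return np.eye(k) - J(k)

def kron(*mats):
    return reduce(np.kron, mats)

r, c, N = 3, 2, 2
projections = {
    "constant":     kron(J(r), J(c), J(N)),             # M_0
    "rows":         kron(H(r), J(c), J(N)),             # M_{R.0}
    "columns":      kron(J(r), H(c), J(N)),             # M_{C.0}
    "interactions": kron(H(r), H(c), J(N)),             # M_{(RxC).(R+C)}
    "error":        kron(np.eye(r), np.eye(c), H(N)),   # I_n - M_{RxC}
}

# Each matrix is idempotent, they are mutually orthogonal, and they sum to I_n.
for M in projections.values():
    assert np.allclose(M @ M, M)
assert np.allclose(sum(projections.values()), np.eye(r * c * N))
print({name: int(round(np.trace(M))) for name, M in projections.items()})
# traces (degrees of freedom): constant 1, rows r-1, columns c-1,
# interactions (r-1)(c-1), error rc(N-1)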

Now for the sums of squares. The idea is to find the distribution of the projections of y on each of the spaces in (9.66). They are straightforward to obtain using the Kronecker products, although it may get a bit tedious. The covariance in (9.58) has the XX′ matrices, which can also be written more simply:

XRX′R = Ir ⊗ (1c1′c) ⊗ (1N1′N) = Ir ⊗ (cJc) ⊗ (NJN) = (cN)MR,
XCX′C = (1r1′r) ⊗ Ic ⊗ (1N1′N) = rJr ⊗ Ic ⊗ (NJN) = (rN)MC,
XR×CX′R×C = Ir ⊗ Ic ⊗ (1N1′N) = Ir ⊗ Ic ⊗ (NJN) = NMR×C. (9.67)

Then

Cov[y] = cNσ²A MR + rNσ²B MC + Nσ²C MR×C + σ²e In. (9.68)

Start with the constant term, that is, y∅ = M∅y. The mean is

E[y∅] = M∅(µ1n) = µJn1n = µ1n, (9.69)

and

Cov[y∅] = M∅ Cov[y] M∅
        = M∅(cNσ²A MR + rNσ²B MC + Nσ²C MR×C + σ²e In)M∅
        = cNσ²A M∅ + rNσ²B M∅ + Nσ²C M∅ + σ²e M∅
        = (cNσ²A + rNσ²B + Nσ²C + σ²e)M∅. (9.70)

The projections on the other spaces have mean zero, so we need to worry about just the covariances. For the projection onto MR·∅, we have

Cov[yR·∅] = MR·∅(cNσ²A MR + rNσ²B MC + Nσ²C MR×C + σ²e In)MR·∅. (9.71)

We know that MR·∅MR = MR·∅ and MR·∅MR×C = MR·∅ by the subset property. Also,

MR·∅MC = (Hr ⊗ Jc ⊗ JN)(Jr ⊗ Ic ⊗ JN) = 0, (9.72)

because HrJr = 0. Thus

Cov[yR·∅] = (cNσ²A + Nσ²C + σ²e)MR·∅. (9.73)


Similarly,

Cov[yC·∅] = (rNσ²B + Nσ²C + σ²e)MC·∅. (9.74)

For y(R×C)·(R+C), we have that M(R×C)·(R+C)MR = M(R×C)·(R+C)MC = 0 because MR and MC are orthogonal to M(R×C)·(R+C). Thus

Cov[y(R×C)·(R+C)] = (Nσ²C + σ²e)M(R×C)·(R+C). (9.75)

Finally, because In − MR×C is orthogonal to all the other projection matrices,

Cov[y − yR×C] = σ²e(In − MR×C), (9.76)

as usual. These projections can also be seen to be independent, e.g.,

Cov[yR·∅, yC·∅] = MR·∅(cNσ²A MR + rNσ²B MC + Nσ²C MR×C + σ²e In)MC·∅
               = (cNσ²A MR·∅ + Nσ²C MR·∅ + σ²e MR·∅)MC·∅
               = 0, (9.77)

because MR·∅MC·∅ = 0. The sums of squares of these projections (except onto M∅) are then chi-squared with degrees of freedom being the trace of the projection matrices, and constant out front being the σ²-part of the covariances. Thus the ANOVA table is

Source                        Sum of squares    Degrees of freedom   E[Mean square]
Rows (MR·∅)                   ‖yR·∅‖²           r − 1                cNσ²A + Nσ²C + σ²e
Columns (MC·∅)                ‖yC·∅‖²           c − 1                rNσ²B + Nσ²C + σ²e
Interactions (M(R×C)·(R+C))   ‖y(R×C)·(R+C)‖²   (r − 1)(c − 1)       Nσ²C + σ²e
Error (M⊥R×C)                 ‖y − yR×C‖²       n − rc               σ²e
Total (M⊥∅)                   ‖y − y∅‖²         n − 1                —

9.4.1 Estimation and testing

Unbiased estimates of the variance components are immediately obtainable from the expected mean square column of the ANOVA table:

σ̂²A = (MSRows − MSInteractions)/(cN)
σ̂²B = (MSColumns − MSInteractions)/(rN)
σ̂²C = (MSInteractions − MSE)/N
σ̂²e = MSE. (9.78)
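In code, the estimates in (9.78) are just arithmetic on the mean squares. A minimal sketch in Python (the function name and its arguments are my own illustration, not from the text):

def variance_components(ms_rows, ms_cols, ms_inter, mse, r, c, N):
    """Unbiased estimates of the variance components in the balanced
    two-way random effects ANOVA, following (9.78)."""
    return {
        "sigma2_A": (ms_rows - ms_inter) / (c * N),
        "sigma2_B": (ms_cols - ms_inter) / (r * N),
        "sigma2_C": (ms_inter - mse) / N,
        "sigma2_e": mse,
    }

# Note: the differences can come out negative in small samples, even though
# the variance components themselves are nonnegative.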


For µ̂ = y···, note that because y∅ = µ̂1n, (9.70) shows that Var[µ̂] is the upper-right (or any other) element of Cov[y∅], so that

Var[µ̂] = (cNσ²A + rNσ²B + Nσ²C + σ²e)/n = (1/r)σ²A + (1/c)σ²B + (1/(rc))σ²C + (1/n)σ²e. (9.79)

The estimated standard error would then be

se(µ̂) = √((MSRows + MSColumns − MSInteractions)/n). (9.80)

For testing, use the expected mean squares column to decide what ratio to take for the F test. For example, to test for average (over the population) row effects, i.e.,

H0 : σ²A = 0 versus HA : σ²A > 0, (9.81)

the F = MSRows/MSInteractions, because under the null hypothesis both mean squares have the same constant Nσ²C + σ²e. The degrees of freedom for the F are then r − 1 and (r − 1)(c − 1). Generally, testing for row effects when there are interaction effects present is uninteresting, so this particular test has little application.

It is of interest to test for interaction,

H0 : σ²C = 0 versus HA : σ²C > 0, (9.82)

which would use F = MSInteractions/MSE, just as in the fixed effect case, with degrees of freedom (r − 1)(c − 1) and n − rc.

If one assumes there are no interactions, then σ²C = 0, so that the ANOVA table simplifies to

Source           Sum of squares   Degrees of freedom   E[Mean square]   F
Rows (MR·∅)      ‖yR·∅‖²          r − 1                cNσ²A + σ²e      MSRows/MSE
Columns (MC·∅)   ‖yC·∅‖²          c − 1                rNσ²B + σ²e      MSColumns/MSE
Error (M⊥R+C)    ‖y − yR+C‖²      n − r − c + 1        σ²e              —
Total (M⊥∅)      ‖y − y∅‖²        n − 1                —                —

The F tests for row and column effects are then as for the fixed effect case.


9.5 The distribution of the effects

9.5.1 One-way ANOVA

Starting with the one-way random effects ANOVA, the justification of the distributional assumptions, at least as far as the means and covariances go, is based on I1, . . . , Ig being a simple random sample from the population I, where I is very large relative to g. The population mean and variance of the µI are then defined by the usual finite-population values,

µ = EI[µI] = (Σi∈I µi)/NI and σ²A = VarI[µI] = (Σi∈I (µi − µ)²)/NI. (9.83)

(NI is the number of individuals in the population.) It then follows that the effects αI = µI − µ have

EI[αI] = 0 and VarI[αI] = σ²A. (9.84)

Because we are sampling without replacement, Ii and Ij are not independent, hence the αI1, . . . , αIg are not independent. The covariances between pairs are equal, so it is enough to find the covariance between αI1 and αI2. Because the means are zero,

Cov[αI1, αI2] = (Σi≠j αiαj)/(NI(NI − 1)). (9.85)

Because the sum of the αi's over the population is zero,

0 = (Σi∈I αi)² = Σi∈I α²i + Σi≠j αiαj = NIσ²A + Σi≠j αiαj. (9.86)

Then

Cov[αI1, αI2] = −σ²A/(NI − 1). (9.87)

At this point we could stop and say, reasonably, that if NI is very large, the correlation, −1/(NI − 1), is negligible, so that assuming independence should be fine. But to be more precise, for A = (αI1, . . . , αIg)′ as in (9.8), we have to modify the covariance matrix to account for the correlations, so that

Cov[A] = σ²A ((NI/(NI − 1))Ig − (1/(NI − 1))1g1′g). (9.88)

Then

Cov[XGA] = σ²A ((NI/(NI − 1))XGX′G − (1/(NI − 1))XG1g1′gX′G). (9.89)

Now XG1g = (Ig ⊗ 1N)(1g ⊗ 1) = 1g ⊗ 1N, so that XG1g1′gX′G = gN(Jg ⊗ JN) = nM∅, hence

Cov[XGA] = σ²A ((NI/(NI − 1))NMG − (n/(NI − 1))M∅). (9.90)


For y, we then have that

Cov[y] = σ²A ((NI/(NI − 1))NMG − (n/(NI − 1))M∅) + σ²e In. (9.91)

The covariances of the projections onto MG·∅ and M⊥G are

Cov[yG·∅] = MG·∅(σ²A ((NI/(NI − 1))NMG − (n/(NI − 1))M∅) + σ²e In)MG·∅
          = (Nσ²A NI/(NI − 1) + σ²e)MG·∅ (9.92)

and

Cov[y − yG] = (In − MG)(σ²A ((NI/(NI − 1))NMG − (n/(NI − 1))M∅) + σ²e In)(In − MG)
            = σ²e(In − MG). (9.93)

These two projections can also be shown to be independent. The ANOVA table now is

Source          Sum of squares   Degrees of freedom   E[Mean square]           F
Groups (MG·∅)   ‖yG·∅‖²          g − 1                Nσ²A NI/(NI − 1) + σ²e   MSG·∅/MSE
Error (M⊥G)     ‖y − yG‖²        n − g                σ²e                      —
Total (M⊥∅)     ‖y − y∅‖²        n − 1                —                        —

Comparing this table to that in Section 9.1, we see that the only difference is the factor NI/(NI − 1), which is practically 1. Of course, typically one does not really know NI, so ignoring the factor is reasonable. But note that the F test for testing σ²A = 0 is exactly the same as before.


Chapter 10

Mixed Models

A mixed model is one with some fixed effects and some random effects. The most basic is the randomized block design, exemplified by the example on hot dogs in (9.1), where the people are random “blocks”, and the hot dog formulations are the fixed effects. Mixed models can be described in general by the model

y = XRandomA + XFixedβ + e, (10.1)

where the X's are fixed design matrices, A is a vector (a × 1) of random effects, and β is a vector of fixed effects. All the models we have considered so far are of this form, mostly with no random effects, and in Chapter 9, with random effects plus the simple fixed effect matrix XFixed = 1n.

The distributional assumption on A is key, and can make the analysis of the model quite challenging. Often one assumes A ∼ Na(0a, ΣA), where ΣA may have a simple form, or may not. With the assumption that e is independent of A, and e ∼ Nn(0n, σ²e In), the distribution of y is

y ∼ Nn(XFixedβ, XRandomΣAX′Random + σ²e In). (10.2)

We will not be dealing with the general case, but present the two-way balanced mixed model, which is fairly straightforward to analyze. More complex models need complex algorithms, found in packages like SPSS and SAS. The next section looks at the balanced randomized block case, without interactions. Section 10.2 adds potential interactions. In each case, we find the ANOVA table and E[MS]'s.

10.1 Randomized blocks

As in the hot dog example, the rows are randomly chosen from a large population I, and the columns represent c fixed treatments. The parameter µIj is the average for individual I and column treatment j, e.g., the average rating person I would give to hot dog formulation j. The column means are found by averaging over the entire population,

µ·j = EI[µIj]. (10.3)


It is a fixed parameter, because formulation j is a fixed treatment. The overall average and column effects are then defined as usual:

µ = (µ·1 + · · · + µ·c)/c, βj = µ·j − µ, (10.4)

so that β1 + · · · + βc = 0.

For the row effects, we look at the

α∗Ij = µIj − µ·j = µIj − µ − βj (10.5)

for each j, it being the Ith individual's mean for treatment j relative to the population. Note that EI[α∗Ij] = 0. If these values are different for different j's, then there is interaction. It is certainly possible there is interaction, e.g., some people may tend to rate soft hot dogs higher than average, and hard hot dogs lower than average. No interaction means each person is consistent over formulations, that is, is the same amount above average (or below average) for each j. Calling that amount αI, we then have α∗Ij = αI for each j, so that

µIj = µ + αI + βj, (10.6)

which is of course the usual additive model.

In the balanced design, each individual has N independent measurements for each treatment, so that the model is

yIjk = µ + αI + βj + eIjk. (10.7)

Letting I1, . . . , Ir be the sampled individuals, the data are

yIijk = µ + αIi + βj + eIijk, i = 1, . . . , r; j = 1, . . . , c; k = 1, . . . , N. (10.8)

Rewriting, with Ai = αIi, we have

yijk = µ + Ai + βj + eijk, i = 1, . . . , r; j = 1, . . . , c; k = 1, . . . , N, (10.9)

or with matrices,

y = µ1n + XRA + XCβ + e. (10.10)

We model the Ai's here the same as for the one-way random effects model, so that

A ∼ Nr(0r, σ²A Ir), e ∼ Nn(0n, σ²e In), (10.11)

and A and e are independent. The mean and covariance of y are

E[y] = µ1n + XCβ, Cov[y] = σ²A XRX′R + σ²e In, (10.12)

or, using Kronecker products,

E[y] = (1r ⊗ 1c ⊗ 1N)µ + (1r ⊗ Ic ⊗ 1N)β,
Cov[y] = cNσ²A MR + σ²e In, (10.13)


as in (9.67). We obtain the usual projections, onto MR·∅, MC·∅, and M⊥R+C. (The error is from the additive model.) For the means, we have

E[yR·∅] = MR·∅((1r ⊗ 1c ⊗ 1N)µ + (1r ⊗ Ic ⊗ 1N)β)
        = (Hr ⊗ Jc ⊗ JN)(1r ⊗ 1c ⊗ 1N)µ + (Hr ⊗ Jc ⊗ JN)(1r ⊗ Ic ⊗ 1N)β
        = 0, (10.14)

because Hr1r = 0r. Similarly,

E[y − yR+C] = 0. (10.15)

The other one is not zero,

E[yC·∅] = MC·∅((1r ⊗ 1c ⊗ 1N)µ + (1r ⊗ Ic ⊗ 1N)β)
        = (Jr ⊗ Hc ⊗ JN)(1r ⊗ 1c ⊗ 1N)µ + (Jr ⊗ Hc ⊗ JN)(1r ⊗ Ic ⊗ 1N)β
        = (1r ⊗ Hc ⊗ 1N)β. (10.16)

Then

‖E[yC·∅]‖² = β′(1r ⊗ Hc ⊗ 1N)′(1r ⊗ Hc ⊗ 1N)β
           = β′(1′r1r ⊗ HcHc ⊗ 1′N1N)β
           = rNβ′Hcβ
           = rN‖β‖². (10.17)

The last step follows from Hcβ = β because the βj's sum to 0. The covariances are M(cNσ²A MR + σ²e In)M for the various projection matrices M. The MC·∅MR = 0, MR·∅MR = MR·∅, and (In − MR+C)MR = 0. One can also show that the three projections are independent. Putting these together, we have

Projection   ‖Mean‖²   Covariance             E[MS]
yR·∅         0         (cNσ²A + σ²e)MR·∅      cNσ²A + σ²e
yC·∅         rN‖β‖²    σ²e MC·∅               rN‖β‖²/(c − 1) + σ²e
y − yR+C     0         σ²e(In − MR+C)         σ²e
(10.18)

The expected mean squares are found by dividing the ‖Mean‖² by the degrees of freedom, then adding that to the σ² part of the covariance. The ANOVA table is then

Source           Sum of squares   df              E[Mean square]           F
Rows (MR·∅)      ‖yR·∅‖²          r − 1           cNσ²A + σ²e              MSRows/MSE
Columns (MC·∅)   ‖yC·∅‖²          c − 1           rN‖β‖²/(c − 1) + σ²e     MSColumns/MSE
Error (M⊥R+C)    ‖y − yR+C‖²      n − r − c + 1   σ²e                      —
Total (M⊥∅)      ‖y − y∅‖²        n − 1           —                        —


10.1.1 Estimation and testing

Testing is the same as for the fixed effects additive case, except that for the rows, we are testing H0 : σ²A = 0. The estimation of contrasts in the βj's (or µ·j's) is also the same. That is, if c′β is a contrast, then its estimate is

c′β̂ = c1y·1· + · · · + ccy·c· = a′y, (10.19)

where a ∈ MC·∅. (Specifically, a = (1r ⊗ c ⊗ 1N)/rN.) Then

Cov[a′y] = a′Cov[y]a
         = a′(σ²A XRX′R + σ²e In)a
         = ‖a‖²σ²e, (10.20)

because a ∈ MC·∅ ⊥ MR, so that a′XR = 0′r. Note that ‖a‖² = r‖c‖²N/(rN)² = ‖c‖²/(rN), so that we have

se(c′β̂) = ‖c‖√(MSE/(rN)). (10.21)

Example. Five soybean treatments (one a control) were tested on each of five plots of land. That is, each of five plots of land had five batches of soybean seeds, each batch receiving one of the five treatments. The y is then the number of failures (out of 100 seeds) for that batch. Thus there are r = 5 plots, the random effects, and c = 5 treatments, the fixed effects. Each plot×treatment combination produced only one observation, so N = 1. (From Snedecor and Cochran, Statistical Methods, Sixth Edition, Table 11.2.1.)

Here is the ANOVA table:

Source       SS       df   MS      F
Plots        49.84    4    12.46   2.303
Treatments   83.84    4    20.96   3.874
Error        86.56    16   5.41    —
Total        220.24   24   —       —

The F4,16,0.05 = 3.007. Thus there is a significant treatment effect, but we can accept σ²A = 0. The latter means that the plots are reasonably homogeneous, so maybe next time it is not worth bothering to use randomized blocks, but just randomly place the 25 sets of seeds around the land.
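The F comparisons can be verified directly from the sums of squares in the table (a short Python/scipy sketch; the numbers are those given above):

from scipy import stats

sums_of_squares = {"Plots": (49.84, 4), "Treatments": (83.84, 4)}
sse, dfe = 86.56, 16
mse = sse / dfe

for source, (ss, df) in sums_of_squares.items():
    F = (ss / df) / mse
    cutoff = stats.f.ppf(0.95, df, dfe)   # F_{df,16,0.05}
    print(source, round(F, 3), round(cutoff, 3),
          "significant" if F > cutoff else "not significant")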

10.2 Two-way mixed model with interactions

Again we have the rows being randomly selected from I, and the columns being fixed treatments. With µIj being the mean of the Ith individual receiving the jth treatment, we again define

µ·j = EI[µIj], µ = (µ·1 + · · · + µ·c)/c, βj = µ·j − µ, (10.22)


and

α∗Ij = µIj − µ − βj. (10.23)

These definitions imply that

EI [α∗Ij ] = 0 for each j, and β1 + · · ·+ βc = 0. (10.24)

No interactions means the α∗Ij is the same for each j. Here we are not making that assumption, although we may wish to test that hypothesis. We can define the average effect for I to be

αI = α∗I· = (α∗I1 + · · · + α∗Ic)/c, (10.25)

and the interactions to be

γIj = α∗Ij − αI. (10.26)

Note that

EI[γIj] = 0 and γI1 + · · · + γIc = 0. (10.27)

Then the general model is

yIjk = µ + αI + βj + γIj + eIjk. (10.28)

Because I is random, αI and the γIj's are random. Unfortunately, we cannot justifiably assume they are independent. In particular, the γIj's cannot be independent (unless they are all 0) because they have to sum to 0 over j. In Section 10.2.2 we present an approach to the joint distribution of the αI and γIj's that yields fairly simple results. Here, we will present just the outcome.

The data are then I1, . . . , Ir, a simple random sample from I. Making a similar change in notation as from (9.54) to (9.55), the model is

yijk = µ + Ai + βj + Cij + eijk, i = 1, . . . , r; j = 1, . . . , c; k = 1, . . . , N. (10.29)

The Ai and Cij’s are random, the βj’s are fixed. In matrix form,

y = µ1n + XRA + XCβ + XR×CC + e. (10.30)

The ANOVA table (under assumptions from Section 10.2.2) is

Source                  Sum of squares    Degrees of freedom   E[Mean square]
Rows (random)           ‖yR·∅‖²           r − 1                cNσ²A + σ²e
Columns (fixed)         ‖yC·∅‖²           c − 1                rN‖β‖²/(c − 1) + Nσ²C + σ²e
Interactions (random)   ‖y(R×C)·(R+C)‖²   (r − 1)(c − 1)       Nσ²C + σ²e
Error                   ‖y − yR×C‖²       n − rc               σ²e
Total                   ‖y − y∅‖²         n − 1                —


It is helpful to compare the expected mean squares for different models. Below we have them (all in the two-way balanced ANOVA with interactions), with zero, one or two of the effects random:

Source         Rows fixed, Columns fixed      Rows random, Columns random   Rows random, Columns fixed
Rows           cN‖α‖²/(r − 1) + σ²e           cNσ²A + Nσ²C + σ²e            cNσ²A + σ²e
Columns        rN‖β‖²/(c − 1) + σ²e           rNσ²B + Nσ²C + σ²e            rN‖β‖²/(c − 1) + Nσ²C + σ²e
Interactions   N‖γ‖²/((r − 1)(c − 1)) + σ²e   Nσ²C + σ²e                    Nσ²C + σ²e
Error          σ²e                            σ²e                           σ²e

10.2.1 Estimation and testing

To test for interactions (σ²C = 0) or for block effects (σ²A = 0), the usual F tests apply, that is, MSInteractions/MSE and MSRows/MSE, respectively. For testing the fixed effects, which are usually the effects of most interest, the F uses the interaction mean square in the denominator:

H0 : β = 0c, F = MSColumns/MSInteractions ∼ Fc−1,(r−1)(c−1) under the null hypothesis. (10.31)

Turn to contrasts in the column effects, c′β with c1 + · · · + cc = 0. The estimate is the usual one,

c′β̂ = c1y·1· + · · · + ccy·c·. (10.32)

The standard error, however, uses the interaction mean square (not the MSE):

se(c′β̂) = ‖c‖√(MSInteractions/(rN)). (10.33)

See (10.49).
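A sketch of this contrast calculation in Python follows; the function, its arguments, and the example numbers are mine (hypothetical), only the formulas (10.32) and (10.33) come from the text:

import numpy as np

def contrast_estimate(col_means, cvec, ms_interactions, r, N):
    """Estimate and standard error of a contrast c'beta in the balanced
    two-way mixed model, using the interaction mean square as in (10.33)."""
    cvec = np.asarray(cvec, dtype=float)
    assert abs(cvec.sum()) < 1e-12                # must be a contrast
    estimate = cvec @ np.asarray(col_means)       # (10.32)
    se = np.linalg.norm(cvec) * np.sqrt(ms_interactions / (r * N))
    return estimate, se

# e.g., comparing the first two treatments (all numbers hypothetical):
est, se = contrast_estimate([10.2, 11.5, 9.8], [1, -1, 0], ms_interactions=4.0, r=5, N=2)

Since the standard error is based on MSInteractions, the natural reference distribution for a t interval would use its degrees of freedom, (r − 1)(c − 1), in the same way the litter mean square dictated 3 degrees of freedom in (9.44).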

10.2.2 Distributional assumptions on the random effects

One approach to modeling the joint distribution of the α∗Ij's from (10.23), α∗Ij = µIj − µ − βj, is to assume that the variances are equal (to σ², say), and the correlations are equal (to ρ). That is, letting α∗I = (α∗I1, . . . , α∗Ic)′, assume that

Cov[α∗I] = [ σ²    ρσ²   · · ·  ρσ²
             ρσ²   σ²    · · ·  ρσ²
              ⋮     ⋮     ⋱      ⋮
             ρσ²   ρσ²   · · ·  σ²  ].   (10.34)


Such matrices can be written as a linear combination of I and J, in this case

Cov[α∗I ] = (1 − ρ)σ2Ic + cρσ2Jc. (10.35)

This model is restrictive, but captures the idea that the effects are exchangeable, that is, the distribution is the same when permuting the order of the j's. It can make sense when there is no particular reason to believe that two treatments are any more or less related than any other two treatments. It may not be reasonable if the j's represent time points, and one expects measurements made close together in time are more highly correlated.

Now (10.25) and (10.26) show that

α∗I = αI1c + γI, (10.36)

where γI = (γI1, . . . , γIc)′. We can write the quantities on the right as

αI1c = Jcα∗I and γI = Hcα∗I. (10.37)

Then from (10.35),

Cov[αI1c] = Jc((1 − ρ)σ²Ic + cρσ²Jc)Jc = ((1 − ρ)σ² + cρσ²)Jc = σ²(1 + (c − 1)ρ)Jc,
Cov[γI]   = Hc((1 − ρ)σ²Ic + cρσ²Jc)Hc = σ²(1 − ρ)Hc. (10.38)

Also, Cov[αI1c, γI] = 0, so that the vectors are independent. For convenience, define σ²A = Var[αI] = σ²(1 + (c − 1)ρ)/c and σ²C = σ²(1 − ρ), so that

Cov[αI1c] = cσ²A Jc and Cov[γI] = σ²C Hc. (10.39)

Letting yI be the cN observations from individual I, that is, yI = (yI11, yI12, . . . , yI1N, . . . , yIc1, . . . , yIcN)′, we have that

yI = (1c ⊗ 1N)µ + (1c ⊗ 1N)αI + (Ic ⊗ 1N)β + (Ic ⊗ 1N)γI + eI, (10.40)

eI being the part of e for individual I. Then

Cov[yI] = Var[αI](1c ⊗ 1N)(1c ⊗ 1N)′ + (Ic ⊗ 1N)Cov[γI](Ic ⊗ 1N)′ + σ²e IcN
        = (Nc)σ²A(Jc ⊗ JN) + σ²C(Ic ⊗ 1N)(Hc ⊗ 1)(Ic ⊗ 1N)′ + σ²e IcN
        = (Nc)σ²A(Jc ⊗ JN) + Nσ²C(Hc ⊗ JN) + σ²e(Ic ⊗ IN). (10.41)


Turning to the data, we have I1, . . . , Ir, a random sample from I. We again model the individuals as being independent, so that with y = (y′I1, . . . , y′Ir)′, the overall covariance is Ir ⊗ Cov[yI] in (10.41):

Cov[y] = Ncσ²A(Ir ⊗ Jc ⊗ JN) + Nσ²C(Ir ⊗ Hc ⊗ JN) + σ²e(Ir ⊗ Ic ⊗ IN). (10.42)

Also, for the mean we use 1r ⊗ E[yI], so by (10.40),

E[y] = (1r ⊗ 1c ⊗ 1N)µ + (1r ⊗ Ic ⊗ 1N)β. (10.43)

For the projections, we round up the usual suspects from (9.66):

MR·∅ = Hr ⊗ Jc ⊗ JN

MC·∅ = Jr ⊗ Hc ⊗ JN

M(R×C)·(R+C) = Hr ⊗Hc ⊗ JN

In − MR×C = Ir ⊗ Ic ⊗HN .

(10.44)

For these, all the means are zero except for the second one:

E[yC·∅] = MC·∅E[y] = (Jr ⊗ Hc ⊗ JN)(1r ⊗ Ic ⊗ 1N)β = (1r ⊗ Hc ⊗ 1N)β, (10.45)

hence

‖E[yC·∅]‖² = rN‖β‖². (10.46)

Applying the projection matrices to the covariance in (10.42) yields

Cov[yR·∅] = Ncσ²A(Hr ⊗ Jc ⊗ JN) + 0 + σ²e(Hr ⊗ Jc ⊗ JN) = (Ncσ²A + σ²e)MR·∅,
Cov[yC·∅] = 0 + Nσ²C(Jr ⊗ Hc ⊗ JN) + σ²e(Jr ⊗ Hc ⊗ JN) = (Nσ²C + σ²e)MC·∅,
Cov[y(R×C)·(R+C)] = 0 + Nσ²C(Hr ⊗ Hc ⊗ JN) + σ²e(Hr ⊗ Hc ⊗ JN) = (Nσ²C + σ²e)M(R×C)·(R+C),
Cov[y − yR×C] = 0 + 0 + σ²e(Ir ⊗ Ic ⊗ HN) = σ²e(In − MR×C). (10.47)

The expected mean squares are then found by adding ‖E[projection]‖²/df to the σ² part of the covariances, i.e., (10.46) and (10.47) combine to show

Source                  Degrees of freedom   E[Mean square]
Rows (random)           r − 1                cNσ²A + σ²e
Columns (fixed)         c − 1                rN‖β‖²/(c − 1) + Nσ²C + σ²e
Interactions (random)   (r − 1)(c − 1)       Nσ²C + σ²e
Error                   n − rc               σ²e

It is also easy to show these projections are independent (because the projection matrices are orthogonal), hence we have the ANOVA table at the end of Section 10.2.


Finally, for contrast c′β, where a = (1r ⊗ c ⊗ 1N)/rN, we have by (10.42),

Var[a′y] = a′Cov[y]a
         = (1r ⊗ c ⊗ 1N)′(Ncσ²A(Ir ⊗ Jc ⊗ JN) + Nσ²C(Ir ⊗ Hc ⊗ JN) + σ²e(Ir ⊗ Ic ⊗ IN))(1r ⊗ c ⊗ 1N)/(rN)²
         = (0 + rN²‖c‖²σ²C + rN‖c‖²σ²e)/(rN)²
         = ‖c‖²(Nσ²C + σ²e)/(rN). (10.48)

We used the fact that the elements of c sum to 0, hence Hcc = c and Jcc = 0c. The estimated standard error then uses the interaction mean square, i.e.,

se(a′y) = ‖c‖√(MSInteractions/(rN)). (10.49)

Example. Problem 8.1 in Scheffe's The Analysis of Variance has data on flow rates of fuel through three types of nozzles, which are the fixed effects. Each of five operators (the random effects) tested each nozzle three times. Thus we have r = 3, c = 5, and N = 3. The ANOVA table is

Source                  SS       df   MS      E[MS]                         F
Nozzles (fixed)         1427.0   2    713.5   cN‖α‖²/(r − 1) + Nσ²C + σ²e   3.13
Operators (random)      798.8    4    199.7   rNσ²B + σ²e                   1.97
Interactions (random)   1821.5   8    227.7   Nσ²C + σ²e                    2.25
Error                   3038.0   30   101.3   σ²e                           —
Total                   7085.2   44   —       —                             —
(10.50)

To test for nozzle effects, we use F = MSNozzle/MSInteractions = 3.13. The F2,8,0.05 = 4.459, so this effect is not significant. To test the operator effect, F = MSOperators/MSE = 1.97, which does not look significant (F4,30,0.05 = 2.690). For interactions, F8,30,0.05 = 2.266, so it is just barely not significant. It is close enough to be leery of assuming the interactions are 0. If you could assume that, then the F for nozzles would be MSNozzle/MSE = 7.04, which is very significant (F2,30,0.05 = 3.316).
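The comparisons above can be reproduced with a few lines of code (Python/scipy, using the mean squares from (10.50)):

from scipy import stats

ms_nozzles, ms_operators, ms_inter, mse = 713.5, 199.7, 227.7, 101.3

F_nozzles = ms_nozzles / ms_inter      # fixed effect: interaction MS in the denominator
F_operators = ms_operators / mse       # random effect: MSE in the denominator
F_inter = ms_inter / mse

print(round(F_nozzles, 2), round(stats.f.ppf(0.95, 2, 8), 3))      # 3.13 vs 4.459
print(round(F_operators, 2), round(stats.f.ppf(0.95, 4, 30), 3))   # 1.97 vs 2.690
print(round(F_inter, 2), round(stats.f.ppf(0.95, 8, 30), 3))       # 2.25 vs 2.266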


Chapter 11

Three- (and higher-) way ANOVA

There is no problem handling models with more than two factors. If all factors are fixed, then the usual linear model theory applies. If some, or all, factors are random, then things become more complicated, but in principle everything we have done so far carries over. We will indicate the factors by A, B, C, etc., rather than rows, columns, . . .. The number of levels will be the respective lower-case letters, that is, a levels of factor A, b levels of factor B, etc.

The three-way ANOVA is given by

yijkl = µijk + eijkl, i = 1, . . . , a; j = 1, . . . , b; k = 1, . . . , c; l = 1, . . . , Nijk. (11.1)

The µijk is the average for observations at level i of factor A, level j of factor B, and level k of factor C. The Nijk is the number of observations at that combination of levels.

The simplest model is the additive one, which, as in the two-way ANOVA, means that each factor adds the same amount no matter the levels of the other factors, so that

µijk = µ + αi + βj + γk (11.2)

for some parameters αi, βj, γk. These parameters (effects) are not estimable without some constraints. Effects can be defined in terms of the means. The simplest is to take straight averages. A dot indicates averaging over that index, e.g.,

µ·j· = (Σ_{i=1}^{a} Σ_{k=1}^{c} µijk)/(ac). (11.3)

Then the overall mean is µ = µ···, and the main effects are

αi = µi·· − µ
βj = µ·j· − µ
γk = µ··k − µ. (11.4)

These definitions imply the constraints

Σ_{i=1}^{a} αi = Σ_{j=1}^{b} βj = Σ_{k=1}^{c} γk = 0. (11.5)


Violations of the additive model can take a number of different forms. Two-way interactions are the nonadditive parts of the µij·, µi·k and µ·jk's. That is, if the µij·'s satisfied an additive model, then we would have µij· = µ + αi + βj. The AB interactions are the differences µij· − (µ + αi + βj). In order not to run out of Greek letters, we denote these interactions (αβ)ij. This (αβ) is to be considered a single symbol, not the product of two parameters. Then

(αβ)ij = µij· − (µ + αi + βj)

(αγ)ik = µi·k − (µ + αi + γk)

(βγ)jk = µ·jk − (µ + βj + γk), (11.6)

which can also be written

(αβ)ij = µij· − µi·· − µ·j· + µ···

(αγ)ik = µi·k − µi·· − µ··k + µ···

(βγ)jk = µ·jk − µ·j· − µ··k + µ···. (11.7)

These parameters sum to zero over either of their indices, e.g.,

Σ_{i=1}^{a} (αβ)ij = Σ_{j=1}^{b} (αβ)ij = 0, (11.8)

and similarly for the others. The model with all two-way interactions is then

µijk = µ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk. (11.9)

The model (11.9) is not the saturated model, that is, it may not hold. The difference between µijk and the sum of those parameters is called the three-way interaction:

(αβγ)ijk = µijk − (µ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk)

= µijk − µij· − µi·k − µ·jk + µi·· + µ·j· + µ··k − µ···. (11.10)

A non-zero two-way interaction, say between factors A and B, means that the effect of factor A can be different for different levels of factor B, and vice versa. A non-zero three-way interaction means that the effect of factor A can be different for each combination of levels of B and C; or that the effect of factor B can be different for each combination of levels of A and C; or that the effect of factor C can be different for each combination of levels of A and B. For purposes of interpretation, it is easiest if there are few such high-order interactions, although in some cases these interactions may be very important. For example, which drug is best may depend on a combination of sex, age, and race.

The saturated model, which puts no restrictions on the means, is then

µijk = µ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk + (αβγ)ijk. (11.11)


Adding in the errors, it is, from (11.1),

yijkl = µ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk + (αβγ)ijk + eijkl. (11.12)

Writing it out in matrix form, it is

y = 1nµ + XAα + XBβ + XCγ + XA×B(αβ) + XA×C(αγ) + XB×C(βγ) + XA×B×C(αβγ) + e. (11.13)

Rather than formally describe these matrices in general, we look at the balanced case, so that they can be given using Kronecker products. There are four components to each matrix, one for each factor, then the 1N for the repetitions. For each matrix, the relevant factors' components receive identity matrices, the others receive vectors of ones:

1n = 1a ⊗ 1b ⊗ 1c ⊗ 1N

XA = Ia ⊗ 1b ⊗ 1c ⊗ 1N

XB = 1a ⊗ Ib ⊗ 1c ⊗ 1N

XC = 1a ⊗ 1b ⊗ Ic ⊗ 1N

XA×B = Ia ⊗ Ib ⊗ 1c ⊗ 1N

XA×C = Ia ⊗ 1b ⊗ Ic ⊗ 1N

XB×C = 1a ⊗ Ib ⊗ Ic ⊗ 1N

XA×B×C = Ia ⊗ Ib ⊗ Ic ⊗ 1N

(11.14)

11.1 Hierarchy of models

In the two-way ANOVA, the models considered preserve a certain hierarchy: If any of the main effects is in the model, µ is in the model, and if there is two-way interaction, then the two main effects are in the model. For three-way models, the same type of conditions are typically invoked:

• If a main effect is in the model, then µ is in the model;

• If a two-way interaction is in the model, then the two corresponding main effects are in the model (e.g., if (αγ)ik is in the model, so are αi and γk);

• If the three-way interaction is in the model, then all two-way interactions are in the model (so that it is the saturated model).

(How many such models are there?)


Each model can be described by giving which factors and interactions are present. Here are some examples (and notation):

Model                  Vector space       µijk
Saturated              MA×B×C             µ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk + (αβγ)ijk
All two-way            MA×B+B×C+A×C       µ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk
Additive               MA+B+C             µ + αi + βj + γk
No A × B interaction   MB×C+A×C           µ + αi + βj + γk + (αγ)ik + (βγ)jk
A × B interaction      MA×B+C             µ + αi + βj + γk + (αβ)ij
No B effect            MA×C               µ + αi + γk + (αγ)ik
No effects             M∅                 µ
(11.15)

There is a myriad of testing problems, with the null model nested in the alternative model. For example, testing for three-way interactions has the all two-way interaction model (11.9) as null, and the saturated model as alternative:

H0 : MA×B+B×C+A×C versus HA : MA×B×C. (11.16)

Testing for additivity given no three-way interaction has the additive model (11.2) as null and the all two-way model as alternative:

H0 : MA+B+C versus HA : MA×B+B×C+A×C . (11.17)

Generally, the idea is to find the simplest model that still fits. That is, if a parameter is found to be significant, then it stays in the model, along with the other parameters it implies. That is, if (αβ)ij is found significant, then the αi's and βj's must be in the model, whether significant or not. Testing all the potential models against each other can be time consuming (although software these days can do it in seconds), and can also end up with more than one model. In the balanced case, it is much easier, because each parameter can be tested on its own.

11.1.1 Orthogonal spaces in the balanced case

In the balanced case, the saturated model space can be decomposed into orthogonal spaces, each one corresponding to a set of effects. In the two-way balanced ANOVA, we have

MA×B = M∅ + MA·∅ + MB·∅ + M(A×B)·(A+B), (11.18)

where all the spaces on the right-hand side are orthogonal. The corresponding equation for the three-way balanced model is

MA×B×C = M∅ + MA·∅ + MB·∅ + MC·∅

+ M(A×B)·(A+B) + M(A×C)·(A+C) + M(B×C)·(B+C)

+ M(A×B×C)·(A×B+B×C+A×C). (11.19)


Again, the spaces on the right are all orthogonal. It is easiest to see by finding the projection matrices, which are given using Kronecker products with H's in the relevant slots, and J's elsewhere. That is,

Projection matrix                            df
M∅ = Ja ⊗ Jb ⊗ Jc ⊗ JN                       1
MA·∅ = Ha ⊗ Jb ⊗ Jc ⊗ JN                     a − 1
MB·∅ = Ja ⊗ Hb ⊗ Jc ⊗ JN                     b − 1
MC·∅ = Ja ⊗ Jb ⊗ Hc ⊗ JN                     c − 1
M(A×B)·(A+B) = Ha ⊗ Hb ⊗ Jc ⊗ JN             (a − 1)(b − 1)
M(A×C)·(A+C) = Ha ⊗ Jb ⊗ Hc ⊗ JN             (a − 1)(c − 1)
M(B×C)·(B+C) = Ja ⊗ Hb ⊗ Hc ⊗ JN             (b − 1)(c − 1)
M(A×B×C)·(A×B+B×C+A×C) = Ha ⊗ Hb ⊗ Hc ⊗ JN   (a − 1)(b − 1)(c − 1)
M⊥A×B×C = Ia ⊗ Ib ⊗ Ic ⊗ HN                  n − abc
(11.20)

These matrices can be found using the techniques for the two-way model. The degrees of freedom are easy to find by taking traces, recalling that traces of a Kronecker product multiply, and that trace(Jk) = 1 and trace(Hk) = k − 1.
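The trace calculation parallels the two-way check given earlier; here is a brief Python/numpy sketch (a = 2, b = 3, c = 4, N = 2 are my own illustrative sizes):

import numpy as np
from functools import reduce

def J(k): return np.ones((k, k)) / k
def H(k): return np.eye(k) - J(k)
def kron(*ms): return reduce(np.kron, ms)

a, b, c, N = 2, 3, 4, 2
dfs = {
    "mean":  np.trace(kron(J(a), J(b), J(c), J(N))),    # 1
    "A":     np.trace(kron(H(a), J(b), J(c), J(N))),    # a - 1
    "AB":    np.trace(kron(H(a), H(b), J(c), J(N))),    # (a - 1)(b - 1)
    "ABC":   np.trace(kron(H(a), H(b), H(c), J(N))),    # (a - 1)(b - 1)(c - 1)
    "error": np.trace(kron(np.eye(a), np.eye(b), np.eye(c), H(N))),   # n - abc
}
print({name: int(round(v)) for name, v in dfs.items()})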

From this table, it is easy to construct the ANOVA table. If all effects are fixed, then the F's all use the MSE in the denominator. (The MS column is left out.)

Source            Sum of squares              Degrees of freedom      F
Main A effect     ‖yA·∅‖²                     a − 1                   MSA/MSE
Main B effect     ‖yB·∅‖²                     b − 1                   MSB/MSE
Main C effect     ‖yC·∅‖²                     c − 1                   MSC/MSE
AB interaction    ‖y(A×B)·(A+B)‖²             (a − 1)(b − 1)          MSAB int/MSE
AC interaction    ‖y(A×C)·(A+C)‖²             (a − 1)(c − 1)          MSAC int/MSE
BC interaction    ‖y(B×C)·(B+C)‖²             (b − 1)(c − 1)          MSBC int/MSE
ABC interaction   ‖y(A×B×C)·(A×B+B×C+A×C)‖²   (a − 1)(b − 1)(c − 1)   MSABC int/MSE
Error             ‖y − yA×B×C‖²               n − abc                 —
Total             ‖y − y∅‖²                   n − 1                   —

11.2 Example

Snedecor and Cochran (Statistical Methods) has an example on food supplements for pigs' corn. The factors are A: Lysine at a = 4 levels, B: Methionine at b = 3 levels, and C: Soybean Meal at c = 2 levels. Here, N = 1, so we will assume there is no three-way interaction, and use ‖y − yA×B+B×C+A×C‖² for the error sum of squares. The measurement is the average weight gain. Next is the ANOVA table, along with the relevant F cutoff points:


Source                     SS        df   MS        F       Fdf,6,0.05
Lysine                     213.28    3    71.09     1.25    4.757
Methionine                 262.77    2    131.39    2.30    5.143
Soy Meal                   2677.59   1    2677.59   46.92   5.987
Lysine×Methionine Int.     1271.81   6    211.97    3.71    4.284
Lysine×Soy Meal Int.       1199.53   3    399.84    7.01    4.757
Methionine×Soy Meal Int.   410.81    2    205.41    3.60    5.143
Error                      342.44    6    57.07     —       —
Total                      6378.24   23   —         —       —

Looking at the two-way interactions, only the Lysine×Soy Meal interaction is significant. Of the main effects, only the Soy Meal is significant, and it is very significant. But, because Lysine does appear in a significant interaction, we have to include the Lysine main effect. Methionine does not appear in any significant effect, so we can drop it. Thus the best model appears to include Lysine, Soy Meal, and their interaction,

yijk = µ + αi + γk + (αγ)ik + eijk. (11.21)

