
1

Lecture 2: Linear Models

Bruce Walsh lecture notes, SISG - Mixed Model Course

version 28 June 2012

2

Quick Review of the Major Points

The general linear model can be written as

y = Xβ + e

• y = vector of observed dependent values

• X = Design matrix: observations of the variables in the assumed linear model

• β = vector of unknown parameters to estimate

• e = vector of residuals (deviations from the model fit), e = y - Xβ

3

y = Xβ + e

The solution for β depends on the covariance structure (= covariance matrix) of the vector e of residuals

• OLS: e ~ MVN(0, σ² I)
• Residuals are homoscedastic and uncorrelated, so that we can write the covariance matrix of e as Cov(e) = σ² I
• The OLS estimate: OLS(β) = (X^T X)^-1 X^T y

Ordinary least squares (OLS)

• GLS: e ~ MVN(0, V)
• Residuals are heteroscedastic and/or dependent
• The GLS estimate: GLS(β) = (X^T V^-1 X)^-1 X^T V^-1 y (both estimators are illustrated in the sketch below)

Generalized least squares (GLS)
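To make the two estimators concrete, here is a minimal NumPy sketch (not from the lecture; the toy X, y, and assumed covariance V are made up for illustration) computing both the OLS and GLS estimates with the formulas above.

```python
import numpy as np

# Hypothetical toy data: 5 observations, intercept + one predictor
X = np.array([[1, 2.0], [1, 3.0], [1, 5.0], [1, 7.0], [1, 9.0]])
y = np.array([4.1, 5.8, 9.7, 14.2, 18.3])

# OLS: assumes Cov(e) = sigma^2 * I
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: assumes Cov(e) = V (here an assumed heteroscedastic diagonal V)
V = np.diag([1.0, 1.0, 2.0, 2.0, 4.0])
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

print("OLS estimate:", beta_ols)
print("GLS estimate:", beta_gls)
```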

4

BLUE

• Both the OLS and GLS solutions are also called the Best Linear Unbiased Estimator (or BLUE for short)

• Whether the OLS or GLS form is used depends on the assumed covariance structure for the residuals
  – Special case of Var(e) = σ²_e I -- OLS
  – All others, i.e., Var(e) = R -- GLS


5

y = µ + β1 x1 + β2 x2 + · · · + βn xn + e

Linear Models

One tries to explain a dependent variable y as a linear function of a number of independent (or predictor) variables.

A multiple regression is a typical linear model,

Here e is the residual, or deviation between the true value observed and the value predicted by the linear model.

The (partial) regression coefficients are interpreted as follows: a unit change in xi, while holding all other variables constant, results in a change of βi in y.

6

Linear Models

As with a univariate regression (y = a + bx + e), the model parameters are typically chosen by least squares, wherein they are chosen to minimize the sum of squared residuals, e^T e = Σ ei²

This unweighted sum of squared residuals assumes an OLS error structure, so all residuals are equally weighted (homoscedastic) and uncorrelated

If the residuals differ in variances and/or some are correlated (GLS conditions), then we need to minimize the weighted sum e^T V^-1 e, which removes correlations and gives all residuals equal variance.

7

Predictor and Indicator Variables

yij = µ + si + eij

yij = trait value of offspring j from sire i

µ = overall mean. This term is included to give the si terms a mean value of zero, i.e., they are expressed as deviations from the mean

si = The effect for sire i (the mean of its offspring). Recall that the variance in the si estimates Cov(half sibs) = V_A/4

eij = The deviation of the jth offspring from the family mean of sire i. The variance of the e's estimates the within-family variance.

Suppose we measure the offspring of p sires. One such linear model is the sire model above, yij = µ + si + eij.

8

Predictor and Indicator Variables

In a regression, the predictor variables are typically continuous, although they need not be.

yij = µ + si + eij

Note that the predictor variables here are the si (the value associated with sire i), something that we are trying to estimate

We can write this in linear model form by using indicator variables,

xik = 1 if sire k = i, and 0 otherwise


9

Models consisting entirely of indicator variables are typically called ANOVA, or analysis of variance, models

Models that contain no indicator variables (other than for the mean), but rather consist of observed values of continuous or discrete variables, are typically called regression models

Both are special cases of the General Linear Model (or GLM)

yijk = µ + si + dij + βxijk + eijk

Example: Nested half sib/full sib design with an age correction β on the trait

10

yijk = µ + si + dij + βxijk + eijk

Here the si + dij terms are the ANOVA part of the model, while the βxijk term is the regression part.

Example: Nested half sib/full sib design with an age correction β on the trait

si = effect of sire i
dij = effect of dam j crossed to sire i
xijk = age of the kth offspring from the i x j cross

11

Linear Models in Matrix Form

Suppose we have 3 variables in a multiple regression, with four (y, x) vectors of observations.

In matrix form, y = Xβ + e

y = [ y1 ]    β = [ µ  ]    e = [ e1 ]    X = [ 1  x11  x12  x13 ]
    [ y2 ]        [ β1 ]        [ e2 ]        [ 1  x21  x22  x23 ]
    [ y3 ]        [ β2 ]        [ e3 ]        [ 1  x31  x32  x33 ]
    [ y4 ]        [ β3 ]        [ e4 ]        [ 1  x41  x42  x43 ]

The design matrix X. Details of both the experimental design and the observed values of the predictor variables all reside solely in X

yi = µ + β1 xi1 + β2 xi2 + β3 xi3 + ei

12

The Sire Model in Matrix Form

The model here is yij = µ + si + eij

Consider three sires. Sire 1 has 2 offspring, sire 2 has one, and sire 3 has three. The GLM form is

y = [ y11 ]    X = [ 1  1  0  0 ]    β = [ µ  ]    e = [ e11 ]
    [ y12 ]        [ 1  1  0  0 ]        [ s1 ]        [ e12 ]
    [ y21 ]        [ 1  0  1  0 ]        [ s2 ]        [ e21 ]
    [ y31 ]        [ 1  0  0  1 ]        [ s3 ]        [ e31 ]
    [ y32 ]        [ 1  0  0  1 ]                      [ e32 ]
    [ y33 ]        [ 1  0  0  1 ]                      [ e33 ]

We still need to specify the covariance structure of the residuals. For example, there could be common-family effects that make some of them correlated.
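As an illustration only (not part of the lecture), the indicator-coded design matrix above could be assembled programmatically; the offspring counts are the 2/1/3 layout from the example.

```python
import numpy as np

# Offspring counts per sire, as in the example: sire 1 has 2, sire 2 has 1, sire 3 has 3
n_offspring = [2, 1, 3]
n_sires = len(n_offspring)
n_obs = sum(n_offspring)

# Column of 1s for the mean, plus one indicator column per sire
X = np.zeros((n_obs, 1 + n_sires))
X[:, 0] = 1.0
row = 0
for i, n_i in enumerate(n_offspring):
    X[row:row + n_i, 1 + i] = 1.0  # x_ik = 1 if the observation comes from sire i
    row += n_i

print(X)  # matches the 6 x 4 design matrix shown above
```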


13

In-class Exercise

Suppose you measure height and sprint speed for five individuals, with heights (x) of 9, 10, 11, 12, 13 and associated sprint speeds (y) of 60, 138, 131, 170, 221

1) Write in matrix form (i.e., the design matrix X and the vector β of unknowns) the following models

• y = bx
• y = a + bx
• y = bx²
• y = a + bx + cx²

2) Using the X and y associated with these models, compute the OLS BLUE, β̂ = (X^T X)^-1 X^T y, for each
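One possible solution sketch for part 2, using the heights and speeds given in the exercise (the four models are from the slide; the code is simply one way to carry out the computation):

```python
import numpy as np

x = np.array([9.0, 10.0, 11.0, 12.0, 13.0])
y = np.array([60.0, 138.0, 131.0, 170.0, 221.0])
ones = np.ones_like(x)

# Design matrices for the four models listed above
designs = {
    "y = bx":            np.column_stack([x]),
    "y = a + bx":        np.column_stack([ones, x]),
    "y = bx^2":          np.column_stack([x**2]),
    "y = a + bx + cx^2": np.column_stack([ones, x, x**2]),
}

for name, X in designs.items():
    # OLS BLUE: beta = (X^T X)^-1 X^T y
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print(name, "->", np.round(beta, 3))
```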

14

Rank of the design matrix

• With n observations and p unknowns, X is an n x p matrix, so that X^T X is p x p

• Thus, X can provide unique estimates for at most p < n parameters

• The rank of X is the number of independent rows of X. If X is of full rank, then rank = p

• A parameter is said to be estimable if we can provide a unique estimate of it. If the rank of X is k < p, then exactly k parameters are estimable (some as linear combinations, e.g. β1 - 3β3 = 4)

• If det(X^T X) = 0, then X is not of full rank

• The number of nonzero eigenvalues of X^T X gives the rank of X (illustrated numerically in the sketch below).
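A small numerical illustration of these rank checks, using a hypothetical rank-deficient design matrix (the example matrix is made up, not from the lecture):

```python
import numpy as np

# A hypothetical rank-deficient design: the third column is the sum of the first two
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 5.0]])

XtX = X.T @ X
eigvals = np.linalg.eigvalsh(XtX)              # eigenvalues of the symmetric matrix X^T X
rank_from_eigs = np.sum(eigvals > 1e-10)       # number of (numerically) nonzero eigenvalues
print("eigenvalues of X^T X:", np.round(eigvals, 6))
print("rank from eigenvalues:", rank_from_eigs)
print("np.linalg.matrix_rank(X):", np.linalg.matrix_rank(X))
print("det(X^T X):", np.linalg.det(XtX))       # ~0, so X is not of full rank
```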

15

Experimental design and X

• The structure of X determines not only which parameters are estimable, but also the expected sample variances, as Var(β) = k (X^T X)^-1

• Experimental design determines the structure of X before an experiment (of course, missing data almost always means the final X is different from the proposed X)

• Different criteria are used for an optimal design (compared numerically in the sketch below). Let V = (X^T X)^-1. The idea is to choose a design for X, given the constraints of the experiment, that:
  – A-optimality: minimizes tr(V)
  – D-optimality: minimizes det(V)
  – E-optimality: minimizes the leading eigenvalue of V
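A minimal sketch comparing two hypothetical designs under these criteria (the designs and numbers are invented for illustration):

```python
import numpy as np

def design_criteria(X):
    """Return (A, D, E) criteria for V = (X^T X)^-1: tr(V), det(V), leading eigenvalue of V."""
    V = np.linalg.inv(X.T @ X)
    eigvals = np.linalg.eigvalsh(V)
    return np.trace(V), np.linalg.det(V), eigvals.max()

# Two hypothetical one-factor designs with n = 6: clustered vs spread-out x values
X_clustered = np.column_stack([np.ones(6), [4.5, 5.0, 5.0, 5.5, 5.5, 6.0]])
X_spread    = np.column_stack([np.ones(6), [1.0, 3.0, 5.0, 6.0, 8.0, 10.0]])

for name, X in [("clustered", X_clustered), ("spread", X_spread)]:
    A, D, E = design_criteria(X)
    print(f"{name}: tr(V)={A:.4f}  det(V)={D:.6f}  max eigenvalue(V)={E:.4f}")
```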

16

Ordinary Least Squares (OLS)

When the covariance structure of the residuals has a certain form, we solve for the vector β using OLS

If the residuals are homoscedastic and uncorrelated, σ²(ei) = σ²_e and σ(ei, ej) = 0. Hence, each residual is equally weighted.

The sum of squared residuals can be written as

Σ_{i=1}^{n} êi² = ê^T ê = (y - Xβ)^T (y - Xβ)

where Xβ is the predicted value of the y's.

If the residuals follow a MVN distribution, the OLS solution is also the ML solution.


17

Ordinary Least Squares (OLS)

Taking (matrix) derivatives shows that the sum of squared residuals

Σ_{i=1}^{n} êi² = ê^T ê = (y - Xβ)^T (y - Xβ)

is minimized by

β̂ = (X^T X)^-1 X^T y

This is the OLS estimate of the vector β.

The variance-covariance estimate for the sample estimates is

V_β = (X^T X)^-1 σ²_e

The ij-th element gives the covariance between the estimates of βi and βj.

18

Sample Variances/Covariances

The residual variance can be estimated as

σ̂²_e = [1 / (n - rank(X))] Σ_{i=1}^{n} êi²

The estimated residual variance can be substituted into

V_β = (X^T X)^-1 σ²_e

to give an approximation for the sampling variances and covariances of our estimates.

Confidence intervals follow since the vector of estimates ~ MVN(β, V_β)
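A short sketch of these formulas on simulated data (the data-generating values are arbitrary; the 1.96 multiplier assumes a normal approximation for the confidence intervals):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data generated from a known model, just to exercise the formulas
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
beta_true = np.array([2.0, 0.5])
y = X @ beta_true + rng.normal(0, 1.0, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# sigma^2_e estimated as the sum of squared residuals / (n - rank(X))
df = n - np.linalg.matrix_rank(X)
sigma2_e = resid @ resid / df

# V_beta = (X^T X)^-1 sigma^2_e ; the diagonal gives the sampling variances of the estimates
V_beta = np.linalg.inv(X.T @ X) * sigma2_e
se = np.sqrt(np.diag(V_beta))

# Approximate 95% confidence intervals from the MVN sampling distribution
for b, s in zip(beta_hat, se):
    print(f"estimate {b:.3f}  +/- {1.96 * s:.3f}")
```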

19

Example: Regression Through the Origin

yi = β xi + ei

Here X = (x1, x2, ..., xn)^T,  y = (y1, y2, ..., yn)^T,  β = (β)

X^T X = Σ_{i=1}^{n} xi²,   X^T y = Σ_{i=1}^{n} xi yi

β̂ = (X^T X)^-1 X^T y = Σ xi yi / Σ xi²

σ²(β̂) = (X^T X)^-1 σ²_e = σ²_e / Σ xi²

σ̂²_e = [1 / (n - 1)] Σ (yi - β̂ xi)²

giving the estimated variance  σ̂²(β̂) = [Σ (yi - β̂ xi)² / (n - 1)] / Σ xi²
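A quick numerical check of the regression-through-the-origin formulas, on a small made-up data set:

```python
import numpy as np

# Small hypothetical data set for y_i = beta * x_i + e_i
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

beta_hat = np.sum(x * y) / np.sum(x**2)             # beta_hat = sum(x_i y_i) / sum(x_i^2)
sigma2_e = np.sum((y - beta_hat * x)**2) / (n - 1)  # residual variance with n-1 df
var_beta = sigma2_e / np.sum(x**2)                  # sigma^2(beta_hat) = sigma^2_e / sum(x_i^2)

print("beta_hat =", round(beta_hat, 4))
print("var(beta_hat) =", round(var_beta, 6))
```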

20

Polynomial Regressions

The GLM can easily handle any function of the observed predictor variables, provided the parameters to estimate are still linear, e.g. y = α + β1 f(x) + β2 g(x) + … + e

Quadratic regression:

yi = α + β1 xi + β2 xi² + ei

β = [ α  ]    X = [ 1  x1  x1² ]
    [ β1 ]        [ 1  x2  x2² ]
    [ β2 ]        [ ...        ]
                  [ 1  xn  xn² ]
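A minimal sketch of fitting this quadratic regression by OLS (the data are invented; the point is that the design matrix simply gains an x² column):

```python
import numpy as np

# Hypothetical data with curvature
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 5.1, 10.3, 16.8, 26.0, 37.1])

# Design matrix [1, x, x^2]: the model is still linear in (alpha, beta1, beta2)
X = np.column_stack([np.ones_like(x), x, x**2])
alpha, beta1, beta2 = np.linalg.solve(X.T @ X, X.T @ y)
print(f"alpha={alpha:.3f}, beta1={beta1:.3f}, beta2={beta2:.3f}")
```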


21

Interaction Effects

Interaction terms (e.g. sex x age) are handled similarly

With x1 held constant, a unit change in x2 changes y by β2 + β3 x1 (i.e., the slope in x2 depends on the current value of x1)

Likewise, a unit change in x1 changes y by β1 + β3 x2

yi = α + β1 xi1 + β2 xi2 + β3 xi1 xi2 + ei

β = [ α  ]    X = [ 1  x11  x12  x11·x12 ]
    [ β1 ]        [ 1  x21  x22  x21·x22 ]
    [ β2 ]        [ ...                  ]
    [ β3 ]        [ 1  xn1  xn2  xn1·xn2 ]
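A brief sketch (simulated data, arbitrary coefficients) showing that after fitting the interaction model, the estimated slope in x2 is β2 + β3 x1 and so changes with x1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1, n)
# Hypothetical generating model with an interaction term
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 3.0 * x1 * x2 + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
alpha, b1, b2, b3 = np.linalg.solve(X.T @ X, X.T @ y)

# The slope in x2 depends on the current value of x1: beta2 + beta3 * x1
for x1_val in (0.0, 0.5, 1.0):
    print(f"x1 = {x1_val}: slope in x2 = {b2 + b3 * x1_val:.3f}")
```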

22

The GLM lets you build your own model!

• Suppose you want a quadratic regression forced through the origin where the slope of the quadratic term can vary over the sexes

• yi = β1 xi + β2 xi² + β3 si xi²

• si is an indicator (0/1) variable for the sex (0 = male, 1 = female).
  – Male slope = β2
  – Female slope = β2 + β3

23

Fixed vs. Random Effects

In linear models we are trying to accomplish two goals: estimating the values of the model parameters and estimating any appropriate variances.

For example, in the simplest regression model, y = α + βx + e, we estimate the values for α and β and also the variance of e. We, of course, can also estimate the ei = yi - (α + β xi)

Note that α and β are fixed constants we are trying to estimate (fixed factors or fixed effects), while the ei values are drawn from some probability distribution (typically normal with mean 0, variance σ²_e). The ei are random effects.

24

“Mixed” models contain both fixed and random factors

This distinction between fixed and random effects is extremely important in terms of how we analyze a model. If a parameter is a fixed constant we wish to estimate, it is a fixed effect. If a parameter is drawn from some probability distribution and we are trying to make inferences about either the distribution and/or specific realizations from this distribution, it is a random effect.

We generally speak of estimating fixed factors and predicting random effects.

y = Xb + Zu + e,  u ~ MVN(0, R),  e ~ MVN(0, σ²_e I)

Key: one needs to specify the covariance structures for the mixed model


25

Example: Sire model

yij = µ + si + eij

Is the sire effect s fixed or random?

It depends. If we have (say) 10 sires, and we are ONLY interested in the values of these particular 10 sires and don't care to make any other inferences about the population from which the sires are drawn, then we can treat them as fixed effects. In this case, the model is fully specified by the covariance structure for the residuals. Thus, we need to estimate µ, s1 to s10, and σ²_e, and we write the model as yij = µ + si + eij, σ²(e) = σ²_e I

Here µ is a fixed effect, e a random effect

26

yij = µ + si + eij

Conversely, if we are not only interested in these 10 particular sires but also wish to make some inference about the population from which they were drawn (such as the additive variance, since σ²_A = 4σ²_s), then the si are random effects. In this case we wish to estimate µ and the variances σ²_s and σ²_e. Since 2si also estimates (or predicts) the breeding value for sire i, we also wish to estimate (predict) these as well. Under a random-effects interpretation, we write the model as yij = µ + si + eij, σ²(e) = σ²_e I, σ²(s) = σ²_A A

27

Generalized Least Squares (GLS)

Suppose the residuals no longer have the same variance (i.e., display heteroscedasticity). Clearly we do not wish to minimize the unweighted sum of squared residuals, because those residuals with smaller variance should receive more weight.

Likewise, in the event the residuals are correlated, we also wish to take this into account (i.e., perform a suitable transformation to remove the correlations) before minimizing the sum of squares.

Either of the above settings leads to a GLS solution in place of an OLS solution.

28

In the GLS setting, the covariance matrix for the vector e of residuals is written as R, where Rij = σ(ei, ej)

The linear model becomes y = Xβ + e, cov(e) = R

The GLS solution for β is

b = (X^T R^-1 X)^-1 X^T R^-1 y

The variance-covariance matrix of the estimated model parameters is given by

V_b = (X^T R^-1 X)^-1 σ²_e


29

Example

One common setting is where the residuals are uncorrelated but heteroscedastic, with σ²(ei) = σ²_e / wi.

For example, sample i is the mean value of ni individuals, with σ²(ei) = σ²_e / ni. Here wi = ni.

Consider the model yi = α + β xi. Here

β = [ α ]    X = [ 1  x1 ]
    [ β ]        [ ...   ]
                 [ 1  xn ]

R = Diag(w1^-1, w2^-1, ..., wn^-1),  giving  R^-1 = Diag(w1, w2, ..., wn)

This gives

X^T R^-1 X = w [ 1   xw  ]      X^T R^-1 y = w [ yw  ]
               [ xw  x2w ]                     [ xyw ]

where

w = Σ_{i=1}^{n} wi,   xw = Σ wi xi / w,   x2w = Σ wi xi² / w,
yw = Σ wi yi / w,   xyw = Σ wi xi yi / w

This gives the GLS estimators of α and β as

a = yw - b xw,    b = (xyw - xw yw) / (x2w - xw²)

Likewise, the resulting variances and covariance for these estimators are

σ²(b) = σ²_e / [w (x2w - xw²)]
σ(a, b) = -σ²_e xw / [w (x2w - xw²)]
σ²(a) = σ²_e x2w / [w (x2w - xw²)]
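A sketch of this weighted (GLS) example on made-up data, checking that the closed-form estimators above agree with the general matrix formula b = (X^T R^-1 X)^-1 X^T R^-1 y:

```python
import numpy as np

# Hypothetical example: each y_i is a mean of n_i individuals, so w_i = n_i
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 4.1, 4.6, 5.8])
w = np.array([2.0, 5.0, 3.0, 10.0, 4.0])   # weights (e.g., group sizes n_i)

# Closed-form weighted estimators from the slide
W = w.sum()
xw, yw = (w * x).sum() / W, (w * y).sum() / W
x2w, xyw = (w * x**2).sum() / W, (w * x * y).sum() / W
b = (xyw - xw * yw) / (x2w - xw**2)
a = yw - b * xw

# General GLS matrix formula with R = Diag(1/w_i), so R^-1 = Diag(w_i)
X = np.column_stack([np.ones_like(x), x])
Rinv = np.diag(w)
beta_gls = np.linalg.solve(X.T @ Rinv @ X, X.T @ Rinv @ y)

print("closed form: a =", round(a, 4), " b =", round(b, 4))
print("matrix GLS: ", np.round(beta_gls, 4))
```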

32

Chi-square and F distributions

Let Ui ~ N(0,1), i.e., a unit normal

The sum U1² + U2² + … + Uk² is a chi-square random variable with k degrees of freedom

Under appropriate normality assumptions, the sums of squares that appear in linear models are also chi-square distributed. In particular,

Σ_{i=1}^{n} (xi - x̄)² ~ σ² χ²_{n-1}

The ratio of two chi-squares is an F distribution.


33

In particular, an F distribution with k numerator degrees of freedom and n denominator degrees of freedom is given by

(χ²_k / k) / (χ²_n / n) ~ F_{k,n}

Thus, F distributions frequently arise in tests of linear models, as these usually involve ratios of sums of squares.

34

Sums of Squares in linear models

The total sum of squares SST of a linear model can be written as the sum of the error (or residual) sum of squares and the model (or regression) sum of squares

SST = SSM + SSE

where SST = Σ(yi - ȳ)²,  SSM = Σ(ŷi - ȳ)²,  SSE = Σ(yi - ŷi)²

r², the coefficient of determination, is the fraction of variation accounted for by the model:

r² = SSM/SST = 1 - SSE/SST

35

SSE = Σ_{i=1}^{n} (yi - ŷi)² = Σ_{i=1}^{n} êi²

SST = Σ_{i=1}^{n} (yi - ȳ)² = Σ_{i=1}^{n} yi² - n ȳ² = Σ_{i=1}^{n} yi² - (1/n) (Σ_{i=1}^{n} yi)²

Sums of Squares are quadratic products

We can write these as quadratic products, where J is a matrix all of whose elements are 1's:

SST = y^T y - (1/n) y^T J y = y^T (I - (1/n) J) y

SSE = y^T (I - X (X^T X)^-1 X^T) y

SSM = SST - SSE = y^T (X (X^T X)^-1 X^T - (1/n) J) y
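A small numerical check of these quadratic-product forms on simulated data (the identity SST = SSM + SSE and r² should hold regardless of the particular numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
X = np.column_stack([np.ones(n), rng.uniform(0, 5, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1.0, n)

J = np.ones((n, n))                       # matrix of all 1s
H = X @ np.linalg.inv(X.T @ X) @ X.T      # projection matrix X (X^T X)^-1 X^T
I = np.eye(n)

SST = y @ (I - J / n) @ y
SSE = y @ (I - H) @ y
SSM = y @ (H - J / n) @ y

print("SST =", round(SST, 3), " SSM + SSE =", round(SSM + SSE, 3))
print("r^2 =", round(SSM / SST, 3), " = 1 - SSE/SST =", round(1 - SSE / SST, 3))
```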

36

Expected value of sums of squares

• In ANOVA tables, the E(MS), or expected value of the Mean Squares (scaled SS, or Sums of Squares), often appears

• This directly follows from the quadratic product. If E(x) = µ and Var(x) = V, then
  – E(x^T A x) = tr(AV) + µ^T A µ


37

Hypothesis testing

Provided the residual errors in the model are MVN, then for a model with n observations and p estimated parameters,

SSE / σ²_e ~ χ²_{n-p}

Consider the comparison of a full (p parameters) and a reduced (q < p parameters) model, where SSEr = error SS for the reduced model and SSEf = error SS for the full model

[(SSEr - SSEf) / (p - q)] / [SSEf / (n - p)] = [(n - p) / (p - q)] (SSEr/SSEf - 1)

The difference in the error sums of squares for the full and reduced models provides a test of whether the model fits are the same.

This ratio follows an F_{p-q, n-p} distribution.
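A sketch of this full-versus-reduced F test on simulated data, using SciPy's F distribution for the tail probability (the models and data here are hypothetical, chosen only to exercise the formula):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(4)
n = 40
x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(0, 0.5, n)

def sse(X, y):
    # error sum of squares for an OLS fit of y on X
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ beta
    return r @ r

X_full = np.column_stack([np.ones(n), x1, x2])   # full model: p = 3 parameters
X_red  = np.column_stack([np.ones(n), x1])       # reduced model: q = 2 parameters
p, q = X_full.shape[1], X_red.shape[1]

SSE_f, SSE_r = sse(X_full, y), sse(X_red, y)
F = ((SSE_r - SSE_f) / (p - q)) / (SSE_f / (n - p))
p_value = f.sf(F, p - q, n - p)   # upper-tail probability of F_{p-q, n-p}
print(f"F = {F:.3f}, p-value = {p_value:.4g}")
```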

38

Does our model account for a significant fraction of the variation?

Here the reduced model is just yi = µ + ei

In this case, the error sum of squares for the reduced model is just the total sum of squares, and the F-test ratio becomes

[(n - p) / (p - 1)] (SST/SSEf - 1) = [(n - p) / (p - 1)] [r² / (1 - r²)]

This ratio follows an F_{p-1, n-p} distribution.

39

Model diagnostics

• It’s all about the residuals

• Plot the residuals
  – Quick and easy screen for outliers

• Test for normality among the estimated residuals
  – Q-Q plot
  – Shapiro-Wilk test
  – If non-normal, try transformations, such as log

40

OLS, GLS summary


41

Different statistical models

• GLM = general linear model
  – OLS, ordinary least squares: e ~ MVN(0, cI)
  – GLS, generalized least squares: e ~ MVN(0, R)

• Mixed models
  – Both fixed and random effects (beyond the residual)

• Mixture models
  – A weighted mixture of distributions

• Generalized linear models
  – Nonlinear functions, non-normality

42

Mixture models

• Under a mixture model, an observation potentially comes from one of several different distributions, so that the density function is π1 φ1 + π2 φ2 + π3 φ3
  – The mixture proportions πi sum to one
  – The φi represent different distributions, e.g., normal with mean µi and variance σ²_i

• Mixture models come up in QTL mapping -- an individual could have QTL genotype QQ, Qq, or qq
  – See Lynch & Walsh Chapter 13

• They also come up in codon models of evolution, where a site may be neutral, deleterious, or advantageous, each with a different distribution of selection coefficients
  – See Walsh & Lynch (volume 2A website), Chapters 10, 11

43

Generalized linear models

Typically assume a non-normal distribution for the residuals, e.g., Poisson, binomial, gamma, etc.

44

Different methods of analysis

• Parameters of these various models can be estimated in a number of frameworks

• Method of moments
  – Very few assumptions about the underlying distribution. Typically, some statistic has an expected value equal to the parameter
  – OLS & GLS are examples. We only need the assumption on the covariance structure of the residuals and finite moments
  – While estimation does not require distributional assumptions, confidence intervals and hypothesis testing do

• Distribution-based estimation
  – The explicit form of the distribution is used


45

Distribution-based estimation

• Maximum likelihood estimation
  – MLE

– REML

– More in Lynch & Walsh Appendix 3

• Bayesian
  – Marginal posteriors

– Conjugating priors

– MCMC/Gibbs sampling

– More in Walsh & Lynch Appendices 2,3

46

Maximum Likelihood

p(x1, …, xn | θ) = density of the observed data (x1, …, xn) given the (unknown) distribution parameter(s) θ

Fisher suggested the method of maximum likelihood --- given the data (x1, …, xn), find the value(s) of θ that maximize p(x1, …, xn | θ)

We usually express p(x1, …, xn | θ) as a likelihood function l(θ | x1, …, xn) to remind us that it is dependent on the observed data

The Maximum Likelihood Estimator (MLE) of θ is the value (or values) of θ that maximizes the likelihood function l given the observed data x1, …, xn.

47

[Figure: the likelihood curve l(θ | x) plotted against θ, with its peak at the MLE of θ]

The curvature of the likelihood surface in the neighborhood of the MLE informs us as to the precision of the estimator. A narrow peak = high precision. A broad peak = lower precision.

This is formalized by looking at the log-likelihood surface, L = ln[l(θ | x)]. Since ln is a monotonic function, the value of θ that maximizes l also maximizes L.

The larger the curvature, the smaller the variance

Var(MLE) = -1 / [∂² L(µ | z) / ∂µ²]

48

Likelihood Ratio tests

Hypothesis testing in the ML framework occurs through likelihood-ratio (LR) tests

For large sample sizes, the LR (generally) approaches a chi-square distribution with r df (r = number of parameters assigned fixed values under the null)

LR = -2 ln[ l(θ̂_r | z) / l(θ̂ | z) ] = -2 [ L(θ̂_r | z) - L(θ̂ | z) ]

θ̂_r is the MLE under the restricted conditions (some parameters specified, e.g., var = 1)

θ̂ is the MLE under the unrestricted conditions (no parameters specified)
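A minimal sketch of an LR test for normal data, with the variance fixed at 1 under the restricted model (so r = 1); the data are simulated and the setup is illustrative, not taken from the lecture:

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(5)
z = rng.normal(loc=0.5, scale=1.5, size=100)

# Unrestricted MLEs for a normal: sample mean and the ML (ddof=0) standard deviation
mu_hat, sigma_hat = z.mean(), z.std()
logL_full = norm.logpdf(z, mu_hat, sigma_hat).sum()

# Restricted model: variance fixed at 1 (one parameter fixed, so r = 1 df)
logL_restricted = norm.logpdf(z, z.mean(), 1.0).sum()

LR = -2.0 * (logL_restricted - logL_full)
p_value = chi2.sf(LR, df=1)
print(f"LR = {LR:.3f}, p-value = {p_value:.4g}")
```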


49

Likelihoods for GLMs

Under the assumption of MVN, x ~ MVN(β, V), the likelihood function becomes

L(β, V | x) = (2π)^(-n/2) |V|^(-1/2) exp[ -(1/2) (x - β)^T V^(-1) (x - β) ]

Variance components (e.g., σ²_A, σ²_e, etc.) are included in V

REML = restricted maximum likelihood. It is the method of choice for variance components, as it maximizes that part of the likelihood function that is independent of the fixed effects, β.

50

Bayesian Statistics

An extension of likelihood is Bayesian statistics

p(θ | x) = C * l(x | θ) p(θ)

Instead of simply obtaining a point estimate (e.g., the MLE), the goal is to estimate the entire distribution for the unknown parameter θ given the data x

p(θ | x) is the posterior distribution for θ given the data x

l(x | θ) is just the likelihood function

p(θ) is the prior distribution on θ.

51

Bayesian Statistics

Why Bayesian?

• Exact for any sample size

• Marginal posteriors

• Efficient use of any prior information

• MCMC (such as Gibbs sampling) methods

Priors quantify the strength of any prior information. Often these are taken to be diffuse (with a high variance), so prior weights on θ spread over a wide range of possible values.

52

Marginal posteriors

• Oftentimes we are interested in a particular set of parameters (say some subset of the fixed effects). However, we also have to estimate all of the other parameters.

• How do uncertainties in these nuisance parameters factor into the uncertainty in the parameters of interest?

• A Bayesian marginal posterior takes this into account by integrating the full posterior over the nuisance parameters

• While this sounds complicated, it is easy to do with MCMC.


53

Conjugating priors

For any particular likelihood, we can often find a conjugating prior, such that the product of the likelihood and the prior returns a known distribution.

Example: For the mean µ in a normal, taking the prior on the mean to also be normal returns a posterior for µ that is normal.

Example: For the variance σ² in a normal, taking the prior on the variance to be an inverse chi-square distribution returns a posterior for σ² that is also an inverse chi-square (details in WL Appendix 2).

54

A normal prior on the mean has mean µ0 and variance σ0² (the larger σ0², the more diffuse the prior).

If the likelihood for the mean is a normal distribution, the resulting posterior is also normal.

Note that if σ0² is large, the mean of the posterior is very close to the sample mean.
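The slide's posterior formula did not survive extraction, so the sketch below uses the standard normal-normal update (a textbook result, not copied from the slide); the helper name and the numbers are illustrative only. It shows the point made above: with a diffuse prior (large σ0²) the posterior mean is essentially the sample mean.

```python
def normal_posterior(xbar, n, sigma2, mu0, tau0_2):
    """Standard normal-normal update (textbook result, not the slide's equation):
    data likelihood xbar ~ N(mu, sigma2/n), prior mu ~ N(mu0, tau0_2)."""
    prec = n / sigma2 + 1.0 / tau0_2                       # posterior precision
    post_var = 1.0 / prec
    post_mean = post_var * (n * xbar / sigma2 + mu0 / tau0_2)
    return post_mean, post_var

# Hypothetical numbers: a diffuse prior (large tau0_2) pulls the posterior mean to the sample mean
print(normal_posterior(xbar=10.0, n=25, sigma2=4.0, mu0=0.0, tau0_2=1e6))
print(normal_posterior(xbar=10.0, n=25, sigma2=4.0, mu0=0.0, tau0_2=0.5))
```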

55

If x follows a chi-square distribution, then 1/x follows an inverse chi-square distribution.

The scaled inverse chi-square has two parameters, allowing more control over the mean and variance of the prior

56

MCMC

Analytic expressions for posteriors can be complicated, but the method of MCMC (Markov Chain Monte Carlo) is a general approach to simulating draws for just about any distribution (details in WL Appendix 3).

Generating several thousand such draws from the posterior returns an empirical distribution that we can use.

For example, we can compute a 95% credible interval, the region of the distribution containing 95% of the probability.


57

Gibbs Sampling

• A very powerful version of MCMC is the Gibbs Sampler

• Assume we are sampling from a vector of parameters, but that the full conditional distribution of each parameter is known

• For example, given current values for all the fixed effects (but one, say β1) and the variances, conditioning on these values the distribution of β1 is a normal, whose parameters are now functions of the current values of the other parameters. A random draw is then generated from this distribution.

• Likewise, conditioning on all the fixed effects and all variances but one, the distribution of this variance is an inverse chi-square

58

This generates one cycle of the sampler. Using these new values, a second cycle is generated.

Full details in WL Appendix 3.
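A minimal Gibbs sampler for the mean and variance of normal data, as an illustration of the cycling described above. The flat prior on µ and the 1/σ² prior are assumptions made here (not taken from the lecture) so that the full conditionals are normal and (scaled) inverse chi-square, and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(loc=3.0, scale=2.0, size=50)   # hypothetical data
n, ybar = len(y), y.mean()

# Assumed priors: flat on mu, ~1/sigma^2 on the variance, giving full conditionals
#   mu | sigma^2, y  ~ Normal(ybar, sigma^2 / n)
#   sigma^2 | mu, y  ~ scaled inverse chi-square (drawn here via an inverse gamma)
n_iter, burn_in = 5000, 1000
mu, sigma2 = 0.0, 1.0                         # starting values
draws = np.empty((n_iter, 2))

for t in range(n_iter):
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))               # draw mu given sigma^2
    ss = np.sum((y - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(shape=n / 2.0, scale=2.0 / ss)  # draw sigma^2 given mu
    draws[t] = mu, sigma2

post = draws[burn_in:]
print("posterior mean of mu:", post[:, 0].mean())
print("95% credible interval for mu:", np.percentile(post[:, 0], [2.5, 97.5]))
```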

