Page 1: Multiple comparisons Basics of linear regression Model selection.

• Multiple comparisons

• Basics of linear regression

• Model selection

Page 2: Multiple comparisons Basics of linear regression Model selection.

Multiple comparisons

• One comparison - use t-test or equivalent

• Few comparisons - use Bonferroni

• Many comparisons - Tukey’s honest significant differences, Holm, Scheffe or others

Page 3: Multiple comparisons Basics of linear regression Model selection.

Bonferroni correction

If there is only one comparison then we can use a t-test or intervals based on the t distribution. However, as the number of tests increases, the probability that a significant effect will be observed when there is no real effect becomes very large. It can be calculated using 1-(1-α)^n, where α is the significance level and n is the number of comparisons. For example, if the significance level is 0.05 and the number of comparisons (tests) is 10, then the probability that at least one significant effect will be detected by chance is 1-(1-0.05)^10 ≈ 0.40. Bonferroni suggested using α/n instead of α when designing simultaneous confidence intervals. It means that the intervals will be calculated using

μ̂_i − μ̂_j ± t_(α/(2n)) · (se of comparison)

Clearly, when n becomes very large these intervals become very conservative. The Bonferroni correction is recommended when only a few effects are compared. pairwise.t.test in R can perform the Bonferroni correction.

If the Bonferroni correction is used then the p-values are multiplied by the number of comparisons. (Note that if we are testing the effects of I levels of a factor then the number of comparisons is I(I-1)/2.)
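As a minimal sketch (not from the slides), the same adjustment can be applied directly to a vector of raw p-values with R's p.adjust; the values below are the raw pairwise p-values from the example on the next page:

p.raw <- c(0.32844, 3.3e-05, 0.00074)   # raw pairwise p-values (see the example below)
p.adjust(p.raw, method = "bonferroni")  # each p-value is multiplied by 3 and capped at 1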

Page 4: Multiple comparisons Basics of linear regression Model selection.

Bonferroni correction: Example

Let us take the example dataset poisons and try the Bonferroni correction for each factor:

pairwise.t.test(poisons$time, poisons$poison, p.adjust.method = "none")   # poisons data from the boot package

1 2

2 0.32844 -

3 3.3e-05 0.00074

pairwise.t.test(poisons$time, poisons$poison, p.adjust.method = "bonferroni")

1 2

2 0.9853 -

3 1e-04 0.0022

As can be seen, each p-value is multiplied by the number of comparisons, 3*2/2 = 3. If the corresponding adjusted p-value becomes larger than one then it is truncated to one.

It says that there are significant differences between the effects of poisons 1 and 3 and between 2 and 3. The difference between the effects of poisons 1 and 2 is not significant.

Page 5: Multiple comparisons Basics of linear regression Model selection.

Holm correction

Another correction for multiple tests, Holm's correction, is less conservative than the Bonferroni correction. It is a modification of the Bonferroni correction and works in a sequential manner.

Let us say we need to make n comparisons and the significance level is α. Then we calculate the p-values for all of them, sort them in ascending order and apply the following procedure:

• Set i = 1.

• If (n-i+1)·p_i < α then comparison i is significant, otherwise it is not.

• If comparison number i is significant then increment i by one and, if i ≤ n, go to step 2.

The number of significant effects will be equal to i where the procedure stops.

When reporting p-values, the Holm correction works similarly to Bonferroni but in a sequential manner. If we have m comparisons then the smallest p-value is multiplied by m, the second smallest is multiplied by m-1, and the j-th smallest is multiplied by m-j+1.
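A minimal sketch (not from the slides) of this sequential adjustment, using the raw pairwise p-values from the example on the next page; the hand-made version should agree with R's built-in p.adjust:

p.raw <- c(0.32844, 3.3e-05, 0.00074)                         # raw pairwise p-values
m <- length(p.raw)
o <- order(p.raw)                                             # sort in ascending order
p.holm <- pmin(1, cummax(p.raw[o] * (m - seq_along(o) + 1)))[order(o)]
p.holm
p.adjust(p.raw, method = "holm")                              # built-in equivalent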

Page 6: Multiple comparisons Basics of linear regression Model selection.

Holm correction: example

Let us take the example data set poisons and try the Holm correction for each factor:

pairwise.t.test(poisons$time, poisons$poison, p.adjust.method = "none")

1 2

2 0.32844 -

3 3.3e-05 0.00074

pairwise.t.test(poisons$time, poisons$poison, p.adjust.method = "holm")   # this correction is the default in R

1 2

2 0.3284 -

3 1e-04 0.0015

The smallest p-value is multiplied by 3, the second by 2 and the largest by 1.

It says that there are significant differences between the effects of poisons 1 and 3 and between 2 and 3. The difference between the effects of poisons 1 and 2 is not significant.

Page 7: Multiple comparisons Basics of linear regression Model selection.

Tukey’s honest significant difference

This test is used to calculate simultaneous confidence intervals for differences of all effects.

Tukey’s range distribution: if we have a random sample of size N from a normal distribution, then the distribution of the studentised range, (max_i(x_i) − min_i(x_i))/sd, is called Tukey’s distribution.

Let us say we want to test whether μ_i − μ_j is 0. For simultaneous 100α% confidence intervals we need to calculate, for all pairs, the lower and upper limits of the interval using:

difference ± q_(α,l,ν) · sd · (1/J_i + 1/J_j)^0.5 / √2

Where q_(α,l,ν) is the α-quantile of Tukey’s distribution, J_i and J_j are the numbers of observations used to calculate the estimates of μ_i and μ_j, sd is the standard deviation, l is the number of levels to be compared, and ν is the number of degrees of freedom used to calculate sd.
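A minimal sketch of this interval half-width in R, using qtukey for the quantile of the studentised range distribution; the numbers (sd, degrees of freedom, group sizes) are assumed for illustration:

alpha <- 0.95    # confidence level (assumed)
l     <- 3       # number of levels compared (assumed)
nu    <- 42      # degrees of freedom used to estimate sd (assumed)
s     <- 0.3     # estimated standard deviation (assumed)
Ji <- Jj <- 16   # observations per level (assumed)
q <- qtukey(alpha, nmeans = l, df = nu)   # alpha-quantile of Tukey's distribution
q * s * sqrt(1/Ji + 1/Jj) / sqrt(2)       # half-width of the simultaneous interval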

Page 8: Multiple comparisons Basics of linear regression Model selection.

Tukey’s honest significant difference

The R command to perform this test is TukeyHSD. It takes an object produced by aov as an input and gives confidence intervals for all possible differences. For example, for the poison data (if you want to use this command you should use aov for the analysis):

lm1 = aov(time~poison+treat,data=poisons)
TukeyHSD(lm1)

$poison
         diff        lwr        upr     p adj
2-1 -0.073125 -0.2089936  0.0627436 0.3989657   # insignificant
3-1 -0.341250 -0.4771186 -0.2053814 0.0000008   # significant
3-2 -0.268125 -0.4039936 -0.1322564 0.0000606   # significant

$treat
           diff         lwr         upr     p adj
B-A  0.36250000  0.18976135  0.53523865 0.0000083   # significant
C-A  0.07833333 -0.09440532  0.25107198 0.6221729   # insignificant
D-A  0.22000000  0.04726135  0.39273865 0.0076661   # significant
C-B -0.28416667 -0.45690532 -0.11142802 0.0004090   # significant
D-B -0.14250000 -0.31523865  0.03023865 0.1380432   # insignificant
D-C  0.14166667 -0.03107198  0.31440532 0.1416151   # insignificant

Page 9: Multiple comparisons Basics of linear regression Model selection.

More tests

Mann-Whitney-Wilcoxon test – non-parametric test for comparing the medians of two samples

Kolmogorov-Smirnov test – tests whether the distributions of two random variables are the same

Grubbs test – tests for outliers (more in the R package outliers); the corresponding R calls are sketched below
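A minimal sketch of the corresponding R calls; x and y are illustrative samples (not from the slides), and grubbs.test requires the outliers package to be installed:

set.seed(1)
x <- rnorm(20); y <- rnorm(20)   # illustrative samples
wilcox.test(x, y)                # Mann-Whitney-Wilcoxon test
ks.test(x, y)                    # Kolmogorov-Smirnov test
outliers::grubbs.test(x)         # Grubbs test for an outlier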

Page 10: Multiple comparisons Basics of linear regression Model selection.

Basics of linear regression analysis

• Purpose of linear models

• Least-squares solution for linear models

• Analysis of diagnostics

Page 11: Multiple comparisons Basics of linear regression Model selection.

Reason for linear models

The purpose of regression is to reveal statistical relations between input and output variables. Statistics cannot reveal a functional relationship; that is the purpose of other scientific studies. Statistics can help to validate various functional relationships (models). Let us assume that we suspect that the functional relationship has the form

y = f(x; β) + ε

where β is a vector of unknown parameters, x = (x1, x2, ..., xp) is a vector of controllable parameters, y is the output, and ε is an error associated with the experiment. Then we can set up experiments for various values of x and get the output (or response) for them. If the number of experiments is n then we will have n output values. Denote them as a vector y = (y1, y2, ..., yn). The purpose of statistics is to evaluate the parameter vector β using the input and output values. If the function f is a linear function of the parameters and the errors are additive then we are dealing with a linear model. For this model we can write

y = β0 + β1 x1 + β2 x2 + ... + βp xp + ε

A linear model is linear in the parameters but not necessarily in the input variables. For example

y = β0 + β1 x + β2 x^2 + ε

is a linear model, but

y = β0 + exp(β1 x) + ε

is not.

Page 12: Multiple comparisons Basics of linear regression Model selection.

Assumptions

Basic assumptions for analysis of linear model are:

1) the model is linear in parameters

2) the error structure is additive

3) Random errors have 0 mean, equal variances and they are uncorrelated.

These assumptions are sufficient to deal with linear models. The uncorrelated and equal-variance assumptions (number 3) can be relaxed, but then the treatment becomes a little more complicated.

Note that for the general solution the assumption of normally distributed errors is not used. This assumption is needed to design test statistics. If it breaks down then we can use the bootstrap or other approximate techniques to design a test statistic.

These assumptions can be written in vector form:

y = Xβ + ε,   E(ε) = 0,   V(ε) = σ^2 I

where y, 0 and ε are vectors and I and X are matrices. The matrix X is called the design matrix, input matrix etc. I is the n×n identity matrix.

Page 13: Multiple comparisons Basics of linear regression Model selection.

Solution for the general case

The problem of fitting a linear model to the data is reduced to the minimisation:

min over β of (y − Xβ)'(y − Xβ)

Its solution is:

β̂ = (X'X)^(-1) X'y

This estimator is unbiased:

E(β̂) = β

The covariance matrix of the estimator is:

V(β̂) = σ^2 (X'X)^(-1)

And the variance of the error is estimated as:

s^2 = (y − Xβ̂)'(y − Xβ̂)/(n − p)

Note that the covariance of the estimator does not depend on the observed responses directly; it only depends on the design matrix, i.e. on the amount of data and how it has been collected (the design of the experiment is important).
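A minimal sketch (not from the slides) of these formulas in R, checked against lm on the built-in women data:

X <- cbind(1, women$height)                  # design matrix with an intercept column
y <- women$weight
beta.hat <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^(-1) X'y
res <- y - X %*% beta.hat
s2  <- sum(res^2) / (nrow(X) - ncol(X))      # unbiased estimate of the error variance
cov.beta <- s2 * solve(t(X) %*% X)           # estimated covariance matrix of beta.hat
beta.hat
coef(lm(weight ~ height, data = women))      # should agree with beta.hat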

Page 14: Multiple comparisons Basics of linear regression Model selection.

Standard diagnostics

Before starting to model:

• Visualisation of the data: plotting predictors vs observations. These plots may give a clue about the relationship and about outliers.

• Smoothers.

After modelling and fitting:

• Fitted values vs residuals. This may help to identify outliers and to check the correctness of the model.

• Normal Q-Q plot of residuals. This may help to check the distribution assumptions.

• Cook’s distance: reveals outliers and checks the correctness of the model.

• Model assumptions: t-tests given by the summary of the lm fit.

Checking the model and designing tests:

• Cross-validation. If you have a choice of models then cross-validation may help to choose the “best” model.

Page 15: Multiple comparisons Basics of linear regression Model selection.

Visualisation prior to modelling

Different types of datasets may require different visualisation tools. For simple visualisation either plot(data) or pairs(data, panel=panel.smooth) can be used. Visualisation prior to modelling may help to propose a model (form of the functional relationship between input and output, probability distribution of observations, etc.).

For example, for the dataset women, where weights and heights for 15 cases have been measured, the plot and pairs commands produce scatter plots of the data:
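A minimal sketch of the commands mentioned above, applied to the built-in women dataset:

plot(women)                          # scatter plot of height vs weight
pairs(women, panel = panel.smooth)   # pairwise plots with a smoother added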

Page 16: Multiple comparisons Basics of linear regression Model selection.

After modelling: linear models

After modelling, the results should be analysed. For example:

attach(women)

lm1 = lm(weight~height)

This means that we want a linear model (we believe that the dependence of weight on height is linear):

weight = β0 + β1·height

Results could be viewed using

lm1

summary(lm1)

The last command will also produce the significance of the various coefficients. Significance levels produced by summary should be treated carefully: if there are many coefficients then the chance that at least one “significant” effect is observed by chance is very high.

Page 17: Multiple comparisons Basics of linear regression Model selection.

After modelling: linear models

It is a good idea to plot the data, the fitted model and the differences between fitted and observed values on the same graph. For linear models with one predictor this can be done using:

plot(height,weight)

abline(lm1)

segments(height,fitted(lm1),height,weight)

This plot already shows some systematic differences. It is an indication that the model may need to be revised.

Page 18: Multiple comparisons Basics of linear regression Model selection.

Checking validity of the model: standard tools

Plotting fitted values vs residuals, the Q-Q plot and Cook’s distances can give some insight into the model and how to improve it. Some of these plots can be produced using:

plot(lm1)

Page 19: Multiple comparisons Basics of linear regression Model selection.

Confidence and prediction bands

Confidence bands are bands within which the true regression line lies with probability 1-α.

Prediction bands are bands within which a future observation will fall with probability 1-α.

Prediction bands are wider than confidence bands.

Page 20: Multiple comparisons Basics of linear regression Model selection.

Prediction and confidence bands

lm1 = lm(height~weight)

pp = predict(lm1,interval='p')

pc = predict(lm1,interval='c')

plot(weight,height,ylim=range(height,pp))

n1=order(weight)

matlines(weight[n1],pp[n1,],lty=c(1,2,2),col='red')

matlines(weight[n1],pc[n1,],lty=c(1,3,3),col='red')

These commands produce two sets of bands, narrow and wide. The narrow band corresponds to the confidence band and the wide band is the prediction band.

Page 21: Multiple comparisons Basics of linear regression Model selection.

Analysis of diagnostics

Residuals and hat matrix: residuals are the differences between the observations and the fitted values:

r = y − Xβ̂ = (I − H)y,   H = X(X'X)^(-1)X'

H is called the hat matrix. The diagonal terms h_i are the leverages of the observations. If these values are close to one then the corresponding fitted value is determined almost entirely by that observation. Sometimes h_i' = h_i/(1-h_i) is used to enhance high leverages.

The Q-Q plot can be used to check the normality assumption. If the assumption about the distribution is correct then this plot should be nearly linear. If the distribution is normal then tests designed for normal distributions can be used; otherwise the bootstrap may be more useful to derive the desired distributions.

Page 22: Multiple comparisons Basics of linear regression Model selection.

Analysis of diagnostics: Cont.

Other analysis tools include the standardised residuals, the studentised residuals and Cook’s distance:

r_i' = r_i / (s (1 − h_i)^0.5),   r_i* = r_i / (s_i (1 − h_i)^0.5),   D_i = r_i'^2 h_i / (p (1 − h_i))

where h_i is the leverage, h_i' is the enhanced leverage, s^2 is the unbiased estimator of σ^2, and s_i^2 is the unbiased estimator of σ^2 after removal of the i-th observation.
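These quantities do not have to be computed by hand; a minimal sketch of the standard R helpers for a fitted lm object:

lm1 <- lm(weight ~ height, data = women)
hatvalues(lm1)        # leverages h_i
rstandard(lm1)        # standardised residuals (use s)
rstudent(lm1)         # studentised residuals (use s_i, the i-th observation removed)
cooks.distance(lm1)   # Cook's distances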

Page 23: Multiple comparisons Basics of linear regression Model selection.

R-squared

One of the statistics used for goodness of fit is R-squared. It is defined as:

R^2 = 1 − RSS/TSS = 1 − Σ(y_i − ŷ_i)^2 / Σ(y_i − ȳ)^2

Adjusted R-squared:

R^2_adj = 1 − (1 − R^2)(n − 1)/(n − p)

where n is the number of observations and p is the number of parameters.
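A minimal sketch computing both statistics by hand for a simple fit and comparing them with the values reported by summary:

lm1 <- lm(weight ~ height, data = women)
rss <- sum(residuals(lm1)^2)
tss <- sum((women$weight - mean(women$weight))^2)
n <- nrow(women); p <- length(coef(lm1))
r2     <- 1 - rss/tss
r2.adj <- 1 - (1 - r2) * (n - 1) / (n - p)
c(r2, r2.adj)
c(summary(lm1)$r.squared, summary(lm1)$adj.r.squared)   # should agree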

Page 24: Multiple comparisons Basics of linear regression Model selection.

Several potential problems with linear model and simple regression analysis

1) Interpretation: regression analysis can tell us that there is a relationship between input and output. It does not say that one causes the other.

2) Outliers: use robust M-estimators.

3) Singular or near-singular cases: ridge regression or SVD. In underdetermined systems partial least squares may be helpful.

4) Distribution assumptions: generalised linear models, random or mixed effect models, data transformation.

5) Non-linear models: Box-Cox transformation, or non-linear model fitting, for example using the R command nlm.

Page 25: Multiple comparisons Basics of linear regression Model selection.

Model Selection

• Model selection based on t-distribution

• Information criteria

• Cross-validation

Page 26: Multiple comparisons Basics of linear regression Model selection.

Introduction

When modelling the relationship between input and output variables we need to make a decision about the form of the model. If there are several input parameters then one modelling decision could be: which input variables are important in predicting the output? Let us say that we have several input variables and the proposed model is linear:

y=β0 + β1 x1+β2 x2 + ε

If some of the coefficients are 0 then the corresponding input variables have no effect on predicting the response y.

Another situation may arise when we want to explain the output using polynomials of the input variables:

y = β0 + β1 x1 + β2 x1^2 + ε

In these cases we may want to know if some of the coefficients are 0. E.g. if β2 is 0 then the relationship between input and output is linear; otherwise the relationship may be a higher-order polynomial. In general, if we have functions of increasing complexity g_k then functions of higher complexity will typically give better agreement between input and output. One way of measuring agreement is the residual sum of squares (RSS):

RSS = Σ_i (y_i − g_k(x_i))^2

Where i is the observation number and k is the function (model) number.
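A minimal sketch (the rss helper is ours, not from the slides) showing how the RSS decreases as polynomial models of increasing order are fitted to the women data:

rss  <- function(fit) sum(residuals(fit)^2)                          # helper function
fits <- lapply(1:5, function(k) lm(weight ~ poly(height, k), data = women))
sapply(fits, rss)     # RSS decreases as the model becomes more complex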

Page 27: Multiple comparisons Basics of linear regression Model selection.

Introduction

One very simple way of testing whether a particular coefficient of the linear model is 0 is to test the hypothesis:

H0:βj=0 versus H1:βj≠0

As we already know, to do this test we need to know the distribution of the estimated parameters. We know that for large samples the distribution of the parameters is asymptotically normal. We also know that the covariance matrix of this distribution is:

V(β̂) = σ^2 (X'X)^(-1)

where all x_i0 are 1. The number of observations is n and the number of parameters is p. Under H0 the distribution of β̂_j is normal with mean 0 and variance equal to the corresponding diagonal term of the covariance matrix. Therefore the distribution of

β̂_j / se(β̂_j),   se(β̂_j) = (s^2 (X'X)^(-1)_jj)^0.5

is a t-distribution with n-p degrees of freedom.
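A minimal sketch computing this t-statistic by hand from the estimated covariance matrix; the results should agree with the summary tables shown on the following pages:

lm2  <- lm(weight ~ height + I(height^2), data = women)
se   <- sqrt(diag(vcov(lm2)))                      # standard errors of the coefficients
tval <- coef(lm2) / se                             # t-statistics
pval <- 2 * pt(-abs(tval), df = df.residual(lm2))  # two-sided p-values
cbind(tval, pval)                                  # compare with summary(lm2)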

Page 28: Multiple comparisons Basics of linear regression Model selection.

Example

Let us take the example of weight vs height and look at the results of fitting linear models using polynomials of the input variable (height) up to fifth order. The plots show that a second-order polynomial fits sufficiently well; after the second order the improvements are marginal. The natural question is which model is better: can we justify a more complex model based on the observations we have?

Page 29: Multiple comparisons Basics of linear regression Model selection.

Model selection using t-distribution: summary of lm method

The distribution of the estimated parameters is used in the summary of the lm command. Let us say we want to see whether some of the coefficients of the model

y = β0 + β1 x1 + β2 x1^2 + β3 x1^3 + β4 x1^4 + β5 x1^5 + ε

are zero. After fitting the model using

lm5 = lm(weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5))

summary(lm5) will produce a table that will have information about significance for each estimated parameter:

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 9.144e+04 8.300e+04 1.102 0.299

height -6.947e+03 6.418e+03 -1.083 0.307

I(height^2) 2.107e+02 1.982e+02 1.063 0.315

I(height^3) -3.187e+00 3.058e+00 -1.042 0.325

I(height^4) 2.403e-02 2.355e-02 1.020 0.334

I(height^5) -7.224e-05 7.246e-05 -0.997 0.345

Residual standard error: 0.2245 on 9 degrees of freedom

Multiple R-squared: 0.9999, Adjusted R-squared: 0.9998

F-statistic: 1.335e+04 on 5 and 9 DF, p-value: < 2.2e-16

Page 30: Multiple comparisons Basics of linear regression Model selection.

Model selection using t-distribution

More details of this table will be described in the tutorial. Here we are interested in not rejecting the null hypothesis (that a particular parameter is 0). If we are using the t-distribution to make such a decision then we take the largest p-value and remove the corresponding coefficient from the model. In this case it is the coefficient corresponding to height^5. After removing it we can fit the reduced model. We use:

lm4 = lm(formula = weight ~ height + I(height^2) + I(height^3) + I(height^4))

summary(lm4)

It will produce significance levels for the coefficients:

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 8.824e+03 4.550e+03 1.939 0.0812 .

height -5.552e+02 2.814e+02 -1.973 0.0768 .

I(height^2) 1.319e+01 6.515e+00 2.024 0.0705 .

I(height^3) -1.389e-01 6.693e-02 -2.076 0.0646 .

I(height^4) 5.507e-04 2.574e-04 2.140 0.0581 .

Now we cannot remove any more coefficients: none of the p-values is large enough to say that the corresponding coefficient is insignificant.

Page 31: Multiple comparisons Basics of linear regression Model selection.

Model selection using t-distribution

The summary command after the lm command allows a simple way of model selection. It would work as follows:

1) Specify the full model (this may be a problem: a model can always be made more complex)

2) Fit the model using lm

3) Use summary and remove the coefficient with the highest p-value (say, values that are more than 0.1; this threshold is arbitrary and depends on how much one wants to reduce the model)

4) Repeat 2) and 3) until the model can no longer be reduced

Warning: using multiple hypothesis tests may produce misleading results. For example, if we are working at a significance level α and we test 10 hypotheses, then the probability that at least one hypothesis appears significant when it is not is 1-(1-α)^10. If α=0.05 then this probability is 0.40, and if we are testing 100 hypotheses then this probability is 0.99. Note that when we used summary above we deliberately did not use rejection of the null hypothesis; we used high p-values not to reject the null hypothesis.

Page 32: Multiple comparisons Basics of linear regression Model selection.

Akaike’s information criterion

Another model selection tool is Akaike’s information criterion (AIC). This technique is based on the value of the likelihood at the maximum and uses the number of parameters to penalise complex models. Let us recall the maximum likelihood technique. The likelihood is proportional to the conditional distribution of the observations (outputs, responses) given the parameters of the model:

L(β; y) ∝ P(y | β)

In maximum likelihood estimation we maximise the likelihood function wrt the parameters of the model. By making the model more and more complex we can get larger and larger likelihood values. But is the more complex model justified?

Once we have maximised the likelihood function we can calculate the value of the likelihood at the maximum. Akaike’s information criterion as implemented in R uses the following formula:

AIC = −2 log(L_max) + c·p

where c is a constant (it may depend on the number of observations) that defines various forms of the information criterion, and p is the effective number of parameters in the model. In a linear model p is equal to the number of parameters. When c = 2 we have the classic AIC and when c = log(n) we have the Bayesian information criterion (BIC).

Page 33: Multiple comparisons Basics of linear regression Model selection.

Akaike’s information criterion

To apply AIC to the case of linear models we need to specify the likelihood function and find its value at the maximum. We can consider linear models with least-squares minimisation as a special case of maximum likelihood:

L(β, σ) = Π_i (1/((2π)^0.5 σ_i)) exp(−(y_i − x_i'β)^2 / (2σ_i^2))

where σ_i is the standard deviation of the i-th observation. When all σ_i are equal to each other (denote them σ) and they are unknown, then we can write:

log L(β, σ) = −(n/2) log(2πσ^2) − (y − Xβ)'(y − Xβ) / (2σ^2)

so maximising over β is equivalent to least squares. Once we have found the maximum wrt β we can carry on and find the maximum wrt σ:

σ̂^2 = RSS/n,   RSS = (y − Xβ̂)'(y − Xβ̂)

Note that the maximum likelihood estimator of the variance is not unbiased. The unbiased estimator is

s^2 = RSS/(n − p)

Page 34: Multiple comparisons Basics of linear regression Model selection.

Akaike’s information criterion

Let us find the value of the likelihood at the maximum. We will have:

−2 log(L_max) = n log(RSS/n) + n log(2π) + n

If we put it into the expression for AIC we will have:

AIC = n log(RSS/n) + n log(2π) + n + c·p

In model selection the number of observations does not change, so, dropping the constant terms, AIC can be written as:

AIC = n log(RSS/n) + c·p

If we use a more complex model then RSS will be smaller. On the other hand, a model with a smaller number of parameters may have better predictive power. AIC attempts to combine these two conflicting ideas.

When comparing two models, the model with the smaller AIC is considered to be better.
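A minimal sketch checking this formula against extractAIC, which for lm objects uses the same n·log(RSS/n) convention (constants dropped):

lm2 <- lm(weight ~ height + I(height^2), data = women)
n   <- nrow(women)
rss <- sum(residuals(lm2)^2)
p   <- length(coef(lm2))
n * log(rss / n) + 2 * p    # classic AIC (c = 2), constants dropped
extractAIC(lm2)             # returns (effective number of parameters, AIC)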

Page 35: Multiple comparisons Basics of linear regression Model selection.

Akaike’s information criterion in R

There are a number of functions in R that use AIC. The simplest of them is extractAIC. This function gives the number of parameters as well as the AIC. Another function is step. In its simplest form this function removes one parameter at a time and calculates the AIC. If removal of a parameter decreases the AIC then this parameter is removed from further consideration. The procedure is applied iteratively. The following slide shows the result of the step function.

Page 36: Multiple comparisons Basics of linear regression Model selection.

Akaike’s information criterion

Start: AIC=-40.48

weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5)

Df Sum of Sq RSS AIC

- I(height^5) 1 0.050 0.504 -40.910

- I(height^4) 1 0.052 0.506 -40.840

- I(height^3) 1 0.055 0.508 -40.773

- I(height^2) 1 0.057 0.510 -40.708

- height 1 0.059 0.513 -40.646

<none> 0.454 -40.482

Step: AIC=-40.91

weight ~ height + I(height^2) + I(height^3) + I(height^4)

Df Sum of Sq RSS AIC

<none> 0.504 -40.910

- height 1 0.196 0.700 -37.979

- I(height^2) 1 0.206 0.710 -37.759

- I(height^3) 1 0.217 0.721 -37.536

- I(height^4) 1 0.231 0.734 -37.256

Call:

lm(formula = weight ~ height + I(height^2) + I(height^3) + I(height^4))

Coefficients:

(Intercept) height I(height^2) I(height^3) I(height^4)

8.824e+03 -5.552e+02 1.319e+01 -1.389e-01 5.507e-04

Stage 1: The procedure removes one parameter at a time. Removal of height^5 reduces the AIC.

Stage 2: The procedure tries to reduce the number of parameters further, but no removal improves the AIC (<none> has the smallest AIC), so the current model is kept.

Final stage: The “best” model is reported.

Page 37: Multiple comparisons Basics of linear regression Model selection.

Akaike’s information criterion: further improvements

There are several additional forms of AIC. They try to correct for small samples. For example, the corrected AIC (AICc) has the expression:

AICc = AIC + 2p(p+1)/(n − p − 1)

R does not produce AICc directly (as far as I am aware). However, it can be calculated using, for example:

extractAIC(lm4)+2*p*(p+1)/(n-p-1)

where n is the number of observations and p is the number of parameters. For example, for the fourth-order polynomial fit of the weight–height relationship we can use:

lm4=lm(weight ~ height + I(height^2) + I(height^3) + I(height^4))

extractAIC(lm4)+2*5*6/(15-5-1)

It will produce the value -34.24 and for the reduced model

lm3=lm(weight ~ height + I(height^2) + I(height^3))

extractAIC(lm3)+2*4*5/(15-4-1)

It will produce -33.26. Since the more complex model has the lower AICc, it is preferred.
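A small helper (ours, not from the slides) wrapping the calculation above so that it can be applied to any lm fit:

aicc <- function(fit) {
  n <- length(residuals(fit))
  p <- extractAIC(fit)[1]                              # effective number of parameters
  extractAIC(fit)[2] + 2 * p * (p + 1) / (n - p - 1)   # AICc = AIC + small-sample correction
}
aicc(lm(weight ~ height + I(height^2) + I(height^3), data = women))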

Page 38: Multiple comparisons Basics of linear regression Model selection.

AIC, BIC

Both AIC and BIC attempt to produce the smallest model that describes the observations as well as possible. BIC penalises complexity more heavily than AIC (it prefers simpler models). In applications it is generally better to use common sense and look at the parameters very carefully: if you can find an interpretation of all terms and reasons for them being there, then they may be valid. Even when the “best” model has been selected, it is worth returning to model selection if new observations become available.

Absolute values of AIC (or BIC) are not important; differences between the AICs of different models are what matter.

AIC (and BIC) can only be used to compare models fitted to the same observations. If the observations change (for example, a few outliers are removed or new observations become available) then AIC should be recomputed for all models.

Page 39: Multiple comparisons Basics of linear regression Model selection.

Cross-validation

If we have a sample of observations, can we use this sample to choose among given models? Cross-validation attempts to reduce overfitting and thus helps model selection.

Description of cross-validation. We have a sample of size n, (y_i, x_i):

• Divide the sample into K roughly equal-size parts.

• For the k-th part, estimate the parameters using the K-1 parts excluding the k-th part. Calculate the prediction error for the k-th part.

• Repeat this for all k = 1, 2, ..., K, and combine all prediction errors to get the cross-validation prediction error.

If K = n then we have the leave-one-out cross-validation technique. Let us denote the estimate at the k-th step by β̂_k (it is a vector of parameters). Let the k-th subset of the sample be A_k and the number of points in this subset be N_k. Then the prediction error per observation is:

CV = (1/n) Σ_(k=1..K) Σ_(i in A_k) (y_i − f(x_i; β̂_k))^2

Then we choose the function that gives the smallest prediction error. We can expect that this function will also give the smallest prediction error on future observations.

This technique is widely used in modern statistical analysis. It is not restricted to least squares; instead of least squares we could use other techniques such as maximum likelihood, Bayesian estimation or M-estimation.
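A minimal sketch of K-fold cross-validation written out by hand for the simple linear model; K = 5 and the random fold assignment are our choices, not from the slides:

set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(women)))   # random fold labels
cv.err <- 0
for (k in 1:K) {
  train <- women[folds != k, ]
  test  <- women[folds == k, ]
  fit   <- lm(weight ~ height, data = train)          # estimate without the k-th part
  cv.err <- cv.err + sum((test$weight - predict(fit, newdata = test))^2)
}
cv.err / nrow(women)                                  # prediction error per observation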

Page 40: Multiple comparisons Basics of linear regression Model selection.

Cross-validation in R

There is no cross-validation method for lm in R, but there is a cross-validation function for glm (generalised linear models) in the boot package. This method requires that the responses are stored in the output from the model fitting. We can change the output from lm a little and then use the cross-validation for glm. The following sequence of commands will allow us to do this (it is just an example):

lm1 = lm(weight~height)

lm1$y = weight # Add responses to the output from lm

cv.glm(women, lm1)$delta # Do cross-validation (cv.glm is in the boot package)

The cv.glm function by default uses leave-one-out cross-validation.

Page 41: Multiple comparisons Basics of linear regression Model selection.

Cross-validation in R

Let us apply cross-validation to the weight~height relationship.

require(boot)

require(MASS)

lm1=lm(weight~height,data=women)

lm1$y=women$weight

cv.glm(women,lm1)$delta

This sequence of commands will produce the following, so the prediction error is 3.04:

[1] 3.040776 3.002450

lm1=lm(weight~height+I(height^2),data=women)

lm1$y=women$weight

cv.glm(women,lm1)$delta

It produces

[1] 0.2198339 0.2156849

The prediction error is reduced to 0.22. Further application will produce prediction errors 0.15, 0.13 and 0.18. The minimum seems to be when a polynomial of fourth order is used. However, the difference between the third and fourth order is not as substantial as that between the first and second order. Depending on circumstances we would choose a polynomial of second, third or fourth order.
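A sketch of how such a comparison could be automated; fitting with glm directly avoids patching the lm output, since cv.glm is designed for glm objects (a Gaussian glm is equivalent to lm):

library(boot)
cv.err <- sapply(1:5, function(k) {
  fit <- glm(weight ~ poly(height, k), data = women)  # Gaussian glm = linear model
  cv.glm(women, fit)$delta[1]                         # leave-one-out prediction error
})
cv.err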

Page 42: Multiple comparisons Basics of linear regression Model selection.

R commands

lm – model fit

summary (or summary.lm) – will give a summary of the model, including p-values for each model parameter

extractAIC – extracts Akaike’s information criterion

extractAIC – with k=log(n) as an argument it will produce the Bayesian information criterion

stepAIC (or step) – will use AIC for step-by-step model comparison

add1 – will add one parameter at a time and calculate the AIC

drop1 – will drop one parameter at a time and calculate the AIC

cv.glm – from the boot library. It is designed for objects produced by glm; however, a slight modification allows it to be used for lm objects also.


