Multiple Regression SPECIALISTICA


Multiple Linear Regression Model


    Outline

- Introduction
- The multiple linear regression model
- Underlying assumptions
- Parameter estimation and hypothesis testing
- Residual diagnostics
- Goodness of fit and model selection
- Examples in R


    Multiple linear regression model

Multiple linear regression is a generalization of the simple linear regression model to more than a single explanatory variable. The main goal is to model the relationship between a random response (or dependent) variable y and several explanatory variables x1, x2, ..., xp, usually assumed to be known or under the control of the investigator and not regarded as random variables at all. In practice, the observed values of the explanatory variables are subject to random variation, just like the response variable.

Multiple vs. multivariate regression
In multivariate regression more than one dependent variable is available.

Variables do not arise on an equal footing
This does not imply that some variables are more important than others (though they may be), but rather that there are dependent and explanatory variables (the latter also called predictor, controlled or independent variables).


    Details on the model

Let $y_i$ be the value of the response variable for the $i$th individual, and $x_{i1}, x_{i2}, \dots, x_{ip}$ be the $i$th individual's values on the p explanatory variables.

The multiple linear regression model is then given by

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \dots, n,$$

where the residual or error terms $\varepsilon_i$, $i = 1, \dots, n$, are assumed to be independent random variables having a normal distribution with mean zero and constant variance $\sigma^2$.

As a consequence, the distribution of the random response variable is also normal, $y \sim N(\mu, \sigma^2)$, with expected value $\mu$ given by

$$\mu = E(y \mid x_1, x_2, \dots, x_p) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p,$$

and variance $\sigma^2$.


    About the term linear

The parameters $\beta_k$, $k = 1, 2, \dots, p$, are known as regression coefficients and give the amount of change in the response variable associated with a unit change in the corresponding explanatory variable, conditional on the other explanatory variables in the model remaining unchanged.

Note
The term "linear" in multiple linear regression refers to the regression parameters, not to the response or explanatory variables. Consequently, models in which, for example, the logarithm of a response variable is modeled in terms of quadratic functions of some of the explanatory variables are included in this class of models. An example of a nonlinear model is

$$y_i = \beta_1 e^{\beta_2 x_{i1}} + \beta_3 e^{\beta_4 x_{i2}} + \varepsilon_i.$$


    Writing the model using matrices and vectors

The multiple regression model may be written using the matrix representation

$$y = X\beta + \varepsilon,$$

where $y = [y_1, y_2, \dots, y_n]'$, $\beta = [\beta_0, \beta_1, \dots, \beta_p]'$, $\varepsilon = [\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n]'$, and

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix}.$$

Each row in X (sometimes known as the design matrix) represents the values of the explanatory variables for one of the individuals in the sample, with the addition of a unity entry to account for the parameter $\beta_0$.
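As a small illustration (not part of the original slides), the design matrix, including the leading column of ones, can be inspected in R with model.matrix(); the data frame dat and the variables x1 and x2 below are made-up assumptions for the example.

# Illustrative sketch: building and inspecting the design matrix X in R.
# The data frame `dat` and its variables are simulated, used only for this example.
set.seed(1)
dat <- data.frame(y  = rnorm(10),
                  x1 = rnorm(10),
                  x2 = rnorm(10))

X <- model.matrix(y ~ x1 + x2, data = dat)  # n x (p + 1) matrix; first column is all 1s
head(X)
dim(X)                                      # 10 rows, p + 1 = 3 columns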


    Classical assumptions for regression analysis

- It is assumed that the relationship between the variables is linear.
- The design matrix X must have full rank (p + 1). Otherwise the parameter vector $\beta$ will not be identified; at most we will be able to narrow down its value to some linear subspace of $\mathbb{R}^{p+1}$. For this property to hold, we must have n > p, where n is the sample size.
- The independent variables are assumed to be error-free, that is, they are not contaminated with measurement errors.
- The error term is assumed to be a random variable with a mean of zero conditional on the explanatory variables.
- The errors are uncorrelated (no serial correlation) and have constant variance (homoscedasticity).
- The predictors must be linearly independent, i.e. it must not be possible to express any predictor as a linear combination of the others (otherwise multicollinearity problems arise).

These are sufficient (but not all necessary) conditions for the least-squares estimator to possess desirable properties, such as unbiasedness, consistency, and efficiency.


    What does multiple linear regression look like?

While simple linear regression draws the straight line that best fits the data (in the least-squares sense) in the (x, y) plane, multiple linear regression draws the plane (or hyperplane) that best fits the cloud of data points in a (p + 1)-dimensional space. With p predictors, the regression equation

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}, \qquad i = 1, 2, \dots, n,$$

defines a p-dimensional hyperplane in a (p + 1)-dimensional space that minimizes the sum of the squared distances (measured parallel to the y axis) between the hyperplane and the data points.


    Properties of LS estimators

    Gauss-Markov theorem

In a linear model in which the errors have expectation zero, are uncorrelated and have equal variances, the Best Linear Unbiased Estimators (BLUE) of the coefficients are the Least-Squares (LS) estimators.

More generally, the BLUE estimator of any linear combination of the coefficients is its least-squares estimator.

It is noteworthy that the errors are not assumed to be normally distributed, nor are they assumed to be independent (but only uncorrelated, a weaker condition), nor are they assumed to be identically distributed (but only homoscedastic, a weaker condition).


Parameter estimation

The Least-Squares (LS) procedure is used to estimate the parameters in the multiple regression model.

Assuming that $X'X$ is nonsingular, hence invertible, the LS estimator of the parameter vector $\beta$ is

$$\hat{\beta} = (X'X)^{-1} X'y.$$

This estimator has the following properties:

$$E(\hat{\beta}) = \beta, \qquad \operatorname{cov}(\hat{\beta}) = \sigma^2 (X'X)^{-1}.$$

The diagonal elements of the matrix $\operatorname{cov}(\hat{\beta})$ give the variances of the $\hat{\beta}_j$, whereas the off-diagonal elements give the covariances between pairs $\hat{\beta}_j, \hat{\beta}_k$. The square roots of the diagonal elements of the matrix are thus the standard errors of the $\hat{\beta}_j$.
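A minimal R sketch (not from the slides) of the closed-form LS estimator and its estimated covariance matrix, checked against lm(); the simulated data and object names are assumptions made for the example.

# Closed-form LS estimator compared with lm() on simulated data
set.seed(123)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                        # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'y
beta_hat

fit <- lm(y ~ x1 + x2)
coef(fit)                                    # same estimates as beta_hat

s2 <- sum(residuals(fit)^2) / (n - 2 - 1)    # estimate of sigma^2
s2 * solve(t(X) %*% X)                       # estimated cov(beta_hat); sqrt of the diagonal = standard errors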


    In details

How to obtain the LS estimator of the parameter vector
One method of estimation of the population parameters is ordinary least squares. This method finds the vector $\beta$ that minimizes the sum of squared residuals, i.e. the function $G(\beta)$ given by

$$G(\beta) = e'e = (y - X\beta)'(y - X\beta).$$

It follows that

$$G(\beta) = y'y + \beta'(X'X)\beta - 2\beta'X'y.$$

Minimization of this function results in a set of normal equations (one per parameter), which are solved to yield the parameter estimators. The minimum is found by setting the gradient to zero:

$$0 = \nabla G(\beta) = -2X'y + 2(X'X)\beta \;\Longrightarrow\; \hat{\beta} = (X'X)^{-1}X'y.$$


    Variance table

The regression analysis can be assessed using the following analysis of variance (ANOVA) table.

Table: ANOVA table

Source of Variation | Sum of Squares (SS)                            | Degrees of Freedom (df) | Mean Square
Regression          | $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ | $p$                     | MSR = SSR/p
Residual            | $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$     | $n - p - 1$             | MSE = SSE/(n - p - 1)
Total               | $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$       | $n - 1$                 |

where $\hat{y}_i$ is the predicted value of the response variable for the ith individual and $\bar{y}$ is the mean value of the response variable.


    Estimation and testing parameters

The mean square ratio MSR/MSE provides an F-test of the general hypothesis

$$H_0 : \beta_1 = \beta_2 = \dots = \beta_p = 0.$$

An estimate of $\sigma^2$ is provided by $s^2$, given by

$$s^2 = \frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

Under $H_0$, the mean square ratio has an F-distribution with $p$ and $n - p - 1$ degrees of freedom.

On the basis of the ANOVA table, we may calculate the multiple correlation coefficient $R^2$, related to the F-test, which gives the proportion of variance of the response variable accounted for by the explanatory variables. We will discuss it in more detail later on.
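As a small illustration (not part of the original slides), the overall F-test and s^2 can be computed by hand and checked against summary(); the built-in mtcars data and the chosen predictors are assumptions made for the example.

# Overall F-test "by hand" for a model fitted to the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); p <- 2

SSE <- sum(residuals(fit)^2)
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
SSR <- SST - SSE

s2   <- SSE / (n - p - 1)                      # residual mean square, estimate of sigma^2
Fval <- (SSR / p) / s2                         # MSR / MSE
pval <- pf(Fval, p, n - p - 1, lower.tail = FALSE)
c(F = Fval, p.value = pval)                    # matches the F-statistic line of summary(fit)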


    Test on a single parameter

Individual regression coefficients can be assessed by using the ratio $\hat{\beta}_j / SE(\hat{\beta}_j)$, although these ratios should only be used as rough guides to the significance or otherwise of the coefficients. Under the null hypothesis $H_0 : \beta_j = 0$, we have that

$$\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim t_{n-p-1}.$$

The problem of using t-statistics
t-statistics are conditional on which explanatory variables are included in the current model. As a consequence, the values of these statistics will change, as will the values of the estimated regression coefficients and their standard errors, as other variables are included in and excluded from the model.
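A short R sketch (not from the slides) of the t-ratios for individual coefficients, checked against summary(); the mtcars data and predictors are assumptions for the example.

# t-ratios beta_hat_j / SE(beta_hat_j) computed by hand
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); p <- 2

est  <- coef(fit)
se   <- sqrt(diag(vcov(fit)))                  # standard errors from the estimated cov(beta_hat)
tval <- est / se
pval <- 2 * pt(abs(tval), df = n - p - 1, lower.tail = FALSE)
cbind(est, se, tval, pval)                     # same columns as summary(fit)$coefficients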


    About nested models

Two models are nested if both contain the same terms and one has at least one additional term. For example, the model (a)

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$

is nested within model (b)

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon.$$

Model (a) is the reduced model and model (b) is the full model.

In order to decide whether the full model is better than the reduced one (i.e., does it contribute additional information about the association between y and the predictors?), we have to test the hypothesis $H_0 : \beta_4 = \beta_5 = 0$ against the alternative that at least one of the additional terms is $\neq 0$.


    Testing Nested Models

We indicate with $SSE_R$ the SSE of the reduced model and with $SSE_C$ the SSE of the complete model.

Since it is always true that $SSE_R \geq SSE_C$, the question becomes: is the drop in SSE from fitting the complete model large enough?

In order to compare a model with k parameters (reduced) with another one with k + q parameters (complete or full), hence in order to test the hypotheses

$$H_0 : \beta_{k+1} = \beta_{k+2} = \dots = \beta_{k+q} = 0 \quad \text{against} \quad H_1 : \text{at least one } \beta \neq 0,$$

an F test, defined as

$$F = \frac{(SSE_R - SSE_C)\,/\,\text{number of additional } \beta\text{'s}}{SSE_C\,/\,[\,n - (k + q + 1)\,]},$$

is used. Once an appropriate $\alpha$ level has been chosen, if $F \geq F_{\alpha,\nu_1,\nu_2}$, with $\nu_1 = q$ and $\nu_2 = n - (k + q + 1)$, then $H_0$ is rejected.
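A minimal R sketch (not from the slides) of this nested-model comparison via anova(), which performs exactly this kind of partial F-test; the mtcars data and the added quadratic terms are assumptions for the example.

# Partial F-test for nested models
reduced <- lm(mpg ~ wt + hp, data = mtcars)
full    <- lm(mpg ~ wt + hp + I(wt^2) + I(hp^2), data = mtcars)

anova(reduced, full)   # F = [(SSE_R - SSE_C)/q] / [SSE_C / (n - (k + q + 1))]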


Checking model assumptions: residuals and other regression diagnostics

To complete a regression analysis, we need to check assumptions such as constant variance and normality of the error terms, since violation of these assumptions may invalidate conclusions based on the regression analysis.

Diagnostic plots generally used when assessing model assumptions are discussed below.

- Residuals versus fitted values. If the fitted model is appropriate, the plotted points should lie in an approximately horizontal band across the plot. Departures from this appearance may indicate that the functional form of the assumed model is incorrect or, alternatively, that the variance is not constant.
- Residuals versus explanatory variables. Systematic patterns in these plots can indicate violations of the constant variance assumption or an inappropriate model form.
- Normal probability plot of the residuals. This plot checks the normal distribution assumption on which the statistical inference procedures are based.


Another regression diagnostic

Plot of the Cook's distances
A further diagnostic that is often very useful is an index plot of the Cook's distances for each observation. This statistic is defined as follows:

$$D_k = \frac{1}{(p + 1)\, s^2} \sum_{i=1}^{n} \left( \hat{y}_{i(k)} - \hat{y}_i \right)^2,$$

where $\hat{y}_{i(k)}$ is the fitted value of the ith observation when the kth observation is omitted from the model. The value of $D_k$ measures the influence of the kth observation on the estimated regression coefficients. Values of $D_k$ greater than one suggest that the corresponding observation has undue influence on the estimated regression coefficients (cause for concern).
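A short R sketch (not from the slides) of an index plot of the Cook's distances using cooks.distance(); the mtcars data are an assumption for the example.

# Index plot of Cook's distances
fit <- lm(mpg ~ wt + hp, data = mtcars)

D <- cooks.distance(fit)
plot(D, type = "h", main = "Cook's distances", ylab = expression(D[k]))
abline(h = 1, lty = 2)        # rough "cause for concern" threshold
which(D > 1)                  # observations with undue influence, if any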


Principle of parsimony

In science, parsimony is the preference for the least complex explanation for an observation. One should always choose the simplest explanation of a phenomenon, the one that requires the fewest leaps of logic (Burnham and Anderson, 2002).

William of Occam suggested in the 14th century that one shave away all that is unnecessary, an aphorism known as Occam's razor.

Albert Einstein is supposed to have said: "Everything should be made as simple as possible, but no simpler."

According to Box and Jenkins (1970), the principle of parsimony should lead to a model with the smallest number of parameters that adequately represents the data.

Statisticians view the principle of parsimony as a bias versus variance tradeoff. Usually, the bias (of the parameter estimates) decreases and the variance (of the parameter estimates) increases as the dimension (number of parameters) of the model increases. All model selection methods are based on the principle of parsimony.


Multiple correlation coefficient

As previously said, the relation SST = SSR + SSE holds true. The squared multiple correlation coefficient is given by

$$R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$

- It is the proportion of variability in a data set that is accounted for by the statistical model.
- It gives a measure of the strength of the association between the independent (explanatory) variables and the one dependent variable.
- It can be any value from 0 to +1. The closer $R^2$ is to one, the stronger the linear association is. If it is equal to zero, then there is no linear association between the dependent variable and the independent variables.
- Another formulation of the F statistic for the entire model is

$$F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}.$$

Adjusting the multiple correlation coefficient

However, since the value of $R^2$ increases when increasing the number of regressors (overestimation problem), it is convenient to calculate the adjusted version of $R^2$, given by

$$R^2_{adj} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n}(y_i - \bar{y})^2 / (n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.$$

It derives from the following relations:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},$$

$$R^2_{adj} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)} = 1 - (1 - R^2)\,\frac{SST/(n - p - 1)}{SST/(n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.$$
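A small R sketch (not from the slides) computing R^2 and its adjusted version from the formulas above, checked against summary(); the mtcars data are an assumption for the example.

# R^2 and adjusted R^2 by hand
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); p <- 2

SSE <- sum(residuals(fit)^2)
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

R2     <- 1 - SSE / SST
R2.adj <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
c(R2 = R2, R2.adj = R2.adj)
summary(fit)$r.squared                         # should match R2
summary(fit)$adj.r.squared                     # should match R2.adj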

Akaike's and Bayesian (or Schwarz) Information Criteria

The log-likelihood for the normal distribution is given by

$$\ell = -\frac{n}{2}\left(\log(2\pi) + \log(SSE/n) + 1\right).$$

If the object under study is the multiple linear regression model with p explanatory variables and n units, then

$$AIC = -2\ell + 2K = -2\ell + 2(p + 1),$$

and

$$BIC = -2\ell + K\log(n) = -2\ell + (p + 1)\log(n).$$

The least squares case (from Burnham and Anderson, 2002)

    Computing AIC in the least squares case

If all models in the set assume normally distributed errors with a constant variance, then AIC can be easily computed from least-squares regression statistics as

$$AIC = n \log(\hat{\sigma}^2) + 2K = n \log(\hat{\sigma}^2) + 2(p + 1),$$

where

$$\hat{\sigma}^2 = \frac{SSE}{n} \quad \text{(the MLE of } \sigma^2\text{)}.$$

A common mistake when computing AIC is to take the estimate of $\sigma^2$ from the computer output, instead of computing the ML estimate above. Moreover, K is the total number of estimated regression parameters, including the intercept and $\sigma^2$.
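A small R sketch (not from the slides) of this least-squares AIC computation, compared with extractAIC(); the mtcars data are an assumption for the example. Note that stats::AIC() uses the full normal log-likelihood and also counts sigma^2 among the parameters, so for a given data set it differs from extractAIC() by an additive constant.

# AIC in the least-squares case, by hand
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); p <- 2

sigma2_mle <- sum(residuals(fit)^2) / n        # MLE of sigma^2 = SSE/n
n * log(sigma2_mle) + 2 * (p + 1)              # n log(sigma^2_hat) + 2K, with K = p + 1
extractAIC(fit)[2]                             # same value under the same K convention
AIC(fit)                                       # differs by an additive constant (see note above)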

Mallows Cp statistic

- Mallows' Cp statistic is defined as

  $$C_p = \frac{SSR_p}{s^2} - (n - 2p),$$

  where $SSR_p$ is the residual sum of squares from a regression model with a certain set of p - 1 of the explanatory variables, plus an intercept, and $s^2$ is the estimate of $\sigma^2$ from the model that includes all explanatory variables under consideration (a small computational sketch in R follows this list).
- $C_p$ is an unbiased estimator of the mean squared prediction error.
- If $C_p$ is plotted against p, the subsets of the variables ensuring a parsimonious model are those lying close to the line $C_p = p$.
- In this plot, the value p is (roughly) the contribution to $C_p$ from the variance of the estimated parameters, whereas the remaining $C_p - p$ is (roughly) the contribution from the bias of the model.
- The Cp plot is a useful device for evaluating the Cp values of a range of models (Mallows, 1973, 1995; Burnman, 1996).
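A small R sketch (not from the slides) of the Cp value for a single submodel, computed directly from the definition above; the mtcars data and the chosen subset are assumptions for the example.

# Mallows' Cp for one submodel
full <- lm(mpg ~ wt + hp + disp, data = mtcars)
sub  <- lm(mpg ~ wt + hp, data = mtcars)          # p - 1 = 2 predictors, so p = 3 parameters
n <- nrow(mtcars)
p <- length(coef(sub))                            # parameters in the submodel (incl. intercept)

s2_full <- summary(full)$sigma^2                  # estimate of sigma^2 from the full model
Cp <- sum(residuals(sub)^2) / s2_full - (n - 2 * p)
Cp                                                # values close to p suggest little bias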

To summarize

Rules of thumb
There are a number of measures based on the entire estimated equation that can be used to compare two or more alternative specifications and select the best one according to that specific criterion.

- $R^2_{adj}$: higher is better.
- AIC: lower is better.
- BIC: lower is better.
- Cp: low values indicate the best models to consider.

Model selection: three basic approaches

We recall that with p covariates, $M = 2^p - 1$ potential models can be obtained.

- Forward selection starts with an initial model that contains only a constant term and successively adds explanatory variables to the model from the pool of candidate variables, until a stage is reached where none of the candidate variables, if added to the current model, would contribute information that is statistically important concerning the expected value of the response. This is generally decided by comparing a function of the decrease in the residual sum of squares with a threshold value set by the investigator.
- Backward elimination starts with all the variables in the model and drops the least significant one at a time, until only significant variables are retained.
- Stepwise regression is a mixture of forward selection and backward elimination. It starts by performing a forward selection, but drops variables that are no longer significant after the introduction of new variables.

Multicollinearity (from Der and Everitt, 2006)

Multicollinearity occurs when variables are so highly correlated that it is difficult to come up with reliable estimates of their individual regression coefficients. It leads to the following problems.

- The variances of the regression coefficients are inflated. As a consequence, the fitted models are less stable and the parameter estimates become inaccurate. The power of significance tests for the regression coefficients decreases, thus leading to Type II errors.
- The $R^2$ of the estimated regression will be large even if none of the coefficients is significant, since the explanatory variables are largely attempting to explain much of the same variability in the response variable (see Dizney and Gromen, 1967).
- The effects of the explanatory variables are confounded due to their intercorrelations, hence it is difficult to determine the importance of a given explanatory variable.
- The removal of a single observation may largely affect the calculated coefficients.

How to overcome multicollinearity

- Examine the correlations between the explanatory variables (helpful but not infallible).
- Evaluate the variance inflation factors of the explanatory variables. The VIF for the jth variable is given by

  $$VIF_j = \frac{1}{1 - R_j^2},$$

  where $R_j^2$ is the square of the multiple correlation coefficient from the regression of the jth explanatory variable on the remaining explanatory variables. $VIF_j$ indicates the strength of the linear relationship between the jth variable and the remaining explanatory variables. A rough rule of thumb is that VIF > 10 gives some cause for concern (see the R sketch after this list).
- Combine in some way the explanatory variables that are highly correlated.
- Select one variable out of each set of correlated variables.
- Use more complex approaches, e.g. principal component regression (Jolliffe, 1982) and ridge regression (Chatterjee and Price, 2006; Brown, 1994).
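A small R sketch (not from the slides) of variance inflation factors computed directly from the definition above; the mtcars columns used are an assumption for the example (the faraway package loaded later in these slides also offers a vif() helper).

# VIF_j = 1 / (1 - R_j^2), computed by regressing each predictor on the others
X <- mtcars[, c("wt", "hp", "disp")]

vif_manual <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
  1 / (1 - r2)
})
vif_manual   # values above about 10 would give cause for concern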

Polynomial and dummy-variable regression

Polynomial regression is a generalization of linear regression in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-order polynomial.

Dummy variables (or indicator variables) are variables that take the values 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. They are useful to represent categorical variables (e.g., sex, employment status) and may be entered as regressors in a linear regression model, giving rise to dummy-variable regression (a short R sketch follows the references below).

http://socserv.mcmaster.ca/jfox/Courses/soc740/lecture-5.pdf
http://www.slideshare.net/lmarsh/dummy-variable-regression

Cohen, J. (1968) Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.
Harrell, F. E. (2001) Regression modeling strategies (chapter 2). http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/rms.pdf
Weisberg, S. (2005) Applied Linear Regression, 3rd edition. John Wiley & Sons, New York.
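A small R sketch (not from the slides) of polynomial terms and a dummy-coded factor entering the same lm() call; the mtcars data and the chosen variables are assumptions for the example.

# Polynomial and dummy-variable regression in one model
fit <- lm(mpg ~ poly(wt, 2) + factor(am), data = mtcars)   # am is a 0/1 variable treated as a factor
summary(fit)

head(model.matrix(fit))   # shows the dummy column created for factor(am)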

Practicing R software (1)

first.fitted = 0.988514333 + 0.422516384*1.1 - 0.001737381*1.2 + 0.716029046*1.4
model$fitted.values[1]
second.fitted = 0.988514333 + 0.422516384*2.3 - 0.001737381*3.4 + 0.716029046*5.6

    model$fitted.values[2]

    model1


    new


This data set concerns air pollution in the United States. For 41 cities in the United States the following variables were recorded.

- SO2: Sulphur dioxide content of air in micrograms per cubic meter
- Temp: Average annual temperature in degrees Fahrenheit
- Manuf: Number of manufacturing enterprises employing 20 or more workers
- Pop: Population size (1970 census) in thousands
- Wind: Average annual wind speed in miles per hour
- Precip: Average annual precipitation in inches
- Days: Average number of days with precipitation per year

Air Pollution in U.S. Cities. From Biometry, 2/E, R. R. Sokal and F. J. Rohlf. Copyright © 1969, 1981 by W. H. Freeman and Company.

Analyses for usair dataset (1)

    > library(leaps)

> library(xtable)
> usair attach(usair)

    > Neg.temp newdata attach(newdata)

    > model1 formula(model1)

    SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip + Days

    > names(model1)

    [1] "coefficients" "residuals" "effects" "rank"

    [5] "fitted.values" "assign" "qr" "df.residual"

    [9] "xlevels" "call" "terms" "model"

    > summary(model1)

    > xtable(summary(model1))

Analyses for usair dataset (2)

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  111.7285     47.3181     2.36    0.0241
Neg.temp       1.2679      0.6212     2.04    0.0491
Manuf          0.0649      0.0157     4.12    0.0002
Pop           -0.0393      0.0151    -2.60    0.0138
Wind          -3.1814      1.8150    -1.75    0.0887
Precip         0.5124      0.3628     1.41    0.1669
Days          -0.0521      0.1620    -0.32    0.7500

# R^2 for model1
# Residual standard error: 14.64 on 34 degrees of freedom
# Multiple R-squared: 0.6695, Adjusted R-squared: 0.6112
# F-statistic: 11.48 on 6 and 34 DF, p-value: 5.419e-07

> model1$coefficients
 (Intercept)     Neg.temp        Manuf          Pop         Wind       Precip
111.72848064   1.26794109   0.06491817  -0.03927674  -3.18136579   0.51235896
        Days
 -0.05205019

Analyses for usair dataset (3)

For example, the fitted value for Atlanta (city #10) is then

> Atlanta = 111.72848064 + 1.26794109*(-61.5) + 0.06491817*368 - 0.03927674*497 +
+   -3.18136579*9.1 + 0.51235896*48.34 - 0.05205019*115
> Atlanta
[1] 27.95068
> model1$fitted.values
         1          2          3          4          5          6          7
 -3.789143  28.674536  20.542095  28.694105  56.991475  31.367410  22.078815
         8          9         10         11         12         13         14
  6.927136  11.623630  27.950681 110.542989  22.241636  23.270039   8.194604
        15         16         17         18         19         20         21
 26.111320  20.673981  31.866991  35.121805  45.166185  29.436860  49.432339
        22         23         24         25         26         27         28
 20.235861   5.760100  31.744282  29.618343  46.003663  59.770345  27.116241
        29         30         31         32         33         34         35
 59.682401  30.928374  45.242416  21.415035  28.911046  15.931135   7.715774
        36         37         38         39         40         41
 24.506704  13.806381  31.143835  32.118061  33.717942  33.512572

Analyses for usair dataset (4)

> anova(model1)
> xtable(anova(model1))

            Df   Sum Sq  Mean Sq  F value  Pr(>F)
Neg.temp     1  4143.33  4143.33    19.34  0.0001
Manuf        1  7230.76  7230.76    33.75  0.0000
Pop          1  2125.16  2125.16     9.92  0.0034
Wind         1   447.90   447.90     2.09  0.1573
Precip       1   785.38   785.38     3.67  0.0640
Days         1    22.11    22.11     0.10  0.7500
Residuals   34  7283.27   214.21

Analyses for usair dataset (5)

> step(model1)
Start:  AIC=226.37
SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip + Days

            Df  Sum of Sq      RSS    AIC
- Days       1       22.1   7305.4  224.5
<none>                      7283.3  226.4
- Precip     1      427.3   7710.6  226.7
- Wind       1      658.1   7941.4  227.9
- Neg.temp   1      892.5   8175.8  229.1
- Pop        1     1443.1   8726.3  231.8
- Manuf      1     3640.1  10923.4  241.0

Step:  AIC=224.49
SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip

            Df  Sum of Sq      RSS    AIC
<none>                      7305.4  224.5
- Wind       1      636.1   7941.5  225.9
- Precip     1      785.4   8090.8  226.7
- Pop        1     1447.5   8752.9  229.9
- Neg.temp   1     1517.4   8822.8  230.2
- Manuf      1     3636.8  10942.1  239.1

Call:
lm(formula = SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip)

Coefficients:
(Intercept)     Neg.temp        Manuf          Pop         Wind       Precip
  100.15245      1.12129      0.06489     -0.03933     -3.08240      0.41947

Analyses for usair dataset (6)

    > xtable(step(model1))

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  100.1525     30.2752     3.31    0.0022
Neg.temp       1.1213      0.4159     2.70    0.0107
Manuf          0.0649      0.0155     4.17    0.0002
Pop           -0.0393      0.0149    -2.63    0.0125
Wind          -3.0824      1.7656    -1.75    0.0896
Precip         0.4195      0.2162     1.94    0.0605

# R^2 for the selected model by the stepwise procedure
# Residual standard error: 14.45 on 35 degrees of freedom
# Multiple R-squared: 0.6685, Adjusted R-squared: 0.6212
# F-statistic: 14.12 on 5 and 35 DF, p-value: 1.409e-07

    > step(model1, direction="backward")

    > step(model1, direction="forward")

    > step(model1, direction="both")

Analyses for usair dataset (7)

    > AIC(model1)

    [1] 344.7232

> ## a version of BIC or Schwarz BC:
> AIC(model1, k = log(nrow(newdata)))

    [1] 358.4318

    > extractAIC(model1)

    [1] 7.0000 226.3703

    > y x leapcp leapcp

> library(faraway)   ## loading the faraway package

> Cpplot(leapcp)

> leapadjr maxadjr(leapadjr,8)
  1,2,3,4,5  1,2,3,4,5,6  1,2,3,4,6    1,2,3,5    1,2,3,4  1,2,3,5,6
      0.621        0.611      0.600      0.600      0.592      0.588
    1,2,3,6      2,3,4,6
      0.588        0.587

    Statistical methods and applications (2009) Multiple Linear Regression Model 40 / 63

    Analyses for usair dataset (8)


[Cp plot for the usair data: Mallows Cp versus the number of parameters p; each point is labelled by its covariate subset, with 1 = Neg.temp, 2 = Manuf, 3 = Pop, 4 = Wind, 5 = Precip, 6 = Days]

Note
The Cp plot shows that the minimum value of the Cp index occurs for the combination #12345; hence the variables selected for a parsimonious model are Neg.temp, Manuf, Pop, Wind and Precip.
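For reference, a minimal sketch of how the Cp value of a candidate subset can be computed by hand (it assumes model2 denotes the five-covariate fit selected by step above; p counts the intercept):

> n  <- length(residuals(model1))
> s2 <- summary(model1)$sigma^2                  # error variance from the full model
> p  <- length(coef(model2))                     # parameters in the submodel
> sum(residuals(model2)^2) / s2 - (n - 2 * p)    # Mallows Cp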

    Statistical methods and applications (2009) Multiple Linear Regression Model 41 / 63

    Analyses for usair dataset (9)

    > plot.lm(model1)


> qqnorm(model1$res)
> qqline(model1$res)
> shapiro.test(model1$res)

        Shapiro-Wilk normality test
data:  model1$res
W = 0.923, p-value = 0.008535

> hist(model1$res, 15)
> plot.lm(model1)

[Normal Q-Q plot of the model1 residuals: theoretical quantiles versus sample quantiles]

    Statistical methods and applications (2009) Multiple Linear Regression Model 42 / 63

    Analyses for usair dataset (9)

[Histogram of the model1 residuals (model1$res), as produced by hist(model1$res, 15)]

    Statistical methods and applications (2009) Multiple Linear Regression Model 42 / 63

    Brief note on influential points

    The term refers to points that make a lot of difference in the regression analysis.


    They have two characteristics.

They are outliers, i.e. observations lying outside the overall pattern of the distribution. They are often indicative either of measurement error or of a model that is not appropriate for the data.

They are high-leverage points, i.e. they exert a great deal of influence on the path of the fitted equation, since their values of the x variables are far from the mean of the x's.

How to deal with outliers
When performing a regression analysis, it is often best to discard outliers before computing the line of best fit. This is particularly true of outliers in the x direction, since these points may greatly influence the result.

With reference to the usair data...
In the usair data set there is at least one city that should be considered an outlier. On the Manuf variable, e.g., Chicago, with a value of 3344, has about twice as many manufacturing enterprises employing 20 or more workers as the city with the second highest number (Philadelphia). Philadelphia and Phoenix may also be suspects in this sense.
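A minimal sketch of the usual numerical checks on the full fit model1; the 2p/n and 4/n cut-offs are common rules of thumb, not part of the original analysis:

> h <- hatvalues(model1)          # leverage: diagonal of the hat matrix
> d <- cooks.distance(model1)     # Cook's distance
> p <- length(coef(model1)); n <- length(h)
> which(h > 2 * p / n)            # high-leverage observations
> which(d > 4 / n)                # potentially influential observations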

    Statistical methods and applications (2009) Multiple Linear Regression Model 43 / 63

    Analyses for usair dataset (10)

> rstand <- rstandard(model1)


> plot(rstand, main = "Standardized Errors")
> abline(h = 2)
> abline(h = -2)

> outlier <- rstand[abs(rstand) > 2]
> identify(1:length(rstand), rstand, names(rstand))
# we highlight the values of the standardized errors lying outside the 95%
# interval of the normal distribution, which may be considered anomalous

[Index plot of the standardized residuals (rstand), titled "Standardized Errors", with horizontal reference lines at ±2]

    Statistical methods and applications (2009) Multiple Linear Regression Model 44 / 63

    Analyses for usair dataset (11)

> data3      # usair data with the outlying cities removed


> attach(data3)
> model3 <- lm(SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip + Days)   # full model refitted on data3
> summary(model3)

> qqnorm(model3$res)
> qqline(model3$res)
> shapiro.test(residuals(model3))

        Shapiro-Wilk normality test
data:  residuals(model3)
W = 0.9723, p-value = 0.4417

[Normal Q-Q plot of the model3 residuals: theoretical quantiles versus sample quantiles]

    Statistical methods and applications (2009) Multiple Linear Regression Model 45 / 63

    Analyses for usair dataset (11)

[Index plot of the standardized residuals of model3 (rstand3), titled "Standardized Errors"]

    Statistical methods and applications (2009) Multiple Linear Regression Model 45 / 63

    Bodyfat: dataset description


A data frame containing the estimates of the percentage of body fat determined by underwater weighing and various body circumference measurements for 252 males, ages 21 to 81 (Johnson, 1996).

    Response variable is y = 1/density, like in Burnham and Anderson (2002).

    13 potential predictors are age, weight, height, and 10 body circumferencemeasurements.

    Selection of the best model using AIC, Mallows Cp and adjusted R2 criteriathrough stepwise regression.
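A minimal sketch of the selection step; the data frame name bodyfat and the construction of the response as 1/density are assumptions consistent with the description above:

> full <- lm(I(1/density) ~ . - density, data = bodyfat)
> step(full)                           # AIC-based stepwise search
> step(full, k = log(nrow(bodyfat)))   # heavier, BIC-type penalty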

    Statistical methods and applications (2009) Multiple Linear Regression Model 46 / 63

Bodyfat: variables involved

    1 density Density from underwater weighing (gm/cm3)


    2 age Age (years)

    3 weight Weight (lbs)

    4 height Height (inches)

    5 neck Neck circumference (cm)

    6 chest Chest circumference (cm)

7 abdomen Abdomen circumference (cm)
8 hip Hip circumference (cm)

    9 thigh Thigh circumference (cm)

    10 knee Knee circumference (cm)

11 ankle Ankle circumference (cm)
12 biceps Biceps (extended) circumference (cm)

    13 forearm Forearm circumference (cm)

    14 wrist Wrist circumference (cm)

    Statistical methods and applications (2009) Multiple Linear Regression Model 47 / 63

    Full model

             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.8748     0.0359   24.35   0.0000
age            0.0001     0.0001    1.56   0.1198
weight        -0.0002     0.0001   -1.91   0.0580
height        -0.0002     0.0002   -0.79   0.4314
neck          -0.0009     0.0005   -1.97   0.0504
chest         -0.0001     0.0002   -0.53   0.5993
abdomen        0.0020     0.0002   11.43   0.0000
hip           -0.0005     0.0003   -1.54   0.1261
thigh          0.0005     0.0003    1.71   0.0892
knee           0.0000     0.0005    0.01   0.9893
ankle          0.0006     0.0005    1.24   0.2144
biceps         0.0005     0.0004    1.41   0.1606
forearm        0.0009     0.0004    2.22   0.0276
wrist         -0.0036     0.0011   -3.23   0.0014

    > model11$r.squared

    [1] 0.7424321

    > model11$adj.r.squared

    [1] 0.7283632

    Statistical methods and applications (2009) Multiple Linear Regression Model 48 / 63

    Some diagnostic plots

> qqnorm(model1$res)
> qqline(model1$res)
> shapiro.test(residuals(model1))

        Shapiro-Wilk normality test
data:  residuals(model1)
W = 0.9924, p-value = 0.2232

[Normal Q-Q plot of the residuals of the full body fat model: theoretical quantiles versus sample quantiles]

    Statistical methods and applications (2009) Multiple Linear Regression Model 49 / 63

    Some diagnostic plots

[Histogram of residuals(model1) for the full body fat model]

    Statistical methods and applications (2009) Multiple Linear Regression Model 49 / 63

    Some diagnostic plots

[Index plot of the standardized residuals (rstand) of the full body fat model, titled "Standardized Errors"]

    Statistical methods and applications (2009) Multiple Linear Regression Model 49 / 63

    Finding the best covariates subset (1)

> step(model1)                # model2 below denotes the fit selected by step()
> extractAIC(model1)
[1]    14.00 -2365.27
> extractAIC(model2)
[1]     9.000 -2370.712

             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.8644     0.0244   35.49   0.0000
age            0.0001     0.0001    1.73   0.0848
weight        -0.0002     0.0001   -2.56   0.0111
neck          -0.0010     0.0005   -2.04   0.0421
abdomen        0.0020     0.0001   13.30   0.0000
hip           -0.0004     0.0003   -1.51   0.1320
thigh          0.0007     0.0003    2.56   0.0109
forearm        0.0011     0.0004    2.76   0.0063
wrist         -0.0033     0.0011   -3.09   0.0023

# Residual standard error: 0.008903 on 243 degrees of freedom
# Multiple R-squared: 0.7377, Adjusted R-squared: 0.7291
# F-statistic: 85.44 on 8 and 243 DF, p-value: < 2.2e-16

    Statistical methods and applications (2009) Multiple Linear Regression Model 50 / 63

    Finding the best covariates subset (2)

> leapcp <- leaps(x, y)   # x: matrix of the 13 covariates, y = 1/density
> leaps(x, y, method = "r2", nbest = 1)     #max
> leaps(x, y, method = "adjr2", nbest = 1)  #min
> library(faraway)        ## load the faraway package
> Cpplot(leapcp)

[Cp plot for the body fat data: Mallows Cp versus the number of parameters p; each point is labelled by its covariate subset, using the covariate numbering from the variable list]

    Statistical methods and applications (2009) Multiple Linear Regression Model 51 / 63

    AIC, adjusted R2 and Mallows Cp criteria


    Using AIC criterion the best model is

    y~age+weight+neck+abdomen+hip+thigh+forearm+wrist

    AIC= -2370.71

as we can find in Burnham and Anderson (2002), while using the adjusted R2 criterion the best model is a model with 10 covariates

    y~age+weight+neck+abdomen+hip+thigh+ankle+biceps

    +forearm+wrist

Finally, using the Mallows Cp criterion, the best model is a model with 8 covariates

    y~age+weight+neck+abdomen+hip+thigh+forearm+wrist
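The three selections can be reproduced from the same exhaustive search; a sketch, assuming x is the 13-column matrix of predictors (with column names) and y = 1/density:

> library(leaps)
> cpres  <- leaps(x, y, method = "Cp")
> adjres <- leaps(x, y, method = "adjr2")
> colnames(x)[cpres$which[which.min(cpres$Cp), ]]        # minimum-Cp subset
> colnames(x)[adjres$which[which.max(adjres$adjr2), ]]   # maximum adjusted-R^2 subset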

    Statistical methods and applications (2009) Multiple Linear Regression Model 52 / 63

Pollution: another lm example


In this data set, CO2 emissions (metric tons per capita) measured in 116 countries are related to other variables such as the following (a sketch of the full fit is given after the list):

    1 Energy use (kg of oil equivalent per capita)

    2 Export of goods and services (% of GDP)

    3 Gross Domestic Product (GDP) growth (annual %)

    4 Population growth (annual %)

    5 Annual deforestation (% of change)

    6 Gross National Income (GNI), Atlas method (current US$)
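A sketch of the corresponding full fit, shown on the next slide; the data frame name pollution and the response name CO2 are assumptions, while the covariate names match the output below:

> mod1 <- lm(CO2 ~ energy + exports + GDPgrowth + popgrowth + deforestation + GNI,
+            data = pollution)
> summary(mod1)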

    Statistical methods and applications (2009) Multiple Linear Regression Model 53 / 63

    Full model


               Estimate Std. Error t value Pr(>|t|)
(Intercept)     -0.2501     0.5731   -0.44   0.6634
energy           0.0022     0.0002   14.43   0.0000
exports          0.0191     0.0099    1.92   0.0568
GDPgrowth        0.0036     0.0836    0.04   0.9654
popgrowth       -0.0011     0.1949   -0.01   0.9955
deforestation   -0.0995     0.1615   -0.62   0.5391
GNI             -0.0001     0.0000   -2.27   0.0254

    > mod11$r.squared

    [1] 0.8059332

> mod11$adj.r.squared
[1] 0.7952506

    Statistical methods and applications (2009) Multiple Linear Regression Model 54 / 63

    Some diagnostic plots

> qqnorm(mod1$res)
> qqline(mod1$res)
> shapiro.test(residuals(mod1))

        Shapiro-Wilk normality test
data:  residuals(mod1)
W = 0.7228, p-value = 1.673e-13

# observations 6, 9, 48, 89, 97 and 104 correspond to the countries Australia,
# Bahrain, Iceland, Russian Federation, Sudan and Togo

[Normal Q-Q plot of the mod1 residuals: theoretical quantiles versus sample quantiles]

    Statistical methods and applications (2009) Multiple Linear Regression Model 55 / 63

    Some diagnostic plots

[Histogram of the mod1 residuals (mod1$res)]

    Statistical methods and applications (2009) Multiple Linear Regression Model 55 / 63

    Some diagnostic plots

[Index plot of the standardized residuals (rstand) of the pollution full model, titled "Standardized Errors"]

    Statistical methods and applications (2009) Multiple Linear Regression Model 55 / 63

    Finding the best covariates subset (1)

> stpmod1 <- step(mod1)                              # stepwise selection by AIC
> mod2 <- update(mod1, . ~ energy + exports + GNI)   # the selected model
> xtable(summary(mod2))

             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.3113     0.4391   -0.71   0.4798
energy         0.0023     0.0001   15.25   0.0000
exports        0.0194     0.0090    2.16   0.0331
GNI           -0.0001     0.0000   -2.34   0.0208

> extractAIC(mod1)
[1]   7.0000 213.2029
> extractAIC(mod2)
[1]   4.0000 207.6525

The same model is selected using the adjusted R2 and Mallows Cp criteria, as we can find in Ricci (2006).

    Statistical methods and applications (2009) Multiple Linear Regression Model 56 / 63

    Finding the best covariates subset (2)

> adjr2 <- leaps(x, y, method = "adjr2")   # x: matrix of the six covariates, y: CO2 emissions
> maxadjr(adjr2, 8)

    1,2,6   1,2,5,6   1,2,4,6   1,2,3,6 1,2,3,5,6 1,2,4,5,6
    0.800     0.799     0.798     0.798     0.797     0.797
1,2,3,4,6 1,2,3,4,5,6
    0.796       0.795

> leapcp <- leaps(x, y)
> Cpplot(leapcp)

[Cp plot for the pollution data: Mallows Cp versus p; points labelled by covariate subset, with 1 = energy, 2 = exports, 3 = GDPgrowth, 4 = popgrowth, 5 = deforestation, 6 = GNI]

    Statistical methods and applications (2009) Multiple Linear Regression Model 57 / 63

R Functions For Regression Analysis (Ricci, 2005) - Linear model


    All this material and more may be found at

    http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf.

anova: Compute an analysis of variance table for one or more linear model fits (stats)
coef: is a generic function which extracts model coefficients from objects returned by modeling functions; coefficients is an alias for it (stats)
coeftest: Testing Estimated Coefficients (lmtest)
confint: Computes confidence intervals for one or more parameters in a fitted model. Base R has a method for objects inheriting from class lm (stats)
deviance: Returns the deviance of a fitted model object (stats)
fitted: is a generic function which extracts fitted values from objects returned by modeling functions; fitted.values is an alias for it (stats)
formula: provides a way of extracting formulae which have been included in other objects (stats)
linear.hypothesis: Test Linear Hypothesis (car)
lm: is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (stats)
predict: Predicted values based on a linear model object (stats)
residuals: is a generic function which extracts model residuals from objects returned by modeling functions (stats)
summary.lm: summary method for class lm (stats)
vcov: Returns the variance-covariance matrix of the main parameters of a fitted model object (stats)
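A minimal usage sketch tying several of these extractor functions together (dat, y, x1 and x2 are hypothetical names):

> fit <- lm(y ~ x1 + x2, data = dat)
> summary(fit)                      # coefficient table, R^2, F statistic
> coef(fit); confint(fit)           # estimates and confidence intervals
> fitted(fit); residuals(fit)       # fitted values and residuals
> anova(fit)                        # sequential ANOVA table
> vcov(fit)                         # variance-covariance matrix of the estimates
> predict(fit, newdata = data.frame(x1 = 1, x2 = 2), interval = "prediction")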

    Statistical methods and applications (2009) Multiple Linear Regression Model 58 / 63

R Functions For Regression Analysis (Ricci, 2005) - Variable selection


All this material and more may be found at http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf.

add1: Compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)
AIC: Generic function calculating the Akaike information criterion for one or several fitted model objects for which a log-likelihood value can be obtained, according to the formula -2*log-likelihood + k*npar, where npar represents the number of parameters in the fitted model, and k = 2 for the usual AIC, or k = log(n) (n the number of observations) for the so-called BIC or SBC (Schwarz's Bayesian criterion) (stats)
Cpplot: Cp plot (faraway)
drop1: Compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)
extractAIC: Computes the (generalized) Akaike An Information Criterion for a fitted parametric model (stats)
maxadjr: Maximum Adjusted R-squared (faraway)
offset: An offset is a term to be added to a linear predictor, such as in a generalised linear model, with known coefficient 1 rather than an estimated coefficient (stats)
step: Select a formula-based model by AIC (stats)
update.formula: is used to update model formulae. This typically involves adding or dropping terms, but updates can be more general (stats)
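A minimal sketch of the single-term and stepwise tools above, for the same hypothetical fit as before (x3 is a further hypothetical column of dat):

> drop1(fit, test = "F")                         # effect of dropping each term in turn
> add1(fit, scope = ~ x1 + x2 + x3, test = "F")  # candidate single-term additions
> extractAIC(fit)                                # (edf, AIC) as used internally by step()
> step(fit, k = log(nobs(fit)))                  # stepwise search with a BIC-type penalty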

    Statistical methods and applications (2009) Multiple Linear Regression Model 59 / 63

R Functions For Regression Analysis (Ricci, 2005) - Diagnostics and graphics


All this material and more may be found at http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf.

cookd: Cook's Distances for Linear and Generalized Linear Models (car)
cooks.distance: Cook's distance (stats)
hat: diagonal elements of the hat matrix (stats)
hatvalues: diagonal elements of the hat matrix (stats)
ls.diag: Computes basic statistics, including standard errors, t- and p-values for the regression coefficients (stats)
rstandard: standardized residuals (stats)
rstudent: studentized residuals (stats)
vif: Variance Inflation Factor (car)
plot.lm: Four plots (selectable by which) are currently provided: a plot of residuals against fitted values, a Scale-Location plot of sqrt(|residuals|) against fitted values, a Normal Q-Q plot, and a plot of Cook's distances versus row labels (stats)
qq.plot: Quantile-Comparison Plots (car)
qqline: adds a line to a normal quantile-quantile plot which passes through the first and third quartiles (stats)
qqnorm: is a generic function the default method of which produces a normal QQ plot of the values in y (stats)
reg.line: Plot Regression Line (car)

    Statistical methods and applications (2009) Multiple Linear Regression Model 60 / 63

    Bibliography

AUSTIN, P. C. and TU, J. V. (2004). Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. Journal of Clinical Epidemiology 57, 1138-1146.

BROWN, P. J. (1994). Measurement, Regression and Calibration. Oxford: Clarendon.

BURNHAM, K. P. and ANDERSON, D. R. (2002). Model selection and multimodel inference: a practical information-theoretic approach. New York: Springer-Verlag.

BURMAN, P. (1996). Model fitting via testing. Statistica Sinica 6, 589-601.

CHATTERJEE, S. and HADI, A. S. (2006). Regression analysis by example, 4th edition. Wiley & Sons: Hoboken, New Jersey.

DER, G. and EVERITT, B. S. (2006). Statistical Analysis of Medical Data Using SAS. Chapman & Hall/CRC, Boca Raton, Florida.

DIZNEY, H. and GROMAN, L. (1967). Predictive validity and differential achievement in three MLA comparative foreign language tests. Educational and Psychological Measurement 27, 1127-1130.

    Statistical methods and applications (2009) Multiple Linear Regression Model 61 / 63

    Bibliography

EVERITT, B. S. (2005). An R and S-PLUS companion to multivariate analysis. Springer-Verlag: London.


FINOS, L., BROMBIN, C. and SALMASO, L. (2009). Adjusting stepwise p-values in generalized linear models. Accepted for publication in Communications in Statistics: Theory and Methods.

FREEDMAN, L. S., PEE, D. and MIDTHUNE, D. N. (1992). The problem of underestimating the residual error variance in forward stepwise regression. The Statistician 41, 405-412.

GABRIEL, K. R. (1971). The Biplot Graphic Display of Matrices with Application to Principal Component Analysis. Biometrika 58, 453-467.

HARSHMAN, R. A. and LUNDY, M. E. (2006). A randomization method of obtaining valid p-values for model changes selected post hoc. Poster presented at the Seventy-first Annual Meeting of the Psychometric Society, Montreal, Canada, June 2006. Available at http://publish.uwo.ca/harshman/imps2006.pdf.

JOHNSON, R. W. (1996). Fitting Percentage of Body Fat to Simple Body Measurements. Journal of Statistics Education 4 (e-journal), see http://www.amstat.org/publications/jse/toc.html

    Statistical methods and applications (2009) Multiple Linear Regression Model 62 / 63

    Bibliography


    JOLLIFFE, I. T. (1982). A note on the Use of Principal Components inRegression. Journal of the Royal Statistical Society, Series C 31, 300-303.

    JOLLIFFE, I. T. (1986) Principal component analysis. Springer: New York.

    MALLOWS, C.L. (1973) Some comments on Cp. Technometrics 15, 661-675.

MALLOWS, C. L. (1995). More comments on Cp. Technometrics 37, 362-372.

MORRISON, D. F. (1967). Multivariate statistical methods. McGraw-Hill: New York.

RICCI, V. (2006). Principali tecniche di regressione con R. See cran.r-project.org/doc/contrib/Ricci-regression-it.pdf.

    Statistical methods and applications (2009) Multiple Linear Regression Model 63 / 63

