13 Multiple Regression Part3
Date post: 06-Jul-2018
Upload: rama-dulce
    MULTIPLE REGRESSION – PART 3

    Topics Outline• Dummy Variables• Interaction Terms

    • Nonlinear Transformations– Quadratic Transformations– Logarithmic Transformations

    Dummy Variables

Thus far, the examples we have considered involved quantitative explanatory variables such as machine hours, production runs, price, and expenditures. In many situations, however, we must work with categorical explanatory variables such as gender (male, female), method of payment (cash, credit card, check), and so on. The way to include a categorical variable in the regression model is to represent it by a dummy variable.

A dummy variable (also called an indicator or 0–1 variable) is a variable with possible values 0 and 1. It equals 1 if a given observation is in a particular category and 0 if it is not.

If a given categorical explanatory variable has only two categories, then you can define one dummy variable xd to represent the two categories as

    xd = 1   if the observation is in category 1
         0   otherwise
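As a quick illustration, the 0–1 coding can also be produced programmatically; this is a minimal sketch on hypothetical sample values, the programmatic analogue of the Excel formula =IF(C2="Yes",1,0) used later in Example 1.

```python
# Minimal sketch (hypothetical data): coding a two-category variable
# as a 0-1 dummy variable.
fireplace = ["Yes", "No", "Yes", "No"]                    # hypothetical answers
x_d = [1 if value == "Yes" else 0 for value in fireplace]  # 1 = in the category
print(x_d)  # [1, 0, 1, 0]
```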

    Example 1

Data collected from a sample of 15 houses are stored in Houses.xlsx.

House   Value ($ thousands)   Size (thousands of square feet)   Presence of Fireplace
1       234.4                 2.00                              Yes
2       227.4                 1.71                              No
...
14      233.8                 1.89                              Yes
15      226.8                 1.59                              No

(a) Develop a regression model for predicting the assessed value y of houses, based on the size x1 of the house and whether the house has a fireplace.

To include the categorical variable for the presence of a fireplace, the dummy variable is defined as

    x2 = 1   if the house has a fireplace
         0   if the house does not have a fireplace


To code this dummy variable in Excel, enter =IF(C2="Yes",1,0) in cell D2 and drag it down. The data become:

House   Value   Size   Fireplace
1       234.4   2.00   1
2       227.4   1.71   0
...
14      233.8   1.89   1
15      226.8   1.59   0

Assuming that the slope of assessed value with the size of the house is the same for houses that have and do not have a fireplace, the multiple regression model is

    y = α + β1x1 + β2x2 + ε

    Here are the regression results for this model.

Regression Statistics
Multiple R           0.9006
R Square             0.8111
Adjusted R Square    0.7796
Standard Error       2.2626
Observations         15

ANOVA
             df    SS         MS         F         Significance F
Regression   2     263.7039   131.8520   25.7557   0.0000
Residual     12    61.4321    5.1193
Total        14    325.1360

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   200.0905       4.3517           45.9803   0.0000    190.6090    209.5719
Size        16.1858        2.5744           6.2871    0.0000    10.5766     21.7951
Fireplace   3.8530         1.2412           3.1042    0.0091    1.1486      6.5574

    (b) Interpret the regression coefficients.

The regression equation is

    ŷ = a + b1x1 + b2x2
    ŷ = 200.0905 + 16.1858x1 + 3.8530x2


For houses with a fireplace, you substitute x2 = 1 into the regression equation:

    ŷ = 200.0905 + 16.1858x1 + 3.8530(1)
    ŷ = 203.9435 + 16.1858x1        (1)

In terms of the coefficients, this is ŷ = (a + b2) + b1x1.

For houses without a fireplace, you substitute x2 = 0 into the regression equation:

    ŷ = 200.0905 + 16.1858x1 + 3.8530(0)
    ŷ = 200.0905 + 16.1858x1        (2)

In terms of the coefficients, this is ŷ = a + b1x1.

Interpretation of a

The expected value of a house with 0 square feet and no fireplace is $200,091, which obviously does not make sense in this context.

Interpretation of b1

The effect of x1 on y is the same for houses with or without a fireplace. When x1 increases by one unit, y is expected to change by b1 units for houses with or without a fireplace. Thus, holding constant whether a house has a fireplace, for each increase of 1 thousand square feet in the size of the house, the predicted assessed value is estimated to increase by 16.1858 thousand dollars (i.e., $16,185.80).

Interpretation of b2

The slope of equations (1) and (2) is the same (b1 = 16.1858), but the intercepts differ by an amount b2 = 3.8530. Geometrically, the two equations correspond to two parallel lines that are a vertical distance b2 = 3.8530 apart. Therefore, the interpretation of b2 is that it indicates the difference between the two intercepts, 203.9435 and 200.0905.

Thus, holding constant the size of the house, the presence of a fireplace is estimated to increase the predicted assessed value of the house by 3.8530 thousand dollars (i.e., $3,853).
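The parallel-lines interpretation can be checked numerically. This sketch plugs a hypothetical 2,000-square-foot house into the fitted equation from the text, with and without a fireplace; the house size is an assumed example value.

```python
# Fitted equation from the text: y_hat = 200.0905 + 16.1858*x1 + 3.8530*x2
def predicted_value(size, fireplace):
    return 200.0905 + 16.1858 * size + 3.8530 * fireplace

with_fp = predicted_value(2.0, 1)      # 2,000 sq ft, with a fireplace
without_fp = predicted_value(2.0, 0)   # same size, no fireplace
gap = with_fp - without_fp             # vertical distance between the two lines
```

For any size, the gap equals b2 = 3.8530 (thousand dollars), which is exactly the vertical distance between the two parallel lines.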


    (c) Does the regression equation provide a good fit for the observed data?

The test statistic for the slope of the size of the house with assessed value is 6.2871, and the P-value is approximately zero. The test statistic for the presence of a fireplace is 3.1042, and the P-value is 0.0091.

Thus, each of the two variables makes a significant contribution to the model. In addition, the coefficient of determination indicates that 81.11% of the variation in assessed value is explained by variation in the size of the house and whether the house has a fireplace.

When a categorical variable has two categories (fireplace, no fireplace), one dummy variable is used. When a categorical variable has m categories, m – 1 dummy variables are required, with each dummy variable coded as 0 or 1.

    Example 2

Define a multiple regression model using sales (y) as the response variable and price (x1) and package design as explanatory variables. Package design is a three-level categorical variable with designs A, B, or C.

    Solution:

To model the m = 3-level categorical variable package design, m – 1 = 3 – 1 = 2 dummy variables are needed:

    x2 = 1   if package design A is used
         0   otherwise

    x3 = 1   if package design B is used
         0   otherwise

    Therefore, the regression model is

    y = α + β1x1 + β2x2 + β3x3 + ε

    Here the package design is coded as:

Package design   A (x2)   B (x3)
A                1        0
B                0        1
C                0        0
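The m – 1 coding rule can be sketched in a few lines; the design values below are hypothetical.

```python
# Hypothetical sketch: m = 3 package designs need m - 1 = 2 dummy columns.
designs = ["A", "B", "C", "B", "A"]                  # hypothetical data
x2 = [1 if d == "A" else 0 for d in designs]         # design A indicator
x3 = [1 if d == "B" else 0 for d in designs]         # design B indicator
# Design C is the baseline category: both x2 = 0 and x3 = 0.
```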


    Interaction Terms

In the regression models discussed so far, the effect an explanatory variable has on the response variable has been assumed to be independent of the other explanatory variables in the model. An interaction occurs if the effect of an explanatory variable on the response variable changes according to the value of a second explanatory variable.

For example, it is possible for advertising to have a large effect on the sales of a product when the price of the product is low. However, if the price of the product is too high, increases in advertising will not dramatically change sales. In other words, you cannot make general statements about the effect of advertising on sales. The effect that advertising has on sales is dependent on the price. Therefore, price and advertising are said to interact.

When interaction between two variables is present, we cannot study the effect of one variable on the response y independently of the other variable. Meaningful conclusions can be developed only if we consider the joint effect that both variables have on the response.

To account for the effect of two explanatory variables xi and xj acting together, an interaction term (sometimes referred to as a cross-product term) xi·xj is added to the model.

    Example 1 (Continued)

    (d) Formulate a regression model to evaluate whether an interaction exists.

In the regression model, we assumed that the effect the size of the home has on the assessed value is independent of whether the house has a fireplace. In other words, we assumed that the slope of assessed value with size is the same for houses with fireplaces as it is for houses without fireplaces. If these two slopes are different, an interaction exists between the size of the home and the fireplace.

To evaluate whether an interaction exists, the following model is considered:

    y = α + β1x1 + β2x2 + β3x1x2 + ε

where x1x2 is the interaction term. With this new variable x3 = x1x2 in the model, the value of x2 changes how x1 affects y.

If we factor out x1, we get:

    y = α + (β1 + β3x2)x1 + β2x2 + ε

Thus, each value of x2 yields a different slope in the relationship between y and x1. Expressed in other words, the parameter β3 of the interaction term gives an adjustment to the slope of x1 for the possible values of x2.


    (e) Interpret the estimated regression equation.

    The data for the model with an interaction term are:

House   Value   Size   Fireplace   Size × Fireplace
1       234.4   2.00   1           2.00
2       227.4   1.71   0           0.00
...
14      233.8   1.89   1           1.89
15      226.8   1.59   0           0.00

    The regression output for this model is:

Regression Statistics
Multiple R           0.9179
R Square             0.8426
Adjusted R Square    0.7996
Standard Error       2.1573
Observations         15

ANOVA
             df    SS         MS        F         Significance F
Regression   3     273.9441   91.3147   19.6215   0.0001
Residual     11    51.1919    4.6538
Total        14    325.1360

                 Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        212.9522       9.6122           22.1544   0.0000    191.7959    234.1084
Size             8.3624         5.8173           1.4375    0.1784    -4.4414     21.1662
Fireplace        -11.8404       10.6455          -1.1122   0.2898    -35.2710    11.5902
Size*Fireplace   9.5180         6.4165           1.4834    0.1661    -4.6046     23.6406

The estimated regression equation is

    ŷ = a + b1x1 + b2x2 + b3x1x2
    ŷ = 212.9522 + 8.3624x1 – 11.8404x2 + 9.5180x1x2

To see the interaction effect, we have to evaluate this equation for the possible values of x2.


For houses with a fireplace, x2 = 1 and the regression equation is:

    ŷ = 212.9522 + 8.3624x1 – 11.8404(1) + 9.5180x1(1)
    ŷ = 201.1118 + 17.8804x1        (3)

In terms of the coefficients, this is ŷ = (a + b2) + (b1 + b3)x1.

For houses without a fireplace, x2 = 0 and the regression equation is:

    ŷ = 212.9522 + 8.3624x1 – 11.8404(0) + 9.5180x1(0)
    ŷ = 212.9522 + 8.3624x1        (4)

In terms of the coefficients, this is ŷ = a + b1x1.

Interpretation of b2

The coefficient of the indicator variable, b2 = –11.8404, provides a different intercept to separate the houses with and without a fireplace at the origin (where Size = 0 sq ft). Here it does not make much sense. Literally, it says that for houses with a size of 0 sq ft, the value of a house without a fireplace is about $11,840 higher than the value of a house with a fireplace.

Interpretation of b3

The coefficient of the interaction term, b3 = 9.5180, says that the slope relating the size of the house to its value is steeper by $9,518 per thousand square feet for houses with a fireplace than for houses without a fireplace.

The two lines, (3) and (4), meet at (size, value) = (1.2440, 223.3550). Thus, the value of a house with a size greater than 1,244 sq ft is predicted to be higher when the house has a fireplace.
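The intersection point quoted above can be recomputed directly from the coefficients; a small sketch:

```python
# Lines (3) and (4) from the text meet where
#   201.1118 + 17.8804*x1 = 212.9522 + 8.3624*x1,
# which simplifies to x1 = -b2/b3.
b2, b3 = -11.8404, 9.5180
size_star = -b2 / b3                          # in thousands of square feet
value_star = 212.9522 + 8.3624 * size_star    # in thousands of dollars
```

This reproduces the crossover at about 1.244 thousand sq ft and an assessed value of about 223.355 thousand dollars.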


    (f) Does the interaction term make a significant contribution to the regression model?

To test for the existence of an interaction, the null and alternative hypotheses are:

    H0: β3 = 0
    Ha: β3 ≠ 0

The test statistic for the interaction of size and fireplace is 1.4834, with a P-value of 0.1661. Because the P-value is large, you do not reject the null hypothesis.

Thus, although the slope adjuster b3 = 9.5180 implies that the value gap between houses with and without a fireplace increases with house size, this effect is not statistically significant. In other words, the interaction term does not make a significant contribution to the model, given that size and presence of a fireplace are already included. Therefore, you can conclude that the slope of assessed value with size is the same for houses with and without fireplaces.

Note: If the correlation between interaction terms and the original variables in the regression is high, collinearity problems can result. In a regression with several variables, the number of interaction variables that could be created is very large, and the likelihood of collinearity problems is high. Therefore, it is wise not to use interaction variables indiscriminately. There should be some good reason to suspect that two variables might be related, or some specific question that can be answered by an interaction variable, before this type of variable is used.


    Nonlinear Transformations

    The general linear model has the form

    y = α + β1x1 + β2x2 + … + βkxk + ε

It is linear in the sense that the right side of the equation is a constant plus a sum of products of constants and variables. However, there is no requirement that the response variable y or the explanatory variables x1 through xk be the original variables in the data set. Most often they are, but they can also be transformations of original variables. You can transform the response variable y or any of the explanatory variables, the x's. You can also do both.

The purpose of nonlinear transformations is usually to "straighten out" the points in a scatterplot in order to overcome violations of the assumptions of regression or to make the form of a model linear. They can also arise because of economic considerations. Among the many transformations available are the square root, the reciprocal, the square, and transformations involving the common logarithm (base 10) and the natural logarithm (base e).

The type of transformation to correct for curvilinearity is not always obvious. Different transformations may be tried and the one that appears to do the best job chosen. There may be theoretical results as well to support the use of certain transformations in certain cases. As always, subject matter expertise is important in any analysis. If several different transformations straighten out the data equally well, the one that is easiest to interpret is preferred.

The most frequently used nonlinear transformations in business and economic applications are the quadratic and logarithmic transformations.

    Quadratic Transformations

One of the most common nonlinear relationships between the response variable y and an explanatory variable x is a curvilinear relationship in which y increases (or decreases) at a changing rate for various values of x. The quadratic regression model defined below can be used to analyze this type of relationship between x and y.

    y = α + β1x1 + β2x1² + ε

This model is similar to the multiple regression model except that the second explanatory variable is the square of the first explanatory variable. Once again, the least squares method can be used to compute sample regression coefficients a, b1, and b2 as estimates of the population parameters α, β1, and β2. The estimated regression equation for the quadratic model is

    ŷ = a + b1x1 + b2x1²

In this equation, the first regression coefficient a represents the y intercept; the second regression coefficient b1 represents the linear effect; and the third regression coefficient b2 represents the quadratic effect.
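The least squares mechanics for the quadratic model can be sketched with NumPy; all data values below are hypothetical, chosen only to give a rise-then-fall pattern.

```python
import numpy as np

# Minimal sketch (hypothetical data): fit y = a + b1*x + b2*x^2 by least squares.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.5, 3.0, 2.4, 1.1])           # rises, then falls
X = np.column_stack([np.ones_like(x), x, x * x])  # columns: 1, x, x^2
(a, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)
```

A negative b2 captures the concave (rise-then-fall) shape, just as in the fly ash example that follows.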


Example 3: Fly Ash

Fly ash is an inexpensive industrial waste by-product that can be used as a substitute for Portland cement, a more expensive ingredient of concrete. How does adding fly ash affect the strength of concrete?

Batches of concrete were prepared in which the percentage of fly ash ranged from 0% to 60%. Data were collected from a sample of 18 batches and stored in FlyAsh.xlsx.

Batch   Strength (psi)   Fly Ash %
1       4779             0
2       4706             0
...
17      5030             60
18      4648             60

(a) A linear model has been fit to these data. Below is the regression output. What do these results show?

Regression Statistics
Multiple R           0.4275
R Square             0.1827
Adjusted R Square    0.1317
Standard Error       460.7787
Observations         18

ANOVA
             df    SS             MS            F        Significance F
Regression   1     759618.0571    759618.0571   3.5778   0.0768
Residual     16    3397072.4429   212317.0277
Total        17    4156690.5000

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   4924.5952      213.2991         23.0877   0.0000    4472.4213   5376.7691
Fly Ash%    10.4171        5.5074           1.8915    0.0768    -1.2579     22.0922


The t test indicates that the linear term is significant at the 0.10 (but not at the 0.05) level of significance (P-value = 0.0768). The extremely low coefficient of determination (r² = 0.1827) shows that the linear model explains only about 18% of the variation in strength.

Moreover, the scatterplot of the data and the plot of residuals versus fitted values indicate that a linear model is not appropriate for these data. For example, the scatterplot of Strength versus Fly Ash % indicates an initial increase in the strength of the concrete as the percentage of fly ash increases. The strength appears to level off and then drop after achieving maximum strength at about 40% fly ash. Strength for 50% fly ash is slightly below strength at 40%, but strength at 60% is substantially below strength at 50%.

Therefore, to estimate strength based on fly ash percentage, a quadratic model seems more appropriate for these data, not a linear model.

(b) The data for the quadratic model are:

Batch   Strength (psi)   Fly Ash %   (Fly Ash %)²
1       4779             0           0
2       4706             0           0
...
17      5030             60          3600
18      4648             60          3600

Below are the regression results for the quadratic model. What does the residual plot show? What is the estimated regression equation?

Regression Statistics
Multiple R           0.8053
R Square             0.6485
Adjusted R Square    0.6016
Standard Error       312.1129
Observations         18

ANOVA
             df    SS             MS             F         Significance F
Regression   2     2695473.4897   1347736.7448   13.8351   0.0004
Residual     15    1461217.0103   97414.4674
Total        17    4156690.5000

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    4486.3611      174.7531         25.6726   0.0000    4113.8836   4858.8386
Fly Ash%     63.0052        12.3725          5.0923    0.0001    36.6338     89.3767
Fly Ash%^2   -0.8765        0.1966           -4.4578   0.0005    -1.2955     -0.4574


The curved pattern in the residual plot is gone. The points in the residual plot show a random scatter with an approximately equal spread above and below the horizontal 0 line.

From the regression output: a = 4,486.3611, b1 = 63.0052, b2 = –0.8765. Therefore, the quadratic regression equation is

    ŷ = 4,486.3611 + 63.0052x1 – 0.8765x1²

    Predicted Strength = 4,486.3611 + 63.0052 Fly Ash% – 0.8765 (Fly Ash%)²

The following figure is a scatterplot that shows the fit of the quadratic regression curve to the original data. (In Excel, use the option "Polynomial" to add the trendline.)

    (c) Interpret the regression coefficients.

    The y intercept 4,486.3611 is the predicted strength when the percentage of fly ash is 0.

To interpret the coefficients b1 = 63.0052 and b2 = –0.8765, observe that after an initial increase, strength decreases as fly ash percentage increases. This nonlinear relationship is further demonstrated by predicting the strength for fly ash percentages of 20, 40, and 60. Using the quadratic regression equation,

    Predicted Strength = 4,486.3611 + 63.0052 Fly Ash% – 0.8765 (Fly Ash%)²

For Fly Ash% = 20: Predicted Strength = 4,486.3611 + 63.0052(20) – 0.8765(20)² = 5,395.865
For Fly Ash% = 40: Predicted Strength = 4,486.3611 + 63.0052(40) – 0.8765(40)² = 5,604.169
For Fly Ash% = 60: Predicted Strength = 4,486.3611 + 63.0052(60) – 0.8765(60)² = 5,111.273

Thus, the predicted concrete strength for 40% fly ash is 208.304 psi above the predicted strength for 20% fly ash, but the predicted strength for 60% fly ash is 492.896 psi below the predicted strength for 40% fly ash.
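The three predictions above can be reproduced by evaluating the fitted quadratic; a small sketch:

```python
# Predictions from the quadratic equation in the text.
def predicted_strength(fly_ash_pct):
    return 4486.3611 + 63.0052 * fly_ash_pct - 0.8765 * fly_ash_pct ** 2

s20 = predicted_strength(20)   # ≈ 5,395.865 psi
s40 = predicted_strength(40)   # ≈ 5,604.169 psi
s60 = predicted_strength(60)   # ≈ 5,111.273 psi
```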


    (d) Test the significance of the quadratic model.

The null and alternative hypotheses for testing whether there is a significant overall relationship between strength y and fly ash percentage x1 are as follows:

    H0: β1 = β2 = 0   (There is no overall relationship between x1 and y.)
    Ha: β1 and/or β2 ≠ 0   (There is an overall relationship between x1 and y.)

The overall F test statistic used for this test is F = 13.8351. The corresponding P-value is 0.0004. Because of the small P-value, you reject the null hypothesis and conclude that there is a significant overall relationship between strength and fly ash percentage.

    (e) Test the quadratic effect.

To test the significance of the contribution of the quadratic term, you use the following null and alternative hypotheses:

    H0: β2 = 0   (Including the quadratic term does not significantly improve the model.)
    Ha: β2 ≠ 0   (Including the quadratic term significantly improves the model.)

The test statistic and the corresponding P-value are t = –4.4578 and P-value = 0.0005. You reject H0 and conclude that the quadratic term is statistically significant and should be kept in the model.

    (f) How good is the quadratic model? Is it better than the linear model?

Model                       r²    se
y = α + β1x1 + ε            18%   461
y = α + β1x1 + β2x1² + ε    65%   312

The coefficient of determination r² = 0.6485 shows that about 65% of the variation in strength is explained by the quadratic relationship between strength and the percentage of fly ash. The percentage of variation explained by the linear model is much smaller: about 18%.

Another indicator that the regression has been improved by adding the quadratic term is the reduction in the standard error se from about 461 in the linear model to about 312 in the quadratic model.

Thus, based on our findings in (a) through (f), we can conclude that the quadratic model is significantly better than the linear model for representing the relationship between strength and fly ash percentage.

Note: Although this was not the case in this example, it can happen that in a quadratic model the quadratic term is significant and the linear term is not. In such situations (for statistical reasons not discussed here), the general rule is to keep the linear term despite its insignificance.


    Logarithmic Transformations

If scatterplots suggest nonlinear relationships, there are many nonlinear transformations of y and/or the x's that could be tried in a regression analysis. The reason that logarithmic transformations are arguably the most frequently used nonlinear transformations, besides the fact that they often produce good fits, is that they can be interpreted naturally in terms of percentage changes. In real studies, this interpretability is an important advantage over other potential nonlinear transformations.

The log transformations put values on a different scale that compresses large distances so that they are more comparable to smaller distances.

It is common in business and economic applications to use natural (base e) logarithms, although the base used is usually not important.

    Interpretation of a slope coefficient b when log is used

Case 1: Predicted y = a + b log x + …   (x is log-transformed, y is not log-transformed)

The expected change in y (increase or decrease depending on the sign of b) when x increases by 1% is approximately 0.01b.

Example: Predicted y = 5.67 + 0.34 log x

This regression equation implies that every 1% increase in x (for example, from 200 to 202) is accompanied by an increase in y of about (0.01)(0.34) = 0.0034.

Case 2: Predicted log y = a + bx + …   (x is not log-transformed, y is log-transformed)

Whenever x increases by 1 unit, the expected value of y changes (increases or decreases depending on the sign of b) by a constant percentage, and this percentage is approximately equal to b written as a percentage (that is, 100b%).

Example: Predicted log y = 5.67 + 0.34x

Here b = 0.34, and written as a percentage it is (100)(0.34) = 34%. When x increases by 1 unit, the expected value of y increases by approximately 34%.

Case 3: Predicted log y = a + b log x + …   (both x and y are log-transformed)

The expected change in y (increase or decrease depending on the sign of b) when x increases by 1% is approximately b%.

Example: Predicted log y = 5.67 + 0.34 log x

For every 1% increase in x, y is expected to increase by approximately 0.34%.
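These rules of thumb are approximations, and they can be checked numerically. The sketch below uses the example coefficient b = 0.34 and compares the exact effect of a 1% increase in x with the approximation for Cases 1 and 3.

```python
import math

# Check the rules of thumb with b = 0.34 (the example coefficient).
b = 0.34

# Case 1: y = a + b*log(x). A 1% increase in x adds exactly b*ln(1.01) to y.
case1_exact = b * math.log(1.01)       # close to the 0.01*b = 0.0034 rule

# Case 3: log(y) = a + b*log(x). A 1% increase in x multiplies y by 1.01**b.
case3_pct = (1.01 ** b - 1) * 100      # close to the b% = 0.34% rule
```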


Example 4: Fuel Consumption

The file Fuel_Consumption.xlsx contains data on the fuel consumption in gallons per capita for each of the 50 states and Washington, DC. Here is part of the data.

State             FuelCon   Population   Area     Density
Alabama           547.92    4486508      50750    88.4041
Alaska            440.38    643786       570374   1.1287
...
Wyoming           715.55    498703       97105    5.1357
Washington D.C.   289.99    570898       61       9358.9836

The goal is to develop a regression equation to predict fuel consumption based on the population density (defined as population/area).

    The scatterplot of FuelCon versus Density is shown below.

Looking at the scatterplot, it is clear that this is not a linear relationship. One thing to note about this plot is how the values spread out on the x axis. At the left-hand side of the x axis, the values are clumped together. Moving from left to right, the values become progressively more spread out. This suggests the use of a log transformation of Density. The log transformation evens out the successively larger distances between the values.

    The scatterplot of FuelCon versus the natural logarithm of Density (LogDensity) is shown below.


    The relationship appears to be linear.

On the next page are the regression results using Density and using LogDensity as an explanatory variable. The following table provides summary statistics for the two models.

Model                  r²      se
y = α + βx + ε         20.6%   65.17
y = α + β log x + ε    27.8%   62.16

The regression results indicate that using LogDensity as the explanatory variable produces a better model fit than the regression using Density.

The estimated regression equation for the logarithmic model is

    Predicted FuelCon = 597.1867 – 24.5308 LogDensity

This equation shows that if the population density increases by 1%, the average fuel consumption is expected to decrease by approximately (0.01)(24.5308) = 0.2453 gallons per capita.
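The 1% interpretation can be verified against the fitted equation from the text; the density value used below is a hypothetical example.

```python
import math

# Fitted model from the text: FuelCon_hat = 597.1867 - 24.5308*ln(Density).
def fuelcon_hat(density):
    return 597.1867 - 24.5308 * math.log(density)

density = 100.0                              # hypothetical density value
exact_drop = fuelcon_hat(density) - fuelcon_hat(density * 1.01)
approx_drop = 0.01 * 24.5308                 # the 0.2453 rule-of-thumb value
```

Because the model is linear in log(Density), the exact drop for a 1% increase is the same at every density, and it is close to the 0.2453 approximation.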


    Regression of FuelCon on Density

Regression Statistics
Multiple R           0.4538
R Square             0.2059
Adjusted R Square    0.1897
Standard Error       65.1675
Observations         51

ANOVA
             df    SS            MS           F         Significance F
Regression   1     53960.7466    53960.7466   12.7062   0.0008
Residual     49    208093.4001   4246.8041
Total        50    262054.1466

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   495.6283       9.4811           52.2752   0.0000    476.5752    514.6814
Density     -0.0251        0.0070           -3.5646   0.0008    -0.0392     -0.0109

    Regression of FuelCon on LogDensity

Regression Statistics
Multiple R           0.5269
R Square             0.2776
Adjusted R Square    0.2629
Standard Error       62.1561
Observations         51

ANOVA
             df    SS            MS           F         Significance F
Regression   1     72748.3136    72748.3136   18.8302   0.0001
Residual     49    189305.8330   3863.3843
Total        50    262054.1466

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    597.1867       26.9612          22.1499   0.0000    543.0062    651.3671
LogDensity   -24.5308       5.6531           -4.3394   0.0001    -35.8911    -13.1705


Example 5: Imports and GDP

The gross domestic product (GDP) and dollar amount of total imports (Imports), both in billions of dollars, for 25 countries are saved in Imports_and_GDP.xlsx.

Country          Imports    GDP
Argentina        20.300     391.000
Australia        68.000     528.000
...
United Kingdom   330.100    1520.000
United States    1148.000   10082.000

The objective is to find an equation showing the relationship between Imports (y) and GDP (x).

    The scatterplot of Imports versus GDP shows that this is not a linear relationship.

At the left-hand side of the x axis and the bottom of the y axis, the values are clumped together. Moving from left to right on the x axis, the values become more spread out. The same thing happens when moving up the y axis: the values become progressively more spread out. This suggests the use of a log transformation for both the x and y variables.

As the scatterplot of LogImports versus LogGDP below shows, the relationship appears much closer to linear.


    The results for the regression of LogImports on LogGDP are shown below.

The regression of Imports on GDP is not shown for comparison purposes, because the response variable y has been transformed to log y and the usual comparisons are not valid. In particular, the interpretations of se and r² are different because the units of the response variable are completely different. For example, increases in r² when the natural logarithm transformation is applied to y do not necessarily suggest an improved model. Because of the above, it is difficult to compare this regression to any model using y as the response variable.

    Note that transformations of the explanatory variables do not create this type of problem.It is only when the y variable is transformed that comparison becomes more difficult.

Regression Statistics
Multiple R           0.9168
R Square             0.8404
Adjusted R Square    0.8335
Standard Error       0.9142
Observations         25

ANOVA
             df    SS         MS         F          Significance F
Regression   1     101.2551   101.2551   121.1527   0.0000
Residual     23    19.2226    0.8358
Total        24    120.4777

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   -1.1275        0.4346           -2.5941   0.0162    -2.0265     -0.2284
LogGDP      0.8670         0.0788           11.0069   0.0000    0.7041      1.0300


The regression model is

    log y = α + β log x + ε

    The estimated regression equation is

    Predicted LogImports = –1.1275 + 0.8670 LogGDP

The slope coefficient 0.8670 indicates that if the GDP increases by 1%, then the Imports are expected to increase by approximately 0.8670% (about 1%).

If the estimated regression equation is used for forecasting, natural logs of the y values are forecast, not the y values themselves. For example, what is the forecast of Imports for a country with GDP = 500 billion dollars?

Using the estimated regression equation,

    LogImports = –1.1275 + 0.8670 LogGDP
               = –1.1275 + 0.8670 Log(500)
               = –1.1275 + 0.8670(6.2146)
               = 4.2606

The forecast value for y (Imports) must be computed as

    Imports = e^4.2606 = 70.85 ≈ 71 billion dollars
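The back-transformation above can be reproduced in a few lines:

```python
import math

# Back-transforming the log-scale forecast for GDP = 500 (billion dollars).
log_gdp = math.log(500)                     # natural log, ≈ 6.2146
log_imports = -1.1275 + 0.8670 * log_gdp    # ≈ 4.2606
imports = math.exp(log_imports)             # ≈ 70.85 billion dollars
```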

