
16 Review of Part II

  • 8/17/2019 16 Review of Part II


    REVIEW OF PART II

    Topics Outline

     Inference for Regression

     Multiple Regression

     Building Regression Models

    Terms and Concepts

1. Equation of a regression model
2. Equation of the true population line/plane/surface
3. Equation of the least squares regression line/plane/surface
4. Least squares criterion
5. Regression coefficients and their interpretation
6. Correlation r and its interpretation
7. Coefficient of determination r² and its interpretation
8. Regression standard error s_e and its interpretation
9. Confidence intervals for the intercept (β₀) and for the regression slopes (β's)
10. The overall F test
11. t test for a regression slope
12. Confidence interval for a mean response
13. Prediction interval for an individual response
14. Residuals and residual plots
15. Normal probability plot (Q-Q plot)
16. Regression assumptions and how to check them
17. Excel regression output (what each cell represents and what the connections between the cells are)
18. ANOVA table
19. Sums of squares (SST, SSR, SSE)
20. Mean squares (MSR, MSE)
21. Collinearity
22. Variance inflation factor (VIF)
23. Correlation matrix
24. Validation of the fit
25. Dummy (indicator) variables
26. Interaction (cross-product) terms
27. Nonlinear transformations (quadratic and logarithmic)
28. The partial F test
29. Adjusted r²
30. Cp statistic
31. Include/exclude decisions
32. Principle of parsimony
33. Building regression models
34. Variable selection procedures (forward selection, backward elimination, stepwise regression, best subsets regression)


Example 1

Florida reappraises real estate every year, so the county appraiser's Web site lists the current "fair market value" of each piece of property. Property usually sells for somewhat more than the appraised market value. Data for the appraised market values and actual selling prices (in thousands of dollars) of 16 condominium units sold in a beachfront building over a 19-month period are stored in the file Condominiums.xlsx.

Condominium    Selling Price    Appraised Value
1              850              758.0
2              900              812.7
...
15             1325             1031.8
16             845              586.7

    Excel output for a linear regression of selling price on appraised value is shown on the next page.

(a) Write the equation for the model of the population regression line.

y = β₀ + β₁x + ε

(b) Write the equation of the true population regression line.

μ_y = β₀ + β₁x

(c) What is the equation of the least-squares regression line for predicting selling price from appraised value?

ŷ = 127.27 + 1.0466x

(d) What is the correlation between appraised value and selling price?

The correlation r is the square root of r²:

r = √0.861 = 0.93

(We take the positive square root because the sign of r must be the same as the sign of the slope, 1.0466.)

Reminder: For simple and multiple regression, r is the correlation between the observed values of y and the predicted values ŷ. For simple linear regression, r is also the correlation between x and y.

(e) Explain why the pattern you see on the residual plot agrees with the conditions of linear relationship and constant standard deviation needed for regression inference.

On the residual plot, as usual a horizontal line is added at residual zero, the mean of the residuals. This line corresponds to the regression line in the plot of selling price against appraised value. The residuals show a random scatter about the line, with roughly equal vertical spread across their range. This is what we expect when the conditions for regression inference hold.

(f) Does the histogram of the residuals suggest lack of normality?

The distribution of residuals has a bit of a cluster at the left, but there are no outliers or other strong deviations from normality that would prevent regression inference.


    Regression Statistics

    Multiple R 0.9277

    R Square 0.8606

    Adjusted R Square 0.8506

    Standard Error 69.7299

    Observations 16

    ANOVA

    df SS MS F Significance F

    Regression 1 420072.1418 420072.1418 86.3945 0.0000

    Residual 14 68071.6082 4862.2577

    Total 15 488143.7500

    Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

    Intercept 127.2705 79.4892 1.6011 0.1317 -43.2168 297.7578

    Appraised Value 1.0466 0.1126 9.2949 0.0000 0.8051 1.2881

[Charts: Fitted Line Plot of Selling Price versus Appraised Value, with trendline y = 1.0466x + 127.27 and R² = 0.8606; Residuals versus Appraised Value; Histogram of the Residuals]


(g) How many degrees of freedom does the t distribution used for statistical inference on these data have?

There are n = 16 data pairs, so df = n – 1 – 1 = 16 – 2 = 14.

    Reminder:

    df = n  –  k   –  1, where n is the number of observations, k  is the number of explanatory variables.

(h) Explain what the slope β₁ of the true regression line means in this setting.

β₁ is the average rate of increase in selling price in a population of condominium units when appraised value increases by $1,000.

(i) Find a 95% confidence interval for the population slope β₁.

b ± t* SE_b = 1.0466 ± (2.145)(0.1126) = 1.0466 ± 0.2415 = 0.8051 to 1.2881
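The handout's interval was computed in Excel; the same numbers can be checked in a few lines, assuming SciPy is available for the t critical value:

```python
from scipy.stats import t

b, se_b, df = 1.0466, 0.1126, 14      # slope, its standard error, n - 2
t_star = t.ppf(0.975, df)             # 95% critical value, about 2.145
lower, upper = b - t_star * se_b, b + t_star * se_b
print(round(t_star, 3), round(lower, 4), round(upper, 4))
```

This reproduces the 0.8051 to 1.2881 interval shown in the Excel output's Lower 95% / Upper 95% columns.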

(j) Is there significant evidence that selling price increases as appraised value increases?

To test

H₀: β₁ = 0
Hₐ: β₁ > 0

we use the test statistic t = b/SE_b = 9.2949 with df = 14. The P-value is 0.000.

We reject the null hypothesis and conclude that there is a positive linear relationship between selling prices and appraised values of the condominium units in this building.

(k) What is the predicted selling price of a unit appraised at $802,600?

ŷ = 127.27 + 1.0466(802.6) = 967.2712

The predicted selling price of a unit appraised at $802,600 is $967,271.

(l) What is a 95% interval for the mean selling price of a unit appraised at $802,600? (Use SE_mean = 21.6387.)

For a mean selling price, we should use the confidence interval:

ŷ ± t* SE_mean = 967.2712 ± (2.145)(21.6387) = 967.2712 ± 46.415

= $921,000 to $1,014,000

(m) Hamada owns a unit in this building appraised at $802,600. What is a 95% interval for the selling price of this unit? (Use SE_ind = 73.0102.)

For a range of values for an individual unit, we should use the prediction interval. Hamada can be 95% confident that her unit would sell for

ŷ ± t* SE_ind = 967.2712 ± (2.145)(73.0102) = 967.2712 ± 156.6069

= $811,000 to $1,124,000
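Both intervals for parts (l) and (m) share the same point estimate and critical value and differ only in the standard error used; a sketch assuming SciPy, with SE_mean and SE_ind taken from the software output:

```python
from scipy.stats import t

t_star = t.ppf(0.975, 14)                 # about 2.145 for df = 14
y_hat = 127.2705 + 1.0466 * 802.6         # predicted selling price (in $1000s)

se_mean, se_ind = 21.6387, 73.0102        # from the regression software
ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)  # mean response
pi = (y_hat - t_star * se_ind,  y_hat + t_star * se_ind)   # individual unit
print([round(v, 1) for v in ci], [round(v, 1) for v in pi])
```

The prediction interval is much wider than the confidence interval because it must cover the variation of a single unit's price around the mean, not just the uncertainty in the mean itself.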


    Example 2

    Demand and Cost for Electricity

The Public Service Electric Company produces different quantities of electricity each month, depending on the demand. The file Cost_of_Power.xlsx lists the number of Units of electricity produced and the total Cost (in dollars) of producing these units for a 36-month period.

Month    Cost     Units
1        45623    601
2        46507    738
...
35       45218    705
36       45357    637

(a) What does the scatterplot of Cost versus Units reveal about the relationship between Cost and Units?

The scatterplot indicates a definite positive relationship and one that is nearly linear. However, there is also some evidence of curvature in the plot. The points increase slightly less rapidly as Units increases from left to right. In economic terms, there might be economies of scale, so that the marginal cost of electricity decreases as more units of electricity are produced.

(b) The output for a simple linear regression is shown on the next page. Does the residual plot suggest the need for a nonlinear transformation?

The residuals to the far left and the far right are all negative, whereas the majority of the residuals in the middle are positive. This negative-positive-negative behavior of residuals suggests a parabola. Admittedly, the pattern is far from a perfect parabola because there are several negative residuals in the middle. However, this plot certainly suggests nonlinear behavior, and exploring a quadratic relationship with the square of Units included in the equation is reasonable.
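Fitting such a quadratic is mechanical with NumPy's polyfit. The sketch below uses hypothetical noise-free data built from a known quadratic (not the actual contents of Cost_of_Power.xlsx) just to show that the call recovers the coefficients; the real analysis would pass the file's Units and Cost columns:

```python
import numpy as np

# Hypothetical data: an exact quadratic over the observed range of Units.
units = np.linspace(200, 900, 36)
cost = 5792.8 + 98.35 * units - 0.06 * units**2

# Degree-2 least squares fit; coefficients come back highest power first.
b2, b1, b0 = np.polyfit(units, cost, 2)
print(round(b0, 1), round(b1, 2), round(b2, 3))
```

With real, noisy data the recovered coefficients would of course only approximate the underlying relationship.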

[Chart: Scatterplot of Cost vs Units]


Regression output for the linear model

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.8579        0.7359      0.7282               2733.7

ANOVA Table      Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained        1                     708085273.8       708085273.8        94.7481    < 0.0001
Unexplained      34                    254093815.2       7473347.506

Regression Table    Coefficient    Standard Error    t-Value    p-Value     Lower 95%    Upper 95%
Constant            23651.5        1917.1            12.3369    < 0.0001    19755.4      27547.6
Units               30.533         3.137             9.7339     < 0.0001    24.158       36.908

(c) The regression output for estimating a quadratic relationship between Cost and Units is shown on the next page. What is the estimated regression equation? Does it provide a better fit than the linear equation?

The estimated regression equation is

Predicted Cost = 5792.80 + 98.350 Units – 0.0600 (Units)²

The graph of the regression equation superimposed on the scatterplot of Cost versus Units shows a reasonably good fit, plus an obvious curvature.

The quadratic model provides a better fit, as indicated by the coefficient of determination r², which has increased from 73.6% to 82.2%, and the standard error of estimate s_e, which has decreased from $2,734 to $2,281.

[Chart: Scatterplot of Residual vs Fit for the linear model]


Regression output for the quadratic model

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9064        0.8216      0.8108               2280.800

ANOVA Table      Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained        2                     790511518.3       395255759.1        75.9808    < 0.0001
Unexplained      33                    171667570.7       5202047.597

Regression Table    Coefficient    Standard Error    t-Value    p-Value    Lower 95%     Upper 95%
Constant            5792.7983      4763.0585         1.2162     0.2325     -3897.7171    15483.3137
Units               98.3504        17.2369           5.7058     0.0000     63.2817       133.4191
(Units)^2           -0.0600        0.0151            -3.9806    0.0004     -0.0906       -0.0293

(d) Interpret the regression coefficients.

The interpretation of the y-intercept is that the predicted cost for zero units of electricity produced is $5,792.80.

There is no easy way to interpret the slope coefficients in a quadratic equation. For example, you can't conclude from the 98.35 coefficient of Units that Cost increases by 98.35 dollars when Units increases by one. The reason is that when Units increases by one, (Units)² doesn't stay constant; it also increases. You can instead provide a qualitative description of the relationship between x and y from the signs of the coefficients b₁ and b₂. For example, if b₁ is positive and b₂ is negative (as in our example), then y increases as x increases, but the marginal rate of change decreases as x increases.
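That qualitative statement can be made precise: the marginal effect of Units is the derivative b₁ + 2·b₂·Units of the fitted quadratic. A small sketch with the fitted coefficients:

```python
b1, b2 = 98.3504, -0.0600        # fitted coefficients of Units and (Units)^2

def marginal_cost(units):
    # Derivative of b1*units + b2*units**2: approximate change in Cost
    # when one more Unit is produced at the given production level.
    return b1 + 2 * b2 * units

print(round(marginal_cost(300), 2), round(marginal_cost(700), 2))
```

At 300 units the marginal cost is about $62 per unit, but at 700 units it has fallen to about $14 per unit, which is the decreasing-marginal-cost behavior described above.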

[Chart: Quadratic Fit of Cost versus Units]


(e) Test the significance of the quadratic effect.

To test

H₀: β₂ = 0 (Including the quadratic term does not significantly improve the model.)
Hₐ: β₂ ≠ 0 (Including the quadratic term significantly improves the model.)

we use the test statistic t = –3.98 with df = n – k – 1 = 36 – 2 – 1 = 33 and P-value = 0.0004. The small P-value indicates that the quadratic effect is significant.
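The t statistic and its two-sided P-value follow directly from the coefficient and its standard error in the output; a quick sketch assuming SciPy:

```python
from scipy.stats import t

t_stat = -0.0600 / 0.0151            # coefficient over its standard error
p_value = 2 * t.sf(abs(t_stat), 33)  # two-sided P-value, df = 36 - 2 - 1
print(round(t_stat, 2), round(p_value, 4))
```

(The output's –3.9806 differs slightly from the ratio of the rounded values, because the software works with unrounded coefficients.)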

Notes:

1. The coefficient of (Units)², –0.0600, is negative and it makes the parabola bend downward. This produces the decreasing marginal cost behavior, where every extra unit of electricity incurs a smaller cost. Actually, the curve described by the regression equation eventually goes downhill for large values of Units, but this part of the curve is irrelevant because the company evidently never produces such large quantities.

2. You should not be fooled by the small magnitude of this coefficient. Remember that it is the coefficient of Units squared, which is a large quantity. Therefore, the effect of the product 0.0600(Units)² is sizable.

(f) To examine the possibility for a logarithmic fit, a new variable – Log(Units), the natural logarithm of Units – has been created. The output from a regression of Cost against Log(Units) is shown on the next page. Interpret the slope of the regression line.

The estimated regression equation is

Predicted Cost = –63993 + 16654 Log(Units)

Reminder: If b is the coefficient of the log of x, then the expected change in y when x increases by 1% is approximately 0.01 times b.

In the present case, you can interpret the slope coefficient as follows. Suppose that Units increases by 1%, for example, from 600 to 606. Then the regression equation implies that the expected Cost will increase by approximately (0.01)(16654) = 166.54 dollars. In words, every 1% increase in Units is accompanied by an expected $166.54 increase in Cost.

Note that for larger values of Units, a 1% increase represents a larger absolute increase (from 700 to 707 instead of from 600 to 606, say). But each such 1% increase entails the same increase in Cost. This is another way of describing the decreasing marginal cost property.
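The 0.01·b rule of thumb can be checked numerically: the change implied exactly by the equation for a 1% rise in Units is b·ln(1.01), which is very close to 0.01·b:

```python
import math

b = 16653.6                   # fitted coefficient of Log(Units)
approx = 0.01 * b             # rule of thumb: Cost change per 1% rise in Units
exact = b * math.log(1.01)    # exact change implied by the equation
print(round(approx, 2), round(exact, 2))
```

The two values differ by less than a dollar, so the approximation is quite accurate for small percentage changes.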


Regression output for the logarithmic model

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.8931        0.7977      0.7917               2392.8

ANOVA Table      Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio     p-Value
Explained        1                     767506900.9       767506900.9        134.0471    < 0.0001
Unexplained      34                    194672188.1       5725652.59

Regression Table    Coefficient    Standard Error    t-Value    p-Value     Lower 95%    Upper 95%
Constant            -63993.3       9144.3            -6.9981    < 0.0001    -82576.8     -45409.8
Log(Units)          16653.6        1438.4            11.5779    < 0.0001    13730.4      19576.7

(g) Compare the quadratic and the logarithmic fits.

To the naked eye, the logarithmic curve appears to be similar to, and about as good a fit as, the quadratic curve. However, the values of r², adjusted r², and s_e indicate that the logarithmic fit is not quite as good as the quadratic fit:

Model          r²       adj r²    s_e
Quadratic      82.2%    81.1%     $2,281
Logarithmic    79.8%    79.2%     $2,393

    The advantage of the logarithmic equation is that it is easier to interpret.

[Chart: Logarithmic Fit of Cost versus Units]


    Example 3

    Meddicorp

Meddicorp Company sells medical supplies to hospitals, clinics, and doctors' offices. The company currently markets in three regions of the United States: the South, the West, and the Midwest. These regions are each divided into many smaller sales territories.

Meddicorp management is concerned with the effectiveness of a new bonus program. This program is overseen by regional sales managers and provides bonuses to salespeople based on performance. Management wants to know if the bonuses paid in 2010 were related to sales. (Obviously, if there is a relationship here, the managers expect it to be a direct – positive – one.) In determining whether this relationship exists, they also want to take into account the effects of advertising, market share, and competitor's sales. The variables to be used in the study include:

y = Sales – Meddicorp sales (in thousands of dollars) in each territory for 2010
x₁ = Adv – the amount Meddicorp spent on advertising in each territory (in hundreds of dollars) in 2010
x₂ = Bonus – the total amount of bonuses paid in each territory (in hundreds of dollars) in 2010
x₃ = MktShare – percentage of the market share currently held by Meddicorp in each territory
x₄ = Compet – largest competitor's sales (in thousands of dollars) in each territory

Data for a random sample of 25 of Meddicorp's sales territories are contained in the file Meddicorp.xlsx.

Territory    Sales      Adv       Bonus     MktShare    Compet
1            963.50     374.27    230.98    33          202.22
2            893.00     408.50    236.28    29          252.77
...
24           1583.75    583.85    289.29    27          313.44
25           1124.75    499.15    272.55    26          374.11

(a) What is the hypothesized regression model?

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + ε

(b) Interpret the equation of the true population surface.

The population regression equation

μ_y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄

shows that the conditional mean of y given x₁, x₂, x₃, and x₄ is a point on the four-dimensional hyperplane described by β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄.


(c) Below are the least squares regression results. Conduct the F test for overall fit of the regression.

Regression of y on x₁ (Adv), x₂ (Bonus), x₃ (MktShare), x₄ (Compet)

    Regression Statistics

    Multiple R 0.9269

    R Square 0.8592

    Adjusted R Square 0.8310

    Standard Error 93.7697

    Observations 25

    ANOVA

    df SS MS F Significance F

    Regression 4 1073118.5420 268279.6355 30.5114 0.0000

    Residual 20 175855.1980 8792.7599

    Total 24 1248973.7400

    Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

    Intercept -593.5375 259.1959 -2.2899 0.0330 -1134.2105 -52.8644

    Adv 2.5131 0.3143 7.9966 0.0000 1.8576 3.1687

    Bonus 1.9059 0.7424 2.5673 0.0184 0.3574 3.4545

    MktShare 2.6510 4.6357 0.5719 0.5738 -7.0188 12.3208

    Compet -0.1207 0.3718 -0.3247 0.7488 -0.8963 0.6549

The hypotheses are:

H₀: β₁ = β₂ = β₃ = β₄ = 0
Hₐ: At least one βⱼ ≠ 0

Because of the small P-value (0.0000) for the F statistic (= 30.51), we reject the null hypothesis and conclude that at least one of the regression slopes (β₁, β₂, β₃, β₄) is not equal to zero. This means that at least one of the variables (x₁, x₂, x₃, x₄) is important in explaining the variation in y.
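The overall F statistic and its P-value can be reproduced directly from the ANOVA table above; a sketch assuming SciPy:

```python
from scipy.stats import f

msr, mse = 268279.6355, 8792.7599    # mean squares from the ANOVA table
F = msr / mse                        # overall F statistic, about 30.51
p_value = f.sf(F, 4, 20)             # df1 = k = 4, df2 = n - k - 1 = 20
print(round(F, 4), p_value)
```

The survival function f.sf gives the upper-tail area, which is what Excel reports as Significance F.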

(d) At the 0.05 significance level, test the significance of the relationship between y and each of the explanatory variables.

The P-values for the four t tests are:

0.00 for Adv, 0.02 for Bonus, 0.57 for MktShare, 0.75 for Compet

Thus, the two explanatory variables x₁ (amount spent on advertising) and x₂ (amount of bonuses) are related to y (sales). The variables x₃ (market share) and x₄ (competitor's sales) are not useful in explaining any of the variation in y (sales) and should be excluded from the model.


(e) Below is the regression output for the model with x₁ (amount spent on advertising) and x₂ (amount of bonuses). Interpret the estimated regression equation and its slope coefficients.

Regression of y on x₁ (Adv), x₂ (Bonus)

Regression Statistics

Multiple R 0.9246
R Square 0.8549
Adjusted R Square 0.8418
Standard Error 90.7485
Observations 25

ANOVA

df SS MS F Significance F

Regression 2 1067797.3206 533898.6603 64.8306 0.0000
Residual 22 181176.4194 8235.2918
Total 24 1248973.7400

    Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

    Intercept -516.4443 189.8757 -2.7199 0.0125 -910.2224 -122.6662

    Adv 2.4732 0.2753 8.9832 0.0000 1.9022 3.0441

    Bonus 1.8562 0.7157 2.5934 0.0166 0.3719 3.3405

After rounding, the least squares regression equation describing the relationship between sales and the two explanatory variables may be written

Predicted Sales = –516.4 + 2.47 Adv + 1.86 Bonus

This equation can be interpreted as providing an estimate of mean sales for a given level of advertising and bonus payment.

If bonus payment is held fixed, the equation shows that mean sales tends to rise by $2,470 (2.47 thousands of dollars) for each $100 spent on ads.

If advertising is held fixed, the equation shows that mean sales tends to rise by $1,860 (1.86 thousands of dollars) for each $100 of bonus paid.

[Charts: Residual Plot of Residuals versus Predicted Sales; Scatterplots of Sales versus Adv and Sales versus Bonus]


(f) The best subsets regression procedure has been performed using all four explanatory variables. Below is a summary of the results. Which is the best model according to the best subsets regression technique?

Variables in the Regression     k + 1    Cp        r²       adj r²    s_e
Adv                             2        5.90      0.811    0.802     101.42
Bonus                           2        75.19     0.323    0.293     191.76
Compet                          2        100.85    0.142    0.105     215.83
MktShare                        2        120.97    0.001    0.000     232.97
Adv, Bonus                      3        1.61      0.855    0.842     90.75
Adv, MktShare                   3        7.66      0.812    0.795     103.23
Adv, Compet                     3        7.74      0.812    0.795     103.38
Bonus, Compet                   3        68.03     0.387    0.332     186.51
Bonus, MktShare                 3        76.46     0.328    0.267     195.33
MktShare, Compet                3        100.18    0.161    0.085     218.20
Adv, Bonus, MktShare            4        3.11      0.859    0.838     91.75
Adv, Bonus, Compet              4        3.33      0.857    0.836     92.26
Adv, MktShare, Compet           4        9.59      0.813    0.786     105.52
Bonus, MktShare, Compet         4        66.95     0.409    0.325     187.48
Adv, Bonus, MktShare, Compet    5        5.00      0.859    0.831     93.71

Recall that small values of Cp, and values close to k + 1, are of interest in choosing good sets of explanatory variables.

There are four competing models with relatively small Cp values:

Variables in the Regression     k + 1    Cp      r²       adj r²    s_e
Adv, Bonus                      3        1.61    0.855    0.842     90.75
Adv, Bonus, MktShare            4        3.11    0.859    0.838     91.75
Adv, Bonus, Compet              4        3.33    0.857    0.836     92.26
Adv, Bonus, MktShare, Compet    5        5.00    0.859    0.831     93.71

The smallest Cp value is for the regression with Adv and Bonus as explanatory variables. It has a Cp value of 1.61 and explains 85.5% of the variation in sales. Note that only modest increases in r² are achieved in the other three models.

The adjusted r² is highest and the standard error of estimate is smallest for the regression with Adv and Bonus, again supporting this model as best. Therefore, the best subsets procedure suggests using this model.
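A Cp value in the table can be reproduced with the standard Mallows' Cp formula, Cp = SSE_p / MSE_full – (n – 2(p + 1)), using the SSE of the subset model and the MSE of the full four-variable model shown earlier:

```python
n = 25
mse_full = 8792.7599        # MSE of the model with all four predictors
sse_subset = 181176.4194    # SSE of the Adv, Bonus model
p = 2                       # number of predictors in the subset

cp = sse_subset / mse_full - (n - 2 * (p + 1))   # Mallows' Cp
print(round(cp, 2))
```

This reproduces the 1.61 reported for the Adv, Bonus row.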


(g) Below is the StatTools output from forward selection, backward elimination, and stepwise regression when applied to the Meddicorp data with P-value to enter = 0.05 and P-value to leave = 0.10. Which model appears to be the best?

Forward selection

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9246        0.8549      0.8418               90.7485

ANOVA Table      Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained        2                     1067797.3206      533898.6603        64.8306    < 0.0001
Unexplained      22                    181176.4194       8235.2918

Regression Table    Coefficient    Standard Error    t-Value    p-Value    Lower 95%    Upper 95%
Constant            -516.4443      189.8757          -2.7199    0.0125     -910.2224    -122.6662
Adv                 2.4732         0.2753            8.9832     0.0000     1.9022       3.0441
Bonus               1.8562         0.7157            2.5934     0.0166     0.3719       3.3405

Step Information    Multiple R    R-Square    Adjusted R-Square    StErr of Estimate    Entry Number
Adv                 0.9003        0.8106      0.8024               101.4173             1
Bonus               0.9246        0.8549      0.8418               90.7485              2

Backward elimination

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9246        0.8549      0.8418               90.7485

ANOVA Table      Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained        2                     1067797.3206      533898.6603        64.8306    < 0.0001
Unexplained      22                    181176.4194       8235.2918

Regression Table    Coefficient    Standard Error    t-Value    p-Value    Lower 95%    Upper 95%
Constant            -516.4443      189.8757          -2.7199    0.0125     -910.2224    -122.6662
Adv                 2.4732         0.2753            8.9832     0.0000     1.9022       3.0441
Bonus               1.8562         0.7157            2.5934     0.0166     0.3719       3.3405

Step Information    Multiple R    R-Square    Adjusted R-Square    StErr of Estimate    Exit Number
All Variables       0.9269        0.8592      0.8310               93.7697
Compet              0.9265        0.8585      0.8382               91.7508              1
MktShare            0.9246        0.8549      0.8418               90.7485              2


Stepwise regression

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9246        0.8549      0.8418               90.7485

ANOVA Table      Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained        2                     1067797.3206      533898.6603        64.8306    < 0.0001
Unexplained      22                    181176.4194       8235.2918

Regression Table    Coefficient    Standard Error    t-Value    p-Value    Lower 95%    Upper 95%
Constant            -516.4443      189.8757          -2.7199    0.0125     -910.2224    -122.6662
Adv                 2.4732         0.2753            8.9832     0.0000     1.9022       3.0441
Bonus               1.8562         0.7157            2.5934     0.0166     0.3719       3.3405

Step Information    Multiple R    R-Square    Adjusted R-Square    StErr of Estimate    Enter or Exit
Adv                 0.9003        0.8106      0.8024               101.4173             Enter
Bonus               0.9246        0.8549      0.8418               90.7485              Enter

Regardless of the procedure used, the result is the same. The equation chosen is

Predicted Sales = –516.4 + 2.47 Adv + 1.86 Bonus

(h) Meddicorp markets in three regions of the United States: the South, the West, and the Midwest. Management of Meddicorp believes that, in addition to advertising and bonus, the regions it markets in may be important in explaining variation in sales. What is the equation of the regression model that includes the region information?

Since there are three regions, two indicator variables have to be included in the model. Let Midwest be the base category. Then the two dummy variables are:

x₃ = South = 1 if the territory is in the South, 0 otherwise
x₄ = West = 1 if the territory is in the West, 0 otherwise

Region     x₃    x₄
South      1     0
West       0     1
Midwest    0     0

The model is

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + ε


(i) The regression output for this model follows. Interpret the regression equation from the least squares output.

Regression of y on x₁ (Adv), x₂ (Bonus), x₃ (South), x₄ (West)

Regression Statistics

Multiple R 0.9730
R Square 0.9468
Adjusted R Square 0.9362
Standard Error 57.6254
Observations 25

ANOVA

df SS MS F Significance F

Regression 4 1182559.8959 295639.9740 89.0296 0.0000
Residual 20 66413.8441 3320.6922
Total 24 1248973.7400

    Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

    Intercept 435.0989 206.2342 2.1097 0.0477 4.9020 865.2958

    Adv 1.3678 0.2622 5.2165 0.0000 0.8208 1.9148

    Bonus 0.9752 0.4808 2.0281 0.0561 -0.0278 1.9781

    South -257.8916 48.4129 -5.3269 0.0000 -358.8792 -156.9040

    West -209.7457 37.4203 -5.6051 0.0000 -287.8032 -131.6883

The regression equation is

Predicted Sales = 435.0989 + 1.3678 Adv + 0.9752 Bonus – 257.8916 South – 209.7457 West

The coefficient of each indicator variable represents the difference in the intercept between the base-level (Midwest) group and the indicated group. This can be expressed through the use of three separate equations:

South (x₃ = 1, x₄ = 0): ŷ = 435.0989 + 1.3678 Adv + 0.9752 Bonus – 257.8916 = 177.2073 + 1.3678 Adv + 0.9752 Bonus

West (x₃ = 0, x₄ = 1): ŷ = 435.0989 + 1.3678 Adv + 0.9752 Bonus – 209.7457 = 225.3532 + 1.3678 Adv + 0.9752 Bonus

Midwest (x₃ = 0, x₄ = 0): ŷ = 435.0989 + 1.3678 Adv + 0.9752 Bonus

For given amounts spent by Meddicorp on advertising and bonuses, the estimated mean sales in a territory that is in the South region will be $257,892 (257.8916 thousands of dollars) below the sales in a territory that is in the Midwest region.

For given amounts spent by Meddicorp on advertising and bonuses, the estimated mean sales in a territory that is in the West region will be $209,746 (209.7457 thousands of dollars) below the sales in a territory that is in the Midwest region.

[Chart: Residual Plot of Residuals versus Predicted Sales for the dummy-variable model]


(j) Predict the average sales in each region when advertising expenditures equal 500 hundreds of dollars and bonuses are 250 hundreds of dollars.

South: ŷ = 177.2073 + 1.3678(500) + 0.9752(250) = 1104.9073
West: ŷ = 225.3532 + 1.3678(500) + 0.9752(250) = 1153.0532
Midwest: ŷ = 435.0989 + 1.3678(500) + 0.9752(250) = 1362.7989

The mean sales figures ($1,104,907 for South, $1,153,053 for West, and $1,362,799 for Midwest) when advertising expenditures equal $50,000 and bonus payments equal $25,000 differ according to the coefficients of the dummy variables: the figure for South is $257,892 smaller than the figure for Midwest; the figure for West is $209,746 smaller than the figure for Midwest.
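The three predictions can be organized in a small helper function (a hypothetical convenience, not part of the original Excel workflow, using the coefficients from the output above):

```python
def predict_sales(adv, bonus, region):
    # Predicted Sales (in $1000s); Midwest is the base category,
    # so its dummy-variable offset is zero.
    offsets = {"South": -257.8916, "West": -209.7457, "Midwest": 0.0}
    return 435.0989 + 1.3678 * adv + 0.9752 * bonus + offsets[region]

for region in ("South", "West", "Midwest"):
    print(region, round(predict_sales(500, 250, region), 4))
```

The same advertising and bonus inputs produce three parallel predictions that differ only by the dummy-variable offsets.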

(k) Determine whether there is a significant difference in sales for territories in different regions.

Because the location of territories is measured by two variables in a group – x₃ (South) and x₄ (West) – the appropriate test is the partial F test:

H₀: β₃ = β₄ = 0
Hₐ: β₃ ≠ 0 and/or β₄ ≠ 0

The full model contains the dummy variables; the reduced model does not.

From the output for the regression of y on x₁ (Adv), x₂ (Bonus), x₃ (South), x₄ (West):
SSE(full) = 66413.8441 and MSE(full) = 3320.6922

From the output for the regression of y on x₁ (Adv) and x₂ (Bonus):
SSE(reduced) = 181176.4194

The test statistic is:

F = [SSE(reduced) – SSE(full)] / (number of extra terms) / MSE(full)
  = [(181176.4194 – 66413.8441) / 2] / 3320.6922
  = 57381.2877 / 3320.6922
  = 17.2799

n = 25, k = 4, j = 2; df1 = k – j = 4 – 2 = 2, df2 = n – k – 1 = 25 – 4 – 1 = 20

P-value = FDIST(17.2799, 2, 20) = 0.000044

The P-value is very small and we reject the null hypothesis. Thus, at least one of the coefficients of the indicator variables is not zero. This means that there are statistically significant differences in average sales levels between the three regions in which Meddicorp does business.
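The partial F computation is easy to script; a sketch assuming SciPy, where f.sf plays the role of Excel's FDIST:

```python
from scipy.stats import f

sse_reduced, sse_full = 181176.4194, 66413.8441
mse_full, extra_terms = 3320.6922, 2      # South and West are the extra terms

F = (sse_reduced - sse_full) / extra_terms / mse_full
p_value = f.sf(F, 2, 20)                  # same as Excel's FDIST(F, 2, 20)
print(round(F, 4), round(p_value, 6))
```

This reproduces F = 17.2799 and the P-value of about 0.000044.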

(l) How useful is the group of the dummy variables x₃ (South) and x₄ (West)? Do they improve considerably the explanation of variation in sales?

Model                      r²     adj r²    s_e
Adv, Bonus                 85%    84%       91
Adv, Bonus, South, West    95%    94%       58

A comparison of r², adj r², and s_e for the reduced and full models shows that the indicator variables x₃ (South) and x₄ (West) carry a lot of explanatory power. They help to explain about 10% more of the variation in sales while reducing the standard error of estimate by about $33,000. Therefore, the dummy variables x₃ (South) and x₄ (West) should be retained in the model.

