Notes on Multiple Regression

In this set of notes we present different extensions to the regular multiple regression model formulated by y = β0 + β1x1 + β2x2 + … + βkxk + ε, as well as the requirements on the error variable that need to be satisfied. First, a general hypothesis-testing procedure is introduced that evaluates parts of the regression model. This procedure is used when the relevance of groups of independent variables is tested. Then we introduce extensions to the regression model that make this tool more versatile and useful. In particular, the dummy variable that deals with categorical (qualitative) data is presented, and then we allow for non-linear terms in the model. Finally, procedures to identify violations of the assumptions on the error variable are presented, and methods to remedy these violations are suggested.


The Partial F-Test

Some analyses call for adding or removing a group of variables together from a given linear regression model. To understand the impact of such an operation on the model in question we present the partial F-test. This test will be applied several times throughout these notes, so it is useful to open with the concept here.

Recall that the overall usefulness of the regression model is tested by the F-test, where all the coefficients β1, β2, …, βk are tested together. If the null hypothesis is rejected, one can then test each coefficient individually using a t-test. The procedure discussed here is designed to test sub-groups of variables. In short, it compares nested models, one called the complete model and the other the reduced model:

Complete model: y = β0 + β1x1 + β2x2 + … + βgxg + βg+1xg+1 + … + βkxk + ε
Reduced model:  y = β0 + β1x1 + β2x2 + … + βgxg + ε

Note that the reduced model is "nested" in the complete model. The procedure tests whether the addition of the group of variables xg+1, …, xk substantially improves the reduced model by adding information that was not there before, or equivalently, whether discarding the group takes significant information away from the complete model. A summary of the partial F-test analysis is provided next.

Definitions:
SSEC = the sum of squared errors for the complete model (when all the variables are included).

SSEC is obtained from the Excel output of the complete model run. SSER = the sum of squared errors for the reduced (partial) model; SSER is obtained from the Excel output of the reduced model run.

k = the number of variables in the complete model
g = the number of variables in the reduced model
n = total sample size

The hypotheses tested are:
H0: βg+1 = βg+2 = … = βk = 0
H1: At least one of these betas is not equal to zero

Intuition: If all the tested betas are equal to zero, the reduced model and the complete model are identical and no improvement is achieved by adding the group of variables. If at least one βi is not equal to zero, then there is a linear relationship between the corresponding xi and y; the group should be added, because the complete model is better.

F-statistic:

F = [(SSER - SSEC) / (k - g)] / MSEC


Rejection region: F > Fα, k-g, n-(k+1)

We'll use this test several times in these notes.
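A minimal sketch of this computation in Python with statsmodels, assuming the data sit in a pandas DataFrame; the function and column handling are illustrative, not part of the notes:

```python
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def partial_f_test(df, y, reduced_xs, added_xs):
    """Partial F-test: does the group `added_xs` improve the model
    that already contains `reduced_xs`?"""
    reduced = sm.OLS(df[y], sm.add_constant(df[reduced_xs])).fit()
    complete = sm.OLS(df[y], sm.add_constant(df[reduced_xs + added_xs])).fit()
    k = len(reduced_xs) + len(added_xs)   # variables in the complete model
    g = len(reduced_xs)                   # variables in the reduced model
    n = len(df)
    mse_c = complete.ssr / (n - k - 1)    # MSE of the complete model
    f_stat = ((reduced.ssr - complete.ssr) / (k - g)) / mse_c
    p_value = stats.f.sf(f_stat, k - g, n - k - 1)
    return f_stat, p_value
```

statsmodels can also produce the same statistic directly via complete.compare_f_test(reduced).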


Regression with Categorical Data

The dummy variable is a mathematical tool that makes it possible to include non-numerical information in the regression model. This makes the model more useful in decision-making settings, where data may or may not belong to a certain category.

Example 1

One would like to know the effects on assessed house value of the house size and of whether or not there is a fireplace in the house. Data from a sample of 15 houses were recorded and are provided below.

The independent variable Size is quantitative (in 1000 ft²), but the variable Fireplace is qualitative (Yes, No). We define a new variable FirePlace and let it have the value 1 when there is a fireplace and 0 when there is none. The original and recoded data sets are:


Value  Size(1000 ft²)  Fireplace
84.4   2.00            Yes
77.4   1.71            No
75.7   1.45            No
85.9   1.76            Yes
79.1   1.93            No
70.4   1.20            Yes
75.8   1.55            Yes
85.9   1.93            Yes
78.5   1.59            Yes
79.2   1.50            Yes
86.7   1.90            Yes
79.3   1.39            Yes
74.5   1.54            No
83.8   1.89            Yes
76.8   1.59            No

Value  Size(1000 ft²)  FirePlace
84.4   2.00            1
77.4   1.71            0
75.7   1.45            0
85.9   1.76            1
79.1   1.93            0
70.4   1.20            1
75.8   1.55            1
85.9   1.93            1
78.5   1.59            1
79.2   1.50            1
86.7   1.90            1
79.3   1.39            1
74.5   1.54            0
83.8   1.89            1
76.8   1.59            0


Now we can formulate a multiple regression model of the form:

y = β0 + β1Size + β2FirePlace + ε

The house value for a house with a fireplace is described by the equation Value = β0 + β1Size + β2(1) + ε, which reduces to Value = (β0 + β2) + β1Size + ε, while the house value for a house without a fireplace is described by Value = β0 + β1Size + β2(0) + ε, which reduces to Value = β0 + β1Size + ε. Comparing the two equations, the difference is in the intercept (β0 + β2 vs. β0), while the size contribution to the house value (β1) is the same for both houses. Since β1 is the slope, we have two parallel lines (see the graph) that describe the linear relationship for the two types of houses.

This formulation fits the case where we assume the house value changes with size at the same rate whether or not the house has a fireplace. When this is not the case, an additional term (an interaction term) should be added to the model. This will be discussed later.

[Graph: Price vs. Size; two parallel lines, one with intercept β0 + β2 (houses with a fireplace) and one with intercept β0 (houses without).]

After running the model in Excel we get the following output:


SUMMARY OUTPUT

Regression Statistics

Multiple R 0.900587

R Square 0.811057

Adjusted R Square 0.779567

Standard Error 2.262596

Observations 15

ANOVA

df   SS   MS   F   Significance F

Regression 2 263.703915 131.852 25.75565 4.55E-05

Residual 12 61.4320854 5.11934

Total 14 325.136

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept 200.0905 4.35165794 45.98029 7.31E-15 190.609 209.5719

Size 16.18583 2.57444171 6.287124 4.02E-05 10.57661 21.79506

Fireplace 3.852982 1.24122269 3.104183 0.009119 1.148591 6.557374

The estimated regression equation is: Value = 200.0905 + 16.186(Size) + 3.853(FirePlace).

For houses with a fireplace the equation is: Value = (200.0905 + 3.853) + 16.186(Size);

For houses without a fireplace the equation is: Value = 200.0905 + 16.186(Size)

Observing the two equations, when a distinction is made between houses with and without a fireplace, the assessed house value increases on average by $16,186 for each 1000 ft² increase in house size. When comparing two houses, on average a house with a fireplace is assessed $3,853 more than an equal-size house without a fireplace.

In validating the model we see that 81% of the variability in house value assessment is explained by the model, and from the very small significance F of 4.55(10⁻⁵) we know there is strong evidence in the data that at least one of the variables is linearly related to house value. We conclude that the model is very useful.
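The same model can also be fit outside Excel; a sketch in Python with statsmodels (the tool choice is ours, not the notes'), typing in the fifteen observations above:

```python
import pandas as pd
import statsmodels.api as sm

# The fifteen observations of Example 1, with Fireplace already coded 0/1.
df = pd.DataFrame({
    "Value": [84.4, 77.4, 75.7, 85.9, 79.1, 70.4, 75.8, 85.9,
              78.5, 79.2, 86.7, 79.3, 74.5, 83.8, 76.8],
    "Size": [2.00, 1.71, 1.45, 1.76, 1.93, 1.20, 1.55, 1.93,
             1.59, 1.50, 1.90, 1.39, 1.54, 1.89, 1.59],
    "FirePlace": [1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
})

model = sm.OLS(df["Value"], sm.add_constant(df[["Size", "FirePlace"]])).fit()
print(model.summary())   # coefficient and ANOVA tables analogous to the Excel output
```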

A linear regression with a categorical variable that has more than two levels

In the previous example we dealt with a qualitative variable that had two possible levels (values), Fireplace = Yes or No. There are cases where a categorical variable has three or more levels. This is handled by adding more dummy variables. For k levels we define k - 1 dummy variables; this is enough to represent all the levels. For example, three levels require two dummy variables, because level 1 = (1, 0), level 2 = (0, 1), and level 3 = (0, 0). So level 3 is defined as "not level 1" and "not level 2". The following example explains the concept.
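A sketch of the k - 1 dummy coding in Python with pandas; the toy Method values are illustrative:

```python
import pandas as pd

# Three levels -> two dummy columns; the level listed first ("Traditional")
# is dropped and becomes the (0, 0) baseline.
method = pd.Series(pd.Categorical(
    ["Traditional", "CD", "Web", "CD", "Traditional"],
    categories=["Traditional", "CD", "Web"]))

dummies = pd.get_dummies(method, drop_first=True, dtype=int)
print(dummies)   # columns CD and Web, exactly k - 1 = 2 dummies
```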


Example 2

The director of a training program for a large insurance company is evaluating three different methods of training underwriters: "Traditional", CD-ROM based, and Web-based. She divided 30 trainees into 3 randomly assigned groups of 10. Before the start of the training, each trainee is given a proficiency test in mathematics and computer skills. At the end of the training, all students take the same end-of-training exam. The results are stored in the file UNDERWRITERS.

Develop a multiple regression model that helps predict the score on the end-of-training exam, based on the score on the proficiency test and the method of training used.

Solution

The dependent variable is "End Score" and the independent variables are "Proficiency" (a quantitative variable) and "Method" (a categorical variable). Method has 3 values (Traditional, CD, Web), therefore we use 2 dummy variables. We selected (arbitrarily!) the variables CD and WEB to appear in the equation explicitly. This choice affects the equation, but not the predictions obtained from it. The multiple regression model becomes

End Score = β0 + β1Proficiency + β2CD + β3WEB + ε

An excerpt from the data file is provided below:

End-Score  Proficiency  Method       | End-Score  Proficiency  CD   WEB
14         94           Traditional  | 14         94           0    0
19         96           Traditional  | 19         96           0    0
17         98           Traditional  | 17         98           0    0
***        ***          ***          | ***        ***          ***  ***
38         80           CD           | 38         80           1    0
34         84           CD           | 34         84           1    0
43         90           CD           | 43         90           1    0
***        ***          ***          | ***        ***          ***  ***
55         92           Web          | 55         92           0    1
53         96           Web          | 53         96           0    1
55         99           Web          | 55         99           0    1

After running the data in Excel we get the following output:

SUMMARY OUTPUT


Regression Statistics
Multiple R          0.886397
R Square            0.785699
Adjusted R Square   0.760972
Standard Error      9.634874
Observations        30

ANOVA

             df   SS        MS        F         Significance F
Regression   3    8849.066  2949.689  31.77489  7.53E-09
Residual     26   2413.601  92.8308
Total        29   11262.67

             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept    -86.27        17.03405        -5.06456  2.83E-05  -121.284   -51.256
Proficiency  1.125782      0.158856        7.086787  1.59E-07  0.799248   1.452316
CD-ROM       30.37672      4.322997        7.026774  1.84E-07  21.49067   39.26277
WEB          22.28867      4.31543         5.164878  2.18E-05  13.41818   31.15917

The estimated multiple regression is End Score = -86.27 + 1.125Proficiency + 30.377CD + 22.289WEB.

The prediction equation for a trainee who uses the traditional method (CD = 0, WEB = 0) is End Score = -86.27 + 1.125Proficiency.
The prediction equation for a trainee who uses the CD-ROM method (CD = 1, WEB = 0) is End Score = (-86.27 + 30.377) + 1.125Proficiency.
The prediction equation for a trainee who uses the Web method (CD = 0, WEB = 1) is End Score = (-86.27 + 22.289) + 1.125Proficiency.

The model explains more than 78% of the variability in the end-of-training test scores. The model is very useful (significance F = 7.53(10⁻⁹)). Each of the variables contributes to the predictive power of the model, as reflected by the very small p-values of the t-tests.

Let us now interpret the coefficients of this equation:

b1 = 1.125: For every one-point increase in the proficiency test, the end score is expected to increase by 1.125 points, keeping the method of training the same. Comment: note that this rate of increase is common to all the methods.

b2 = 30.377: The end score of a trainee who uses the CD-ROM method is on average higher by 30.377 points than that of a trainee who uses the traditional method, if both trainees had the same proficiency score. To understand this statement, note that the equation describing the end score of a person who uses a CD can be written End Score (CD) = End Score (Traditional) + 30.377, provided the two persons have the same proficiency score (compare the equations above).

b3 = 22.289: The end score of a trainee who uses the Web method is on average higher by 22.289 points than that of a trainee who uses the traditional method, if both trainees had the same proficiency score.

Predicting with the model:

Suppose we want to predict the end score of a trainee who scored 100 in the proficiency exam and is enrolled in the Web-based training. The prediction is End Score = -86.27 + 1.125(100) + 30.377(0) + 22.289(1) = 48.519.

Usually we predict by means of a prediction interval or a confidence interval for the mean. Suppose we use a 95% confidence level.

- To provide an interval for the individual mentioned above (called a prediction interval), we use Data Analysis Plus and get the following result: Lower Limit = 27.78; Upper Limit = 69.41. In other words, the 95% prediction for the person's end score is roughly 48.6 ± 20.8.

- To provide a confidence interval for the mean end score of all trainees whose proficiency score is 100 and who take the Web-based course, we use Data Analysis Plus again and get: Lower Limit = 42.20; Upper Limit = 55. Note that this interval is narrower than the prediction interval for a single individual.
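The notes obtain these two intervals from Data Analysis Plus; in Python the same pair comes from statsmodels' get_prediction. A sketch on synthetic stand-in data (the numbers below are made up, not the UNDERWRITERS file):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Synthetic stand-in for the 30 trainees.
df = pd.DataFrame({
    "Proficiency": rng.uniform(80, 100, 30).round(),
    "CD": [0] * 10 + [1] * 10 + [0] * 10,
    "WEB": [0] * 20 + [1] * 10,
})
df["EndScore"] = (-86 + 1.1 * df["Proficiency"] + 30 * df["CD"]
                  + 22 * df["WEB"] + rng.normal(0, 9, 30))

X = sm.add_constant(df[["Proficiency", "CD", "WEB"]])
res = sm.OLS(df["EndScore"], X).fit()

# A Web-based trainee with a proficiency score of 100.
new = pd.DataFrame({"const": [1.0], "Proficiency": [100.0], "CD": [0.0], "WEB": [1.0]})
pred = res.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean_ci_lower", "mean_ci_upper"]])  # confidence interval for the mean
print(pred[["obs_ci_lower", "obs_ci_upper"]])    # prediction interval (always wider)
```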

A linear regression with more than one categorical variable

When the data are classified by more than one type of categorization, more than one set of binary variables is needed, one set for each category type. The following example demonstrates this case.

Example 3

A study was made at Obrien Electronics Ltd. about salary and its relationship to different groups of employees in the firm. The study needed to cover a variety of factors that influence salary, such as years of experience, age, gender, and education. The data collected are recorded in the file SALARY.

(a) Set up a linear regression model that bases the salary paid on the variables mentioned above.
(b) Is there either gender or age 'pay discrimination' in the company?


(c) Based on your answer in part (b), use the better model (applying the parsimony principle if necessary) to predict the salary, one year from now, of a new 35-year-old female employee who has been working for 10 years with another company and who has earned an MBA degree.

Solution

(a) The model: Salary = β0 + β1Exp + β2MBA + β3Male + β4Over50 + ε

Variable definitions:
Exp = years of experience
MBA = Yes/No (1, 0)
Male = Yes/No (1, 0)
Over50 = Yes/No (1, 0)

(b) To answer the question about pay discrimination with respect to either age or gender we need to study the relevant (beta) coefficients. The hypotheses we need to set up are:

H0: β3 = β4 = 0
H1: At least one of them is not equal to zero.

These hypotheses are tested using the partial F-test procedure.

The complete model here is: Salary = β0 + β1Exp + β2MBA + β3Male + β4Over50 + ε
The reduced model is: Salary = β0 + β1Exp + β2MBA + ε

To run the complete model we use the whole data set. To run the reduced model we use only the columns Salary (for the y range) and Exp, MBA (for the x range). The relevant information from the complete model output is:

ANOVA

df SS MS F Significance F

Regression 4 10601.55 2650.387 45.89149 8.39E-10

Residual 20 1155.067 57.75334

Total 24 11756.61

From the reduced model output we have:

ANOVA

             df   SS        MS        F         Significance F
Regression   2    10262     5131.001  75.52607  1.4E-10
Residual     22   1494.61   67.93683
Total        24   11756.61


F = [(SSER - SSEC) / (k - g)] / MSEC = [(1494.61 - 1155.067) / (4 - 2)] / 57.75 = 2.94

F.05,2,20 = 3.49 (in Excel you can use the function =FINV(.05, 2, 20)).

Since 2.94 < 3.49 we have insufficient evidence to support H1. This means that β3 and β4 could be equal to zero, so adding the two variables Male and Over50 does not add information to the reduced model that was not already there. In terms of the specific variables, there is insufficient evidence to infer that there are differences in average pay between male and female employees, or between workers over and under 50, when the employees compared have the same experience and the same education.
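The same arithmetic in a few lines of Python, using only the two ANOVA tables above (scipy supplies the critical value that FINV returns in Excel):

```python
from scipy import stats

sse_r, sse_c = 1494.61, 1155.067      # residual SS from the two ANOVA tables
k, g, n = 4, 2, 25
mse_c = sse_c / (n - k - 1)           # 57.75, as in the complete-model output
f_stat = ((sse_r - sse_c) / (k - g)) / mse_c
f_crit = stats.f.ppf(0.95, k - g, n - k - 1)   # the Excel FINV(.05, 2, 20)
print(round(f_stat, 2), round(f_crit, 2))      # 2.94 < 3.49 -> do not reject H0
```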

Comments:

(i) The equation for the complete model is as follows:

Salary = 29.186 + 2.815Exp + 14.167MBA + 8.438Male - 6.085Over50

We can learn that, keeping the rest of the variables unchanged:

a. For each additional year of experience the salary increases by $2,815 on average (p-value = 8.94(10⁻¹¹), so the variable is a very strong predictor).
b. An employee with an MBA degree earns on average $14,167 more than an employee without this degree, if they are equal in all other characteristics (p-value = 0.00732).
c. From the equation (but see the discussion in comment (ii) below) it seems a male employee earns $8,438 more than a female worker with similar characteristics (p-value = 0.028).
d. From the equation (but see the discussion in comment (ii) below) it seems an employee over 50 years old earns on average $6,085 less than a similar employee younger than 50.

(ii) It is interesting to observe the complete model and the significance of the individual variables Male and Over50. When testing the significance of the variable Male, the p-value is .0286. This value is smaller than .05, so at the 5% significance level we can argue that Male contributes to improving a model that contains all the other variables (which means the variable Over50 is already 'in'). Similarly, the p-value for the test of Over50 is .17, so this variable does not improve a model that already includes the variable Male. The partial F-test answered a different question: should we add any of these variables when none of them has been added yet? The answer was no.


(c) Based on the parsimony principle, we use the smallest model that is still adequate. Thus we select the reduced model. The equation is:

Salary = 33.47 + 2.79Exp + 11.65MBA

For the new employee: Salary = 33.47 + 2.79(10) + 11.65(1) ≈ 73.06, that is, about $73,060.

Different types of difficulties may arise when trying to obtain reliable regression models. Some difficulties result from violations of basic assumptions about the data and the regression model; these will be dealt with later. For now we look at difficulties that result from characteristics of the input data set: for example, mathematical difficulties created by X variables of very different magnitudes, which cause problems in the numerical procedure that produces the equation itself; or "physical" difficulties that stem from the different units defining the variables (making it hard to compare the regression coefficients).

A remedy to such problems is often found by using transformations, either on the dependent variable, the independent variables, or both. This means that instead of regressing on the original variables, we regress on functions of the original variables. Which transformation to use depends on the particular problem (sometimes one transformation can cure more than one problem). We will demonstrate a few transformations throughout these notes. Let us start with the standardizing transformation.

The Standardized Regression

This transformation may help obtain a good model when the variables are measured on different scales of magnitude, and it provides meaningful comparisons between the effects the independent variables have on the dependent variable.

The transformation is based on centering the variables and then measuring their values in terms of standard deviations. This rescales all the variables to spread out over roughly the interval [-3, 3], which helps control the amount of error, and it also expresses the β coefficients in the same units.

To transform the y values, we use the following formula:

Y′ = (1/√(n-1)) · (Y - Ȳ)/SY

where Ȳ is the mean value of Y, and SY is the standard deviation of Y.

To transform each x variable, we use the following formula:

X′ = (1/√(n-1)) · (X - X̄)/SX


where X̄ is the mean of each independent variable, and SX is its standard deviation, calculated separately for each Xi.

After finding the Y' and X' transformed variables, the standardized regression model is:

Y′ = β1′X1′ + β2′X2′ + … + βk′Xk′ + ε

From this model, we can show that the standardized regression parameters are related to the original regression coefficients by scaling factors. Specifically, the coefficients are related as follows:

βi = (SY / Si) βi′   (i = 1, …, k, where k is the number of predictor variables)

and

β0 = Ȳ - β1X̄1 - β2X̄2 - … - βkX̄k
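A sketch of the whole recipe in Python (pandas/statsmodels): standardize, fit without an intercept, then convert the coefficients back. The function names are ours:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def standardize(v):
    """The notes' transformation: center, divide by the standard
    deviation, then scale by 1/sqrt(n - 1)."""
    return (v - v.mean()) / (v.std(ddof=1) * np.sqrt(len(v) - 1))

def standardized_regression(X, y):
    """X: DataFrame of predictors, y: Series. Returns the standardized
    betas plus the equivalent original-scale coefficients."""
    res = sm.OLS(standardize(y), X.apply(standardize)).fit()  # no intercept
    b = res.params * y.std(ddof=1) / X.std(ddof=1)   # b_i = (S_Y / S_i) b_i'
    b0 = y.mean() - (b * X.mean()).sum()             # recover the intercept
    return res.params, b0, b
```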

Example 4

a. Develop a multiple regression model to predict the appraised house value based on the land area of the property, the house size, and the property age. The data appear in the file APPRAISAL. Answer the following questions:
(a) Why does the regression coefficient b0 have no practical meaning in the context of this problem?
(b) Interpret the coefficients.

b. Note that the measurement units of the independent variables differ markedly. Note also that there is some difficulty determining to which variable the house appraisal is more sensitive. Develop a standardized regression model and interpret its coefficients.

Solution

a. From Excel we get the following model: Appraisal = 139.79 + 276.08Acres + .129Size - 1.399Years.

(a) The coefficient b0 = 139.79 could be interpreted as the appraised value of a 'no-land' property, of size 0 (?), that has just been built (Age = 0). Obviously this is a twisted interpretation. Besides, those values of the independent variables lie outside the range covered by the sample observations, and therefore should not be used.

(b) b1 = 276.08: keeping house size and age fixed, every additional acre adds $276,080 to the house appraisal. b2 = .129: keeping the other variables unchanged, every additional square foot contributes $129 to the house appraisal. b3 = -1.399: keeping the other variables unchanged, every additional year of age reduces the appraisal by $1,399.


b. Indeed the coefficients cover a wide range of values (from -1.399 to +276.08). In addition, comparing changes in the appraisal based on land size, house size, and age is tricky because of the different units. In the current model it seems land is the most effective variable, but as we redefine the variables by transforming them to their standardized counterparts, the picture changes.

The standardizing transformation: for each variable we calculate its mean and standard deviation. Using Excel we have:

Mean Appraisal = 389.85    Standard deviation Appraisal = 120.388
Mean Acres = .246          Standard deviation Acres = .137
Mean Size = 1978.83        Standard deviation Size = 550.87
Mean Age = 50              Standard deviation Age = 22

Now we standardize all the variables. For example, the first observation,

Appraised Value   Land (acres)   House Size (sq ft)   Age
466               0.2297         2448                 46

transforms to

Appraisal′ = (1/√(30-1)) · (466 - 389.85)/120.388 = 0.11746
Acres′ = (1/√(30-1)) · (0.2297 - 0.246)/0.137 = -0.02245

(fill in the corresponding expressions for Size′ and Age′). Observing the transformed data, note how all the variables now cover about the same range of values. Here is an excerpt:

Appraisal′   Acres′      Size′       Age′
0.11746      -0.02245    0.158152    -0.03248
-0.03987     -0.03665    -0.01242    0.008807
0.060389     -0.11268    0.031743    -0.17284

After running the transformed data we have: Appraisal′ = .315Acres′ + .589Size′ - .261Age′.

From the standardized equation we see that the house size contributes the most to explaining the house appraisal value. This statement is meaningful because each coefficient expresses the contribution to the standardized appraisal of a one-unit increase in the corresponding standardized variable!


Regression Model with interaction

There are cases where the rate at which y changes when one independent variable, say x1, increases by 1 unit depends on the particular value of another independent variable, say x2. We then say there is interaction between x1 and x2. In this case we need to add a new variable to the regression model to capture this change-of-slope effect. We do it by adding the term x1x2 to the model, which in general becomes

y = β0 + β1x1 + β2x2 + … + βkxk + βk+1x1x2 + βk+2x1x3 + … + ε

Note that not all the possible products need to be included, only those relevant to the case studied. Higher-level products such as x1x2x3 may be included too, but are omitted here. In practice the interaction is just one more column of the data set, as sketched below.
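A sketch in Python (the column names are illustrative, not from the notes):

```python
import pandas as pd
import statsmodels.api as sm

def fit_with_interaction(df, y, x1, x2):
    """Fit y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)."""
    X = df[[x1, x2]].copy()
    X[x1 + "*" + x2] = df[x1] * df[x2]   # the interaction column
    return sm.OLS(df[y], sm.add_constant(X)).fit()
```

The statsmodels formula interface expresses the same model more compactly: smf.ols("y ~ x1 * x2", data=df) expands to both main effects plus their product.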

Let us demonstrate the concept, and the effects of such an interaction on the linear relationship, with two examples. One includes quantitative variables only; the other includes a categorical variable as well.

Example 5

A consumer organization wants to develop a regression model to predict MPG (miles per gallon) based on the horsepower of the car's engine and the car weight (in pounds). A sample of 50 recent car models was selected, and the relevant data were recorded and saved in the file MPG. Here is an excerpt of this data file:

MPG   Horsepower  Weight
43.1  48          1985
19.9  110         3365
19.2  105         3535
17.7  165         3445

Initially the model formulated was MPG = β0 + β1HP + β2W + ε. Then an analyst working for the organization suggested that the rate at which gas consumption changes with the engine horsepower depends on the car weight; that is, horsepower and weight might interact in affecting MPG. A new model was formulated:

MPG = β0 + β1HP + β2W + β3(HP)(W) + ε

and a new column was added to the data file to express the interaction values in the sample. Here is an excerpt:

MPG   Horsepower  Weight  HP*W
43.1  48          1985    95280
19.9  110         3365    370150
19.2  105         3535    371175

After the data set was run in Excel again, the following output was obtained:


SUMMARY OUTPUT

Regression Statistics

Multiple R 0.894047

R2 0.799319

Adjusted R2 0.786232

Standard Error 3.778069

Observations 50

ANOVA

df SS MS F Significance F

Regression 3 2615.247 871.748 61.0733 4.48E-16

Residual 46 656.5952 14.2738

Total 49 3271.842

Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%

Intercept 85.07138 8.313213 10.2332 1.94E-13 68.33775 101.805

Horsepower -0.45077 0.102861 -4.382 6.75E-05 -0.65782 -0.24372

Weight -0.01524 0.00278 -5.481 1.72E-06 -0.02083 -0.00964

HP*W 0.0001 2.97E-05 3.38210 0.00147 4.07E-05 0.00016

The equation obtained is MPG = 85.07 - .45HP - .015W + .0001(HP)(W). Almost 80% of the variation in MPG among the cars is explained by this model (r² = .799); the model is very useful because significance F = 4.48(10⁻¹⁶), an extremely small p-value. This means that at least one of the independent variables is linearly related to MPG. The contribution of the interaction to the model can be assessed by testing the coefficient β3 of the interaction term. Clearly, if β3 = 0 the interaction has no effect, so we set up the two hypotheses:

H0: β3 = 0
H1: β3 ≠ 0

The p-value of this t-test is .00147. At the 5% significance level the null hypothesis is easily rejected; there is overwhelming evidence to infer that β3 is not equal to zero. This translates to the conclusion that including the interaction term reduces the errors, and thus contributes significantly to the precision of the model, in the presence of the other variables!

Interpreting the coefficients in the presence of interaction

Caution must be used when interpreting the coefficients of the regression when interaction is involved. Although the main-effect variables HP and W are present, one should not interpret their coefficients as the rate of change of the dependent variable with respect to each independent variable. More specifically, the coefficient b1 = -.45 is not the amount of reduction in MPG per one-unit increase of HP. To understand why, look at the equation when W = C (C is some constant).


MPG = 85.07 - .45HP - .015C + .0001(HP)(C) = (85.07 - .015C) + (-.45 + .0001C)HP. So the rate of change of MPG with respect to HP is -.45 + .0001C (and not -.45)! The relationship between HP and MPG is linear, but both the intercept and the slope of the line change with C (that is, both depend on W). As an illustration, observe the relationships for two values of W: 2500 and 3500.

(i) MPG = 85.07 - .015*2500 + (-.45 + .0001*2500)HP = 47.57 - .2HP(ii) MPG = 85.07 - .015*3500 + (-.45 + .0001*3500)HP = 32.57 - .1HP

The following is a graphical description of the two lines. As you can see, both the intercept and the slope change when the car weight changes.

[Graph: MPG vs. HP plotted for W = 2500 and W = 3500; the lines MPG(2500) = 47.57 - .2HP and MPG(3500) = 32.57 - .1HP differ in both intercept and slope.]
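The same substitution as a tiny sketch, with the coefficients copied from the output above:

```python
b0, b1, b2, b3 = 85.07, -0.45, -0.015, 0.0001   # from the regression output

def mpg_line(weight):
    """Intercept and slope of the MPG-vs-HP line for a fixed car weight."""
    return b0 + b2 * weight, b1 + b3 * weight

for w in (2500, 3500):
    intercept, slope = mpg_line(w)
    print(w, round(intercept, 2), round(slope, 3))   # 47.57, -0.2 and 32.57, -0.1
```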

Important: if an interaction term was added to the model, and the model was found useful overall by the F-test, you should test the significance of the interaction next. If the interaction contributes significant information (tested by the t-test), there is no need to further test the main effects of the variables that interact; the interaction already establishes their importance.

We’ll revisit this model later as one of the summary problems.


Example 6

Zagat's publishes restaurant ratings for various locations in the US. The file RESTAURANTS contains the Zagat ratings for food, décor, and service, and the average cost a customer pays, for 50 restaurants located in an urban area and 50 restaurants located in a suburban area.

a. Develop a regression model to predict the price per person based on the ratings of food, décor, and service.

b. Add the location variable (urban area or not) to the model developed in part a. Test whether this variable statistically improves the first model.

c. Is a model that also includes the three pairwise interactions of the rating variables (Food and Décor, Food and Service, Décor and Service) better than the model developed in part b?

Solution

a. The model developed results in the following equation: Cost = -10.9435 + .0179Food + .9241Décor + 1.6697Service. The model is very useful since significance F = 1.45(10⁻¹²), but in order to make predictions we would like to improve the coefficient of determination, currently r² = .45. The variable Food is (astonishingly) insignificant in predicting the cost to the customer (p-value = .96), which should be interpreted as: this variable does not add information to an existing model that includes the other two variables (Décor and Service).

b. When adding the categorical variable Location (0 for an urban area, 1 for a suburban area), the equation is Cost = -11.063 - .0198Food + .9146Décor + 1.9317Service - 7.7032Location. The location variable adds significant information to the previous model (p-value = 1.92(10⁻⁹)), and the predictive power of the model is now higher, since r² = .626. On average a customer is charged $7.70 less in a suburban restaurant than in an urban restaurant with the same food, décor, and service ratings.

c. When adding the three interaction variables we need to add three columns to the input data. Here is an excerpt (for the first row, Food*Decor = (19)(21) = 399):

Food  Décor  Service  Locate  Food*Decor  Food*Servc  Décor*Servc  Price
19    21     18       0       399         342         378          50
18    17     17       0       306         306         289          38
19    16     19       0       304         361         304          43


After running the data we get the following output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.815692
R Square            0.665353
Adjusted R Square   0.639891
Standard Error      5.522773

Observations 100

ANOVA

             df   SS        MS        F         Significance F
Regression   7    5579.146  797.0209  26.13096  2.38E-19
Residual     92   2806.094  30.50102

Total 99 8385.24

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     52.87324      33.62885        1.572258  0.119324  -13.9166   119.663
Food          -6.28111      2.211796        -2.83982  0.005556  -10.6739   -1.88829
Décor         2.648502      2.222077        1.191904  0.236364  -1.76473   7.061738
Service       0.210025      3.09936         0.067764  0.946121  -5.94557   6.36562
Locate        -7.79361      1.127871        -6.91002  6.19E-10  -10.0337   -5.55356
Food*Decor    0.102159      0.154386        0.661713  0.509809  -0.20446   0.408782
Food*Servc    0.257491      0.115239        2.234416  0.027878  0.028617   0.486365
Decor*Servc   -0.21209      0.113407        -1.87013  0.064645  -0.43732   0.01315

The equation is Cost = 52.873 - 6.28Food + 2.65Décor + .21Service - 7.79Locate + .102Food*Décor + .257Food*Servc - .212Décor*Servc.

The predictive power has improved somewhat (r² = .66) and the model is very significant (significance F = 2.38(10⁻¹⁹)); now we need to decide about the relevance of the interactions. We added three terms at once (probably because the analyst had no prior suspicion about any one of them), so we can test whether these three variables together add information that was not there before. Such a test is advantageous because we can determine whether the added interaction group as a whole was useful. To perform the test we formulate the following two hypotheses:

H0: β5 = β6 = β7 = 0 (these are the coefficients attached to the three interaction terms)
H1: At least one beta is not equal to zero.

We use the partial F-test again. The complete model is the one we ran here, and the reduced model is the one we ran in part b. For convenience we re-formulate these models and use the previous runs to obtain SSER and SSEC.


The reduced model: Cost = β0 + β1Food + β2Décor + β3Service + β4Location + ε
The complete model: Cost = β0 + β1Food + β2Décor + β3Service + β4Location + β5F*D + β6F*S + β7D*S + ε

Observing the two computer outputs we have: SSER = 3313.339; SSEC = 2806.094; k = 7; g = 4; MSEC = 30.501.

FStat = [(3313.339 - 2806.094)/(7 - 4)]/30.501 = 5.54;  F.05,3,92 = 2.70

Since FStat > Fα we reject the null hypothesis and infer that adding the interaction terms adds information to the reduced model, making the complete model better. This test does not tell us which interaction(s) contribute to the improvement; for that, additional analysis is required.

Transformations and non-linear terms

Transformations are usually linked to the inclusion of non-linear terms in the regression model. This is needed under two scenarios: (i) the relationship between an independent variable and the dependent variable is non-linear; (ii) required conditions are violated. Transformations can be done by using functions of the original variables or by adding non-linear terms to the model. For practical reasons let us restrict the discussion to the quadratic, square-root, and logarithmic transformations. A partial list of popular transformations is provided at the end of this section.

A few examples will demonstrate how to transform the data and make predictions.

Example 7

The following market research study was conducted by a national chain of consumer electronics stores. To promote sales, the chain relies heavily on local newspaper advertising. A sample of 20 cities with similar populations and monthly sales totals was assigned different advertising budgets for one month. Sales were recorded and saved in the file AD-BUDGET.

Answer the following questions:

(a) Construct a scatter plot of advertising and sales.
(b) Fit a quadratic regression model and state the resulting equation.
(c) Predict the mean monthly sales for a city where the budget allocated to advertising is $20,000.
(d) At the 5% level of significance, is the quadratic model useful?
(e) At the 5% level of significance, does the quadratic model provide a better fit than the linear model?


Solution

a. The plot:

[Scatter plot: Sales vs. Budget; the points rise and then level off in a concave pattern.]

It seems a (concave) quadratic relationship exists between Budget and Sales.

b. The quadratic equation is obtained from the computer run:

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.934407

R Square 0.873117

Adjusted R Square 0.858189

Standard Error 0.110894

Observations 20

ANOVA

df SS MS F Significance F

Regression 2 1.438564 0.719282 58.49061 2.39E-08

Residual 17 0.209056 0.012297

Total 19 1.64762

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept 5.7845 0.11892 48.64188 1.08E-19 5.5336 6.0354

Budget 0.088786 0.018125 4.898518 0.000136 0.050545 0.127026

Budget-sq -0.00174 0.000593 -2.94028 0.009146 -0.00299 -0.00049

The equation is: Sales = 5.7845 + 0.0888Budget - 0.00174Budget².

c. Sales(Budget = 20) = 5.7845 + .0888(20) - .00174(20²) ≈ 6.86.

d. Test H0: β1 = β2 = 0
H1: At least one beta is not equal to zero


The significance F is 2.39(10⁻⁸), which is extremely small. The null hypothesis is rejected, so the quadratic model is useful (clearly so at the 5% significance level).

e. Test: H0: β2 = 0; H1: β2 ≠ 0. The p-value for this test is .0091 (very small). The null hypothesis is rejected (at the 5% significance level), which indicates that the quadratic term improves the fit of the model.
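A sketch of the quadratic fit in Python; the AD-BUDGET file is not reproduced here, so the function takes any DataFrame with the two columns:

```python
import pandas as pd
import statsmodels.api as sm

def fit_quadratic(df, y="Sales", x="Budget"):
    """Fit y = b0 + b1*x + b2*x^2 by adding a squared column."""
    X = pd.DataFrame({x: df[x], x + "_sq": df[x] ** 2})
    return sm.OLS(df[y], sm.add_constant(X)).fit()

# The point prediction of part (c), from the fitted equation:
b0, b1, b2 = 5.7845, 0.0888, -0.00174
print(b0 + b1 * 20 + b2 * 20 ** 2)   # about 6.86
```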

The following example calls for a different transformation, the square root (a quadratic term could have served there too).

Example 8

An agronomist designed a study in which tomatoes were grown using six different amounts of fertilizer per 1000 ft². Each amount was randomly applied to two of twelve plots of land. The results, including the yield of tomatoes (in pounds), were as follows:

Plot  Amount (lbs)  Yield     Plot  Amount (lbs)  Yield
1     0             6         7     60            46
2     0             9         8     60            50
3     20            19        9     80            48
4     20            24        10    80            54
5     40            32        11    100           52
6     40            38        12    100           58

a. Produce a scatter plot of the data.

b. Fit a square-root function to the data, where the independent variable (Amount) is transformed into the square root of the amount.
i. State the equation.
ii. Predict the mean yield for a plot of land fertilized with 70 pounds per 1000 ft².


Solution

a. The plot:

[Scatter plot: Yield vs. Amount; the yield rises quickly and then levels off, with some curvature toward the upper end.]

From the graph it seems there is some curvature toward the upper end.

b. The data are transformed by replacing Amount with its square root:

Plot  Amount    Yield     Plot  Amount     Yield
1     Sqrt(0)   6         7     Sqrt(60)   46
2     Sqrt(0)   9         8     Sqrt(60)   50
3     Sqrt(20)  19        9     Sqrt(80)   48
4     Sqrt(20)  24        10    Sqrt(80)   54
5     Sqrt(40)  32        11    Sqrt(100)  52
6     Sqrt(40)  38        12    Sqrt(100)  58

(i) After running the transformed data in Excel we get the equation: Yield = 4.666 + 5.068·Sqrt(Amount). Observe the graph below and notice how the input data now fall almost on a straight line. This is why the line fit is very good (r² = .94).

[Plot: Yield and Predicted Yield vs. Sqrt(Amount); the transformed points lie almost on a straight line.]

(ii) Yield(Amount = 70) = 4.666 + 5.068·Sqrt(70) ≈ 47.1.

Comment: We could get a good result by adding a quadratic term instead, as follows: Y = b0 + b1Amount + b2Amount². This would work because of the parabola-like curvature of the data.

In the following example we’ll add interaction terms along with quadratic terms to the model.

Example 9

A study of the psychological response of firefighters to "chemical fire" tried to explain the level of emotional distress (y) by the number of years on the job (x1) and by whether the firefighter was exposed to chemical fire (x2 = 1) or not (x2 = 0). The model suggested was

Y = β0 + β1x1 + β2x1² + β3x2 + β4x1x2 + β5x1²x2 + ε

As you see, both first-order and second-order interaction terms are included in the model, as well as a quadratic term for the number of years on the job. Answer the following general questions.

a. What hypothesis would you test to determine whether the rate of increase of emotional distress with experience is different for the two groups of firefighters?

Answer: H0: β4 = β5 = 0, because the two interaction terms are responsible for changing the slope at which y increases for different values of x2. For example, if x2 = 1, the rate at which y increases with x1 is β1 + 2β2x1 + β4 + 2β5x1; but if x2 = 0, the rate at which y increases with x1 is β1 + 2β2x1. If H0 is rejected, then at least one interaction term stays in the model and the rate at which y increases is affected. The actual analysis employs the partial F-test, as explained above.

b. What hypothesis would you test to determine whether there are differences in mean emotional distress attributable to the exposure groups?

Answer: H0: β3 = β4 = β5 = 0, because the exposure groups are defined by the variable x2, so we test the coefficients of all the terms in the equation where x2 appears. The actual analysis employs the partial F-test, as explained above.

Two more transformations are suggested that may help spread the data out more evenly, and thus may prevent situations where one or more data points are influential. This issue is discussed in the next section, so two demonstrations of these transformations will be shown later. The first transformation scatters data that is concentrated mainly on one side of the data range, and the second takes care of data concentrated in the middle of the data range.


A list of other transformations

1. Log transformation: Y′ = log(Y), for Y > 0. Used when (a) the variance of the error increases as Y increases, or (b) the distribution of the error variable is positively skewed.
2. Square transformation: Y′ = Y². Used when (a) the variance is proportional to the expected value of Y, or (b) the distribution of the error variable is negatively skewed.
3. Square-root transformation: Y′ = √Y, for Y > 0. Used when the variance is proportional to the expected value of Y.
4. Reciprocal transformation: Y′ = 1/Y. Used when the variance appears to increase significantly once Y increases beyond some critical value.


Diagnostics of the linear regression model

Our diagnostics relate to two main issues:

- Problems with the values in the data set
- Unsatisfied assumptions required for the linear regression model

Problems in the data set

Problems might occur when there are unusual data points, which may affect the regression model. It is important to identify such points and consider eliminating them. Two types of influential observations can be present:

a. Outliers ("vertical influence"): Outliers may affect the "height" at which the equation resides, that is, change the equation's intercept. We present two ways a data point can be determined to be an outlier:

i. The standard residual is larger than 3 in absolute value. [The standard residual is calculated by Excel as follows: Standard residual = Residual/Se, where Se is the standard deviation of the residuals. This is not quite the correct way to calculate standardized errors, but it is good enough for determining which observation is an outlier.] This approach is subjective, but three standard deviations have been shown to be enough to identify an observation as an outlier.

ii. The dummy variable approach: Observation i may still be considered an outlier even if its standard residual is smaller than 3, if it can be shown that a correction to the model that relates to Yi makes the model better. Add dummy variables to the model, one per suspected point. For the sake of simplicity assume there is one suspected outlier. The model then becomes

Y = β0 + β1X1 + … + βkXk + δ(Dummy) + ε

where Dummy gets the value 1 only for the suspected observation i and zero otherwise. If point i is indeed an outlier, then Yi falls far from where it would be expected based on the model, and the correction δ should improve the fit of the model (taking the prediction closer to the outlier's Y value). If the observation is not an outlier, such a correction is not needed. Thus we test H0: δ = 0 against H1: δ ≠ 0. Rejecting H0 in favor of H1 means a model with the correction δ is significantly better; that is, the correction is needed and the point is indeed an outlier. An example demonstrating the application of this method is shown below.
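A sketch of the dummy-variable check in Python; i is the (0-based) index of the suspected observation, and the function name is ours:

```python
import pandas as pd
import statsmodels.api as sm

def outlier_dummy_pvalue(X, y, i):
    """Refit with a dummy that equals 1 only for observation i and return
    the p-value of the t-test on its coefficient (small -> outlier)."""
    X2 = sm.add_constant(pd.DataFrame(X))
    X2["Dummy"] = 0
    X2.iloc[i, X2.columns.get_loc("Dummy")] = 1
    return sm.OLS(y, X2).fit().pvalues["Dummy"]
```

With several suspected points, test each at α/k, per the caution that follows.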

Caution: In interpreting the test result one should be careful with levels of significance. When multiple observations are suspected to be outliers, the tests would be valid if the suspect observations had been picked in advance. If the suspect observations were selected after looking at the data, however, the nominal significance level is not valid, because we have implicitly conducted more than one test, and a series of tests ends up with an overall alpha larger than desired. A very simple procedure to control the overall significance level when you plan to conduct k tests is to use a significance level of α/k for each one; a basic result in probability theory known as the Bonferroni inequality guarantees that the overall significance level will not exceed α.


b. Large leverage ("horizontal influence"): When a few data points appear at the end of the X data range (far from the mean), they influence the slopes and thus the whole linear regression. These points are not necessarily outliers, because their "vertical distance" from the equation might be small (remember, these points are influential, so the prediction equation adjusts itself to reduce the error they create). Influential points inflate the coefficient of determination (sometimes dramatically) and may produce the impression of an excellent fit. In other words, an influential point has the potential to change the regression equation. To determine whether a data point is influential we can run the regression twice, once with the point and once without it, and then compare the two equations. If the second equation is very different from the first, the point is influential.

The effects an influential point may cause are demonstrated in the next chart.

[Chart: the same data fitted twice, with and without the influential point; the two fitted lines differ visibly.]

The squared dots represent the regression equation when the influential point was included. The diamond dots represent the regression when the influential point was omitted.


There are several methods for checking the leverage of an observation. We present only one, which can be used to detect both an outlier and a large leverage.

The Cook's Distance Approach

Cook's distance measures the effect of deleting an observation on the regression model. For a suspected observation i the measure is calculated as follows:

Di = Σj (Ŷj - Ŷj(i))² / [(k + 1) · MSE],  where the sum runs over all n observations.

Definitions:
Ŷj = the predicted value of observation j obtained from the regression model run over the whole data set
Ŷj(i) = the predicted value of observation j obtained from the regression model run over the whole data set except observation i
MSE = the mean squared error
k = the number of independent variables

If Di > F.50(k+1, n-k-1), then observation i is considered influential and deserves a second examination.

In general, Di is computed for all the observations and the influential points are then identified. This work is omitted here because it is tedious (statistical software packages include it as a regular feature). Since our intention is to explain the concept of influence, and to mention some of the tools available for handling it, we focus on given suspicious points only in the examples provided later.
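For instance, statsmodels computes Di for every observation in one call; a sketch of the flagging rule stated above:

```python
from scipy import stats

def influential_points(res, level=0.50):
    """res: a fitted statsmodels OLS results object. Flags observations
    whose Cook's distance exceeds the F(.50) cutoff from the notes."""
    d = res.get_influence().cooks_distance[0]   # one D_i per observation
    k = int(res.df_model)                       # number of independent variables
    cutoff = stats.f.ppf(level, k + 1, int(res.df_resid))
    return [i for i, di in enumerate(d) if di > cutoff]
```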

Influential observations need to be investigated for possible data entry mistakes, or even for relevance to the problem. If it is determined that they should stay, transformations may be used, as indicated below.

Remedy for the influential point effect: use a transformation of the logarithmic type. This scatters the data better and also helps with the 'constant variance' requirement on the error variable (discussed later).

The following example demonstrates the above procedures.

Example 10

The Federal Trade Commission annually ranks varieties of domestic cigarettes according to their tar, nicotine, and carbon monoxide (CO) content. Past studies have shown that increases in the tar and nicotine contents are accompanied by an increase in the carbon monoxide emitted in the cigarette smoke. The file SMOKE HAZZARD contains data for 25 brands tested in a recent year. Suppose we want to model the level of CO as a function of Tar, Nicotine, and Weight as follows:

CO = β0 + β1Tar + β2Nicotine + β3Weight + ε

Produce the linear regression and identify outliers and influential observations.

Solution

The dummy variable approach: From the Excel output we do not identify any outlier. It is advisable to start the search for influential points by examining the different variables. Here are the graphs of the three independent variables vs. the dependent variable:

[Scatter plot: CO vs. Tar]

[Scatter plot: CO vs. Nicotine]


[Scatter plot: CO vs. Weight]

Let us use the dummy variable approach. In all three graphs we detect point 25 as unusual. We add one dummy variable that gets the value 1 only for this point. Here is an excerpt from the new data set:

     CO    TAR   NICOTINE  WEIGHT  Dummy
1    13.6  14.1  0.86      0.9853  0
2    16.6  16    1.06      1.0938  0
*    *     *     *         *       *
25   23.5  29.8  2.03      1.165   1

After running the data in Excel and testing H0: δ = 0 vs. H1: δ ≠ 0 for the Dummy coefficient, the p-value = .0019. See an excerpt from the Excel output:

           Coefficients  P-value
Intercept  -0.5516976    0.854569
TAR        0.88758035    0.000199
NICOTINE   0.51846956    0.874941
WEIGHT     2.07934422    0.520431
Dummy      -5.8731259    0.001991

This indicates that the dummy variable significantly changes the model; in other words, the model obtained when Dummy = 0 is significantly different from the model obtained when Dummy = 1. We can conclude that accommodating this point requires shifting the intercept (here by -5.873), in effect creating a new model. So one (lonely) point is making a lot of noise, which justifies considering it influential.

The Cook's distance measure

Let us run the regression with and without point 25 (model 1 and model 2, respectively).
Model 1: CO = 3.202 + .962Tar - 2.63Nicot - .13Weight
Model 2: CO = -.5517 + .887Tar + .518Nicot + 2.08Weight
Not scientific, but striking: the two models really look different!



From Excel we summarize the relevant results. (Note: model 2 was calculated omitting observation 25, so Ŷ(point 25, model 2) must be computed from the model 2 equation: -.5517 + .887(29.8) + .518(2.03) + …)

Point  Ŷ (model 1)  Ŷ (model 2)  (Ŷ(model 1) - Ŷ(model 2))²
1      14.382       14.458       (14.382 - 14.458)² = .005649
2      15.671       16.473       (15.671 - 16.473)² = .6349
…      …            …            …
25     26.392       29.373       (26.392 - 29.373)² = 8.883
                                 SUM = 17.50

MSE(model 1) = 2.09; k = 3. D25 = 17.50/[(3 + 1)(2.09)] = 2.09. Since F.50(3+1, 25-3-1) = .866, we determine that point 25 is influential (2.09 > .866).

Remedy for the influence problem: To take care of the influential point we assume it is relevant, so we would like to keep it but find a transformation that scatters the data better. The transformation we choose is the natural logarithm of all the variables, dependent and independent.

The transformed model: Ln(CO) = β0 + β1Ln(Tar) + β2Ln(Nicotine) + β3Ln(Weight) + ε
The equation obtained is: Ln(CO) = .0429 + .989Ln(Tar) - .2Ln(Nicotine) - .044Ln(Weight)

We can re-apply the dummy variable method to the logarithmic model. The t-test for Dummy results in a p-value of .22, so by this approach the point is no longer influential. Applying the Cook's distance approach we reach the same conclusion.

As in the example above, there are cases where we would like all the points, including influential ones, to stay in the model. We can attempt to keep these points while reducing their influence by transforming variables. Two such transformations are now presented; they work in cases where the data is not evenly spread along the data range.

Observe the following example:

[Scatter plot: most of the data is bunched at the low end of the X range, with one point (or a few) far to the right.]

Because of the unbalanced way the data is spread horizontally, one point on the right (and possibly more than one) might be influential. We can try to help this situation by spreading the data differently along the horizontal axis. More specifically, we 'pull' data points away from the origin more rapidly for small values of X than for larger

Page 33: mihaylofaculty.fullerton.edumihaylofaculty.fullerton.edu/sites/zgoldstein/560/Contents…  · Web viewRegression. In this set of notes ... as well as the requirements on the error

Such a change can be made using transformations of the exponential, logarithmic, or reciprocal form. One such transformation is based on the Pareto distribution.


The Pareto Transformation:

This distribution is described by F(X) = 1 − (X0/X)^α for all X ≥ X0. This is a concave, monotonically increasing function, rising from 0 at X = X0 toward 1.

Note how the rate at which F(X) changes is fast at the beginning (for small X values) and becomes more moderate as X increases. So if we substitute F(X) for X, we achieve the desired rightward "stretching" of the data points. This can be done for both X and Y if needed. To select X0 we can set X0 = min(Xi), where the Xi are the sample values of the variable X.

[Scatter plot: X runs from 0 to about 12,000 and Y from 0 to about 1,000; most points are condensed near the origin.]

In this example both X and Y are transformed using the Pareto transformation, because both suffer from the condensation problem. F(X) = 1 − (40/X)^.05 and G(Y) = 1 − (72/Y)^.05 were found to provide a good data spread (40 = min(X) and 72 = min(Y)).


Observe:

[Scatter plot titled "Pareto": the transformed values F′(X) vs. G′(Y), now spread much more evenly.]

The value α = .05 was selected by trial and error, as producing what seemed to be a "nice" spread. It is not necessary to use the same factor α for X and Y.
Comment: for the purposes of regression analysis, the transformation can omit the constant "1" of the Pareto CDF, because the linear regression mechanism takes care of it (the intercept absorbs the constant). So the two transformations actually used in our example were F′(X) = (40/X)^.05 and G′(Y) = (72/Y)^.05.
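A minimal sketch of this transformation in Python (the function name pareto_spread is ours; numpy assumed):

import numpy as np

def pareto_spread(x, alpha=0.05, x0=None):
    # Pareto-style transform: the full CDF is 1 - (x0/x)**alpha for x >= x0.
    # For regression the constant 1 is dropped (the intercept absorbs it),
    # so we return (x0/x)**alpha, i.e., F'(x) above.
    x = np.asarray(x, dtype=float)
    if x0 is None:
        x0 = x.min()          # the notes' choice: x0 = min(x_i)
    return (x0 / x) ** alpha

As in the example, try a few values of alpha and keep whichever produces the nicest spread.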

While the Pareto transformation can be used when the data are grouped near one end of the data range, the following transformation handles the situation where the data are grouped near the center of the range.

The Normal Transformation

The following plot demonstrates a data set that congregates in the mid-range.

[Scatter plot: points concentrated near the center of both axes, with a few points at each extreme.]

Again, we might be faced with influential points at the two ends of the data range. A more spread-out data set should prevent this and, in many cases, also stabilize the variance. We'll turn to the normal distribution for help in this situation. Its cumulative distribution function has the following shape:

[Figure: normal CDF F(X), an S-shaped curve rising from 0 toward 1.]


It turns out that if we use the transformation F(X) instead of X in the regression model, the X values move away from their mean (to the left and right) faster for values close to the mean than for distant values. To apply the method we can use the Excel function NORMDIST: for μ we calculate =AVERAGE(x), for σ we calculate =STDEV(x), and then for each value of x we substitute =NORMDIST(x, AVERAGE(x), STDEV(x), 1). A code sketch follows, and then a worked example.
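Here is a one-line Python equivalent of that Excel recipe (scipy assumed; normal_spread is our name):

import numpy as np
from scipy.stats import norm

def normal_spread(x):
    # Mirrors =NORMDIST(x, AVERAGE(x), STDEV(x), 1): the normal CDF spreads
    # mid-range values apart and compresses the tails toward 0 and 1.
    x = np.asarray(x, dtype=float)
    return norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))   # ddof=1 matches STDEV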

Example: a study was conducted to determine whether the assessed house price and the expected time on the market are good predictors of the price at which a house is sold. The study was conducted in neighborhoods populated by households of the same economic class, with houses of similar age and other shared features thought to be factors in determining house prices. See data in HOUSE PRICE (file is missing). Observing the variables' layout we realize there is a problem with the variable Assessed Price.

[Scatter plot "Assess-Price": selling price vs. assessed price, both roughly 145–215 ($000); the points cluster mid-range, with a few extremes (marked in red in the original).]

[Scatter plot "Months-Price": selling price ($000) vs. months on the market (0–18); again most points sit mid-range, with a few extremes (marked in red in the original).]

It seems there are influential points at the two edges of the assessed price range (especially for large price assessments). The data reside mainly in the center, both with respect to the assessed price and the price.


We'll try a NORMDIST transformation, but first let us check the significance of the linear model Price = β0 + β1Assessed + β2Months + ε.

R² = .69; Sig. F = 1.17×10⁻⁷. The model is significant and at least one variable can be considered a predictor. Both variables seem to be significant: p-value(Assessed) = .0111; p-value(Months) = .0015. However, because of the suspected influence of some observations, the model might look too good!

Influence analysis: we use the dummy variable approach to evaluate the influence of several points (the ones marked in red in the graphs above). We give the dummy variable the value "1" at these points (together) and "0" at all other points. The dummy's beta coefficient was tested and concluded to be "not zero" (p-value ≈ 0). This means the model is significantly different when these points are neutralized, so at least one of them must be influential. Comment: if a single shared dummy is sufficient to reveal the influence of a whole group of points, a dedicated dummy attached to a single point would be an even sharper tool.

Transformation: to improve the model, let us apply the normal transformation to "Assess", "Months", and "Price".


See an excerpt of this operation below.

Price     Assessed   Months   NORMPrice     NORMAssess    NORMMonths
169.00    166.28     16       0.13377719    0.12363367    0.95841806
198.00    210.00      2       0.84092744    0.99961445    0.04926426
192.80    184.00     11       0.73252644    0.75015070    0.69979613
194.50    179.74      3       0.77155725    0.59271919    0.07923011

           Price     Assessed   Months
Average    184.26    177.47     8.83
Stdev       13.763     9.672    4.136

For example, NORMPrice for 169.00 is =NORMDIST(169, 184.26, 13.763, 1), and NORMAssess for 166.28 is =NORMDIST(166.28, 177.47, 9.672, 1).

The resulting output is: SUMMARY OUTPUT

Regression Statistics

Multiple R 0.900236

R Square 0.810426

Adjusted R Square 0.796383

Standard Error 0.131165

Observations 30

ANOVA


df SS MS F Significance F

Regression 2 1.985796 0.992898 57.71213 1.78E-10

Residual 27 0.464517 0.017204

Total 29 2.450313

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept 0.35554 0.106594 3.335464 0.002487 0.136827 0.574252

NORMAssess 0.709415 0.129678 5.47059 8.63E-06 0.443338 0.975492

NORMonths -0.33333 0.104837 -3.17947 0.003684 -0.54843 -0.11822

The equation: NormPrice = .355 + .709·NormAssess − .333·NormMonths.

No point is influential (verified by running the dummy-variable check again on the transformed model). The model fit is improved, the model is significant, and all the variables are good predictors.

[Scatter plot "NormAssess-NormPrice": the transformed values on the unit square, spread evenly.]

[Scatter plot "NormMonths-NormPrice": the transformed values on the unit square, spread evenly.]

The improvement in the data spread is clear.


Using the model: predict the price of the following house: Assess = $175,000 (175 in $000); Months = 6.

NormPrice = .355 + .709·NormAssess − .333·NormMonths = .355 + .709[NORMDIST(175, 177.47, 9.672, 1)] − .333[NORMDIST(6, 8.83, 4.136, 1)] = K. Then Price = NORMINV(K, 184.26, 13.763).
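The same computation in Python (scipy's norm.cdf and norm.ppf stand in for NORMDIST and NORMINV; the means and standard deviations are the sample values from the table above):

from scipy.stats import norm

na = norm.cdf(175, 177.47, 9.672)     # NORMDIST(175, 177.47, 9.672, 1)
nm = norm.cdf(6, 8.83, 4.136)         # NORMDIST(6, 8.83, 4.136, 1)
k = 0.355 + 0.709 * na - 0.333 * nm   # predicted NormPrice
price = norm.ppf(k, 184.26, 13.763)   # back-transform to dollars ($000)
print(round(price, 2))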

Final comment: this model could be improved by adding quadratic terms in NormMonths and NormAssess. After doing so, it turns out that only the NormMonths quadratic term is significant and contributes information to the model. No point is found influential after this change, the model fit rises to adjusted r² = .849 (from .796), and the standard error decreases (which indicates the model's predictions will be more accurate).

Unsatisfied Assumptions
Several assumptions need to be met before a regression model can be used.

a. The normality of the error term ε. Violation of the normality requirement is usually considered mild, since the parameter estimates b0, b1, …, bk and their variances remain unbiased and consistent. The problem is with the t-tests used for the beta coefficients, because these tests assume normality; however, if the sample size is large (n ≥ 30) the tests should be fine. Methods to test the normality of a sample were discussed before and can be applied to the error term.
Remedy for non-normality: first, trim outliers (though this should be done with caution); second, try to increase the sample size; third, try normalizing transformations such as: (i) logarithmic (usually natural or base 10) on both X and Y; (ii) logarithmic on Y + a (usually natural or base 10); (iii) square root of Y + a (the constant "a" is chosen by the analyst so that Y + a ≥ 0); (iv) Y raised to some positive power.
Comment: worry about non-normality only if you have major outliers.

b. The heteroscedasticity of the error's variance. It is required that the variance be constant for all values of each Xi. A graph of the errors vs. ŷ and vs. each Xi should reveal a violation of this requirement; a violation occurs when the range of the errors changes with ŷ. Typically, the "fan-out" pattern is the most common sign that the variance is not constant. See


the graph below:

[Residual plot: residuals vs. predicted values (about 5–45); the spread of the residuals fans out from roughly ±3 to roughly ±15 as the predicted value grows.]

Heteroscedasticity is a serious matter when inference about the model is made: the F-test for the model's usefulness and the t-tests for the predictors' significance cannot be conducted, and the prediction confidence intervals are invalid when the error standard deviation is not constant (see several examples later). While graphing the residuals vs. the predicted value of Y may reveal heteroscedasticity, it might not be sufficient; some statistical testing can help determine whether this is a problem. Here are some such procedures.

The Breusch–Pagan (B–P) test detects heteroscedasticity in which the variance is linearly related to the X values. The procedure is simple and involves running two regressions: (1) the original linear model, and (2) a regression of the squared residuals from the first model on the fitted values (which are themselves a linear combination of the independent variables). Here are the details:

a. Run the regression for the original model Y = b0 + b1X1 + … + bkXk + e, and denote the residuals from this run ri (i = 1, 2, …, n).
b. Calculate ri² for each residual, and run a second regression for the model r² = C0 + C1ŷ + e.
c. The hypotheses tested are:
H0: σ² is constant for all Xi.
Ha: σ² is a linear function of the variables Xi.
The alternative hypothesis states that the error variance is linearly related to at least one of the variables. The test is performed by first running the original regression model to obtain the residuals (ei = Yi − ŷi; Excel labels these "Residuals"), then regressing the squared residuals on the predicted Y values. The test statistic is nR², where R² is the coefficient of determination of the residual regression; this statistic is approximately chi-square distributed. We therefore reject H0 (homoscedasticity) if nR² > χ²α with 1 degree of freedom.
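A sketch of the procedure (our helper, assuming statsmodels and scipy; statsmodels also ships a ready-made version, het_breuschpagan, in statsmodels.stats.diagnostic):

import statsmodels.api as sm
from scipy.stats import chi2

def breusch_pagan(X, y, alpha=0.05):
    # Step 1: fit the original model and keep the residuals.
    Xc = sm.add_constant(X)
    fit = sm.OLS(y, Xc).fit()
    # Step 2: regress the squared residuals on the fitted values.
    aux = sm.OLS(fit.resid ** 2, sm.add_constant(fit.fittedvalues)).fit()
    stat = len(fit.resid) * aux.rsquared             # n * R^2, ~ chi-square(1) under H0
    return stat, stat > chi2.ppf(1 - alpha, df=1)    # True -> reject homoscedasticity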


The White test is an extension of the B–P test above. We allow the error variance to be a complete second-order function of the independent variables, so we regress e² on both the predicted Y and the square of the predicted Y. The statistic is again nR², and the null hypothesis (homoscedasticity) is rejected if nR² > χ²α with 2 degrees of freedom.

The Szroeter test seems to work well in detecting non-constant variance without any specific assumption about how the variance relates to the independent variables.
1. Sort the data with respect to one variable X.
2. Run the linear regression on the sorted data and record the residuals ei.
3. Calculate Σei² and Σi·ei² = 1·e1² + 2·e2² + …, and then h = Σi·ei² / Σei².
4. Calculate the statistic Q = (h − (n + 1)/2)·√(6n/(n² − 1)).
5. Perform a Z test: reject H0 if Q > Zα.
If H0 is rejected we conclude that the variance is not homogeneous with respect to the variable X used for sorting.

Comment: repeat the test, re-sorting the data for each variable of interest. In this case set the significance level to α/k, where k is the number of independent variables tested (one at a time).
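Here is a sketch of the whole procedure (our helper, assuming numpy, scipy and statsmodels; pass alpha/k when looping over k predictors, per the comment above):

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def szroeter(X, y, sort_col=0, alpha=0.05):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(X[:, sort_col])                            # step 1: sort by one X
    e = sm.OLS(y[order], sm.add_constant(X[order])).fit().resid   # step 2: residuals
    n = len(e)
    h = np.sum(np.arange(1, n + 1) * e ** 2) / np.sum(e ** 2)     # step 3
    q = (h - (n + 1) / 2) * np.sqrt(6 * n / (n ** 2 - 1))         # step 4
    return q, q > norm.ppf(1 - alpha)    # step 5: True -> non-constant variance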

Remedy for heteroscedasticity: we can try several approaches: (1) express the data in relative terms (income per household, crime per capita, etc.); (2) transformations on both X and Y; (3) transformations on Y alone, e.g., use Ln(Y) or √Y instead of Y (only defined for Y > 0).

c. Multicollinearity. This phenomenon occurs when two or more independent variables are correlated (and thus not truly independent anymore). If the only purpose of the regression model is prediction, multicollinearity is not a problem, provided the model's r² is satisfactory and the F-test is significant. Otherwise it can be a severe problem. To mention only two issues: multicollinearity may inflate the variances of the beta estimates, which immediately affects the tests of these coefficients; and the interpretation of a beta coefficient as "the slope with respect to its variable" is no longer valid, because a change in one variable brings a change in its correlated counterpart, so one cannot hold everything else fixed while changing a single variable. There are several ways to attempt a remedy for multicollinearity; some are discussed below, but first we need to detect its presence.


d. Detecting multicollinearity in the regression model.
(a) Significant correlation between pairs of variables. We can run t-tests for the correlation coefficients as follows:
H0: ρ = 0; H1: ρ ≠ 0.
The rejection region is |t| > tα/2, n−2, where the statistic is t = r / √((1 − r²)/(n − 2)).
Alternatively, we can predetermine a critical correlation r0 beyond which the multicollinearity is considered serious, and test H0: −r0 ≤ ρ ≤ r0 against H1: |ρ| > r0. The t statistic is adjusted to t = (r − r0) / √((1 − (r − r0)²)/(n − 2)), and the test is run as before. (A code sketch of the basic test appears after item (e) below.)

(b) Non-significant ‘t-tests’ for all (or nearly all) the individual beta parameters while the F-test for the overall usefulness is significant.

(c) The signs of the beta coefficients are opposite to what is expected.
(d) The coefficient of determination r² of the complete model decreases when a suspected correlated variable is dropped, while the adjusted r² increases.

There is also a measure that evaluates the amount of correlation among all the predictors together. Such a measure is needed because the correlations in (a) reflect relationships between pairs of variables, yet three variables may be correlated while none of the pairs is. Read on:
(e) The Variance Inflationary Factor (VIF):

VIFj = 1 / (1 − Rj²)
VIFj measures the amount of error-variance increase due to correlation of Xj with the other variables, and it is calculated for every variable Xj. Rj² is the coefficient of determination from regressing the variable Xj on the rest of the variables (excluding Y). If there is no correlation between Xj and the rest of the variables, then Rj² = 0 and VIFj = 1; under "perfect correlation" Rj² = 1 and VIFj is infinite. It is customary to say that VIFj > 5 indicates strong variance inflation, and VIFj > 10 a severe variance inflation problem. Variance inflation makes it impossible to use the t-tests for the significance of the beta coefficients.


Example 10 revisited

The Federal Trade Commission annually ranks varieties of domestic cigarettes according to their tar, nicotine, and carbon monoxide (CO) content. Past studies have shown that increases in the tar and nicotine contents are accompanied by an increase in the carbon monoxide emitted in the cigarette smoke. The file SMOKE HAZZARD contains data for 25 brands tested in a recent year. Suppose we want to model the level of CO as a function of Tar, Nicotine, and Weight as follows: CO = β0 + β1Tar + β2Nicotine + β3Weight + ε. Produce the linear regression and study the multicollinearity situation.

Solution

From Excel we have the following output:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.958431
R Square            0.918589
Adjusted R Square   0.906959
Standard Error      1.445726
Observations        25

ANOVA

              df    SS         MS         F          Significance F
Regression     3    495.2578   165.0859   78.98383   1.33E-11
Residual      21     43.89259    2.090123
Total         24    539.1504

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     3.20219       3.461755         0.925019   0.365464   -3.99692    10.4013
TAR           0.962574      0.242244         3.973566   0.000692    0.458799    1.466349
NICOTINE     -2.63166       3.900557        -0.67469    0.507234  -10.7433      5.479992
WEIGHT       -0.13048       3.885342        -0.03358    0.973527   -8.21049     7.949529

Observations:

(2) Sig. F = 1.33×10⁻¹¹ ≈ 0 indicates the model is useful.
(3) Two of the three variables (Nicotine and Weight) are insignificant (their p-values are very large). From (2) and (3) together we conclude that either only one variable (Tar) is significant or multicollinearity is present.


(4) The signs of the beta coefficients of both Nicotine and Weight are negative. At least with respect to nicotine this is opposite to our expectation (and knowledge), which strengthens the suspicion that multicollinearity is a problem.

Finally, when running the correlation matrix we have:

            TAR         NICOTINE    WEIGHT
TAR         1
NICOTINE    0.9766076   1
WEIGHT      0.4907654   0.5001827   1

There is a large correlation between tar and nicotine, and moderate correlations between tar and weight and between nicotine and weight. Applying the t-tests to these three pairwise correlations gives p-values less than or around .01, which indicates the true correlations differ from zero in all three cases. In fact, for tar and nicotine we can test whether the true correlation is greater than (say) r0 = .80, and it is indeed!

Comment: a quick rule for judging that the true correlation between two variables probably differs from zero is |r| > 2/√n (for a sample of n ≥ 20) in a two-tailed test at the 5% significance level.

(5) By now we conclude that Tar is pair-correlated with the other two variables. It makes sense to try running the model without Tar and re-checking for problems; this will be done later. For now (just to demonstrate how to calculate a VIF) we determine VIF_Tar, which identifies whether there is a variance inflation problem for Tar. It is defined by VIF_Tar = 1/(1 − R²_Tar), where R²_Tar is the coefficient of determination obtained when regressing Tar on Nicotine and Weight (Tar is the dependent variable; the other two are the independent variables). From this run we get the following output:

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.976611

R Square            0.953769
Adjusted R Square   0.949567

Standard Error 1.272392

Observations 25

ANOVA

df SS MS F Significance F

Regression 2 734.816 367.408 226.9378 2.06E-15

Residual 22 35.61759 1.618981

Total 24 770.4336


Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept -1.64996 3.026335 -0.5452 0.5911 -7.9262 4.626271

NICOTINE 15.60377 0.847155 18.41903 7.39E-15 13.84687 17.36066

WEIGHT 0.196668 3.419256 0.057518 0.954652 -6.89443 7.28777

R²_Tar = .9537, so VIF_Tar = 1/(1 − R²_Tar) = 1/(1 − .9537) = 21.59 > 10. The other VIF values (for Nicotine and Weight) need to be calculated too, but we omit the process at this point. So far we know we should be careful in interpreting the coefficient of Tar if we keep Tar in the model.

Remedy for problems created by multicollinearity:
1. Drop one of the correlated independent variables from the model.
2. If you decide all the independent variables should stay in spite of the multicollinearity effect:
   i. Avoid making inferences about the individual beta parameters based on the t-tests.
   ii. Restrict inferences about the mean value of Y, and about Y itself, to Xi values that fall inside the ranges covered by the sample; in other words, don't construct confidence intervals for the mean of Y, or prediction intervals for Y, at Xi values outside the sample range.

Comment: to complete the analysis, the variable Tar is eliminated and the regression run again on the model CO = β0 + β1Nicotine + β2Weight + ε. Checking the results of this run we realize that Weight does not add information to a model that already includes Nicotine (p-value = .954). This variable can be dropped, and we are left with a single predictor, Nicotine. We then need to make sure the reduced model has no outliers or influential points and that the normality and homoscedasticity requirements are met.

Summary Examples

Example 11
Let us revisit problem 5, where we built a linear model with an interaction term to predict the MPG of a car based on its engine horsepower (HP) and weight, and asked whether the interaction term added information to the original model. We now turn back to the original model and analyze both independent variables (HP and Weight) as well as the error requirements. We run the original model MPG = β0 + β1HP + β2Weight + ε and perform residual analysis.

Outliers and influential observations
After running the regression we don't identify any outliers or influential observations in the graph of standardized residuals vs. predicted MPG:

[Plot: standardized residuals vs. predicted MPG (about 5–45); all residuals lie within ±3.]

One observation is unusual, but it is not an outlier, and no influential observation is found.

Normality
Below is the normal probability plot:

[Normal probability plot: MPG vs. sample percentile; the points fall close to a straight line.]

The line seems linear, so the data follow a normal distribution.

Heteroscedasticity
Observing the standardized residual graph shown above, the variance appears to remain about the same over most of the range.

We can now proceed with the analysis of the regression output:

ANOVA

df SS MS F Significance F

Regression 2 2451.974 1225.987 70.28128 7.51E-15

Residual 47 819.8681 17.444

Total 49 3271.842

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept 58.15708 2.658248 21.87797 2.76E-26 52.80938 63.50479

Horsepower -0.11753 0.032643 -3.60028 0.000763 -0.1832 -0.05186

Weight -0.00687 0.001401 -4.90349 1.16E-05 -0.00969 -0.00405


(i) The model is very useful, as verified by Significance F = 7.51×10⁻¹⁵, so at least one variable is linearly related to MPG. To check which variables contribute significant information, we turn to the individual variables.

(ii) The variables Horsepower and Weight may be correlated, and this is what we check first. From the correlation matrix, r = .7418, so multicollinearity between the two independent variables is present, and this might become a problem when using the t-tests. To verify whether it is, we need to calculate VIF_HP (or VIF_Weight; they equal one another because there are only two independent variables).

(iii) Calculating the VIF. To obtain R²_HP we regress HP (as the dependent variable) on Weight: R²_HP = .55, so VIF_HP = 1/(1 − R²_HP) = 1/(1 − .55) = 2.22. The variance inflation problem is therefore small, and we can use the t-tests to check the validity of the two independent variables. (With only two predictors there is a shortcut; see the snippet below.)
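With exactly two predictors, R²_HP is simply the squared pairwise correlation, so the VIF can be double-checked directly (a quick check, not part of the original output):

r = 0.7418               # sample correlation between Horsepower and Weight
vif = 1 / (1 - r ** 2)   # with two predictors, R^2_HP = r^2 = .550
print(round(vif, 2))     # 2.22, matching the auxiliary-regression value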

(iv) Since both p-values of the beta coefficient tests are extremely small (.00076 and 1.16×10⁻⁵), we can infer that each variable contributes new information in the presence of the other (in spite of the correlation between them).

(v) Now we need to specify the model we prefer to work with. Start with either variable alone in the model and test it. If it is significant, we already know from (iv) that adding the other variable improves the model. If it is insignificant, the other variable must be significant (since at least one of them is, by part (i)), and then the first should be added back as well (part (iv) again). Either way we end up with the original two variables in the model.
Comment: we already know that adding the interaction term to this original model is advisable.

Example 12 (this example was used before; the data were altered for instructional reasons)
Develop a model to predict the selling price of a house, based on assessed value and time on the market before the sale is made. Data on 30 single-family houses were recorded and saved in HOUSE.

(a) Perform residual analysis; eliminate outliers and influential observations.
(b) Build an appropriate model.
(c) Explain the results and perform a forecast as instructed later.

Solution
We run the following model: Price = β0 + β1Assessed Value + β2Time + ε

(a) The graphs of each predictor vs. price are generated first. We look for extreme values in terms of leverage and vertical deviation. These Charts are provided below:


[Scatter plot "Price - Ass. Value": price ($000) vs. assessed value (about 140–190, $000).]

[Scatter plot "Price-Time": price ($000) vs. time on the market (0–18 months).]

One point seems to stand out (Price = 161.90, Assessed Value = 148.90, Time = 4 months). Let us check its influence. We add a dummy variable to the input data set and run the model over the following data:

Price     Assessed Value   Time (months)   Dummy
194.10    178.17           10              0
  *         *                *             *
161.90    148.90            4              1
  *         *                *             *
195.90    179.07           12              0

When testing the significance of β3 in the model Price = β0 + β1Assessed Value + β2Time + β3Dummy + ε, we get p-value = .000229. The influence of this point is established and, as instructed, it will be omitted. For the remaining data we rerun the model and check whether we can identify any outlier: observation 4 has a standardized residual of 3.44 and is thus considered an outlier. This observation is eliminated too.

(b) The rest of the analysis proceeds without the influential points. We produce the graphs of the predictors vs. the standardized residuals, in order to check whether some transformations are needed.

[Plot: standardized residuals vs. Time (0–18); residuals within ±3.]


There is no pattern observed in this graph. The error seems to behave randomly as required.


[Plot: standardized residuals vs. Assessed Value (about 150–190); a curved pattern is visible.]

There is a curved pattern: the residuals are larger for small and for large assessed values, and drop somewhat for midrange assessed values. This pattern usually indicates that a quadratic term should be added to the equation, so we'll add Assessed² to the regression model. The new model is: Price = β0 + β1Assessed + β2Time + β3Assessed² + ε.

(c) We run the new model, and repeat the above analysis. No outlier is found, no influential observation. From observing the predictor graphs (as in part (b)) we have no evidence for any pattern:

[Plot: standardized residuals vs. Time for the new model; no pattern.]

[Plot: standardized residuals vs. Assessed Value for the new model; no pattern.]


(d) We check for heteroscedasticity by observing the new graph of predicted price vs. standardized residuals.

[Plot: standardized residuals vs. predicted price (about 150–220); no clear pattern.]

There is no clear pattern (fan-out, fan-in, etc.), so heteroscedasticity is not considered a problem.

(e) Multicollinearity between Assessed Value and Time does not seem to be a serious problem (although there is, as expected, high correlation between Assessed Value and Assessed Value squared): the correlation between Time and Assessed Value is only .135.

(f) The normality requirement is satisfied: testing shows insufficient evidence, at the 5% significance level, to infer that the error variable is not normal.

[Normal probability plot of the residuals: approximately a straight line.]

(g) We are ready to test the model for its validity. Let us look at the computer output:

Regression Statistics

Multiple R 0.979271

R Square 0.958972

Adjusted R Square 0.953844

Standard Error 2.294048

Observations 28

ANOVA

df SS MS F Significance F

Regression 3 2952.21 984.07 186.9911 8.99E-17


Residual 24 126.3037 5.262656

Total 27 3078.514

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept 681.4166 289.7387 2.351831 0.027224 83.42523 1279.408

Assessed Value -7.37017 3.321732 -2.21877 0.036212 -14.2259 -0.51445

Time 0.600651 0.106958 5.61574 8.83E-06 0.3799 0.821403

Assess^2 0.025842 0.009503 2.719245 0.011966 0.006228 0.045455

Adding the squared Assessed Value variable is justified at the 5% significance level, because it adds information to the model (p-value = .012).

The model's overall usefulness is very good (Significance F = 8.99×10⁻¹⁷), and it fits the data very well (R² = .959).

For every month a house stays on the market, its selling price increases by an average of .6(1000) = $600 (this data must have been collected before the real estate bubble started…).

Do not interpret the beta coefficients of the two Assessed Value terms individually (remember there are two terms in the equation related to the effect of the assessed value on the selling price). We can argue, though, that the rate at which the selling price changes with the assessed value depends on the assessed value level. Specifically, ∂(selling price)/∂(assessed value) ≈ −7.37 + 2(.0258)(assessed value), found by taking the derivative of the predicted selling price with respect to the assessed value. So the selling price increases at an increasing rate with the assessed value. For example, when the assessed value is 150, every $1,000 increase in the assessment changes the predicted selling price by about −7.37 + 2(.0258)(150) = .37 [$370]; when the assessed value is 350, every $1,000 increase changes it by about −7.37 + 2(.0258)(350) = 10.69 [$10,690].


Example 13
In this example we return to example 10 and try to take care of the influential-observation problem, only part of which was handled before. Along the way you will see additional possible transformations in action. For convenience, let us repeat the transformed equation obtained earlier: Ln(CO) = .0429 + .989Ln(Tar) − .2Ln(Nicotine) − .044Ln(Weight). With this transformed model we reduced the influence of an observation at the right-hand end of the range, but unfortunately created an influential point at the left-hand end. The graphs below describe the situation.

[Six panels: standardized residuals vs. each predictor (Nicotine, Weight, Tar), before the Ln transformation on the left and after it on the right.]


When calculating the leverage statistic for the two cases we find different influential points before and after the transformation. The reason is that the Ln transformation tends to shift values in an uneven manner, and it should not be used here: one small value of both Tar and Nicotine becomes influential once the Ln transformation is applied (the graphs above demonstrate the situation with Tar only). Observing how the Tar and Nicotine values are distributed along their ranges, we notice a few values at the two ends and many values near the center. We need to spread the values more evenly so that the variability of each variable increases; this will (hopefully) prevent the end points from being considered separate from the rest of the group. For this specific case such a transformation can be produced by the normal cumulative distribution function (whose S-shaped curve was sketched earlier).

Observe that the x values are transformed very little at the two ends but change rapidly near the center, which means the X values in the middle of the range spread out when transformed. Let us apply this normal transformation to the data set; to calculate the values P(X ≤ x) we use Excel's NORMDIST with μ = AVERAGE(x) and σ = STDEV(x). Here is an example of the transformed data:

Original data                             Transformed data
CO     TAR    NICOTINE   WEIGHT           CO     TAR          NICOTINE     WEIGHT
13.6   14.1   0.86       0.9853           13.6   0.63025201   0.48152758   0.9853

P(TAR < 14.1), with μ_TAR = 12.216 and σ_TAR = 5.566, is calculated by NORMDIST(14.1, 12.216, 5.566, 1) = .630. Such transformations are performed for TAR and NICOTINE. The regression results are:
CO = −2.92 + 13.277·NORMDIST(TAR, 12.216, 5.566, 1) + 2.897·NORMDIST(NICOTINE, .876, .354, 1) + 7.698·Weight.
No influential point is identified by the leverage statistic. More work needs to be done on this model because of multicollinearity (between TAR and NICOTINE); that work is omitted here.
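A sketch of this final pipeline (our helper; numpy, scipy and statsmodels assumed, with co, tar, nicotine, weight holding the 25 sample columns as arrays):

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def fit_normal_transformed(co, tar, nicotine, weight):
    # Transform the two condensed predictors with their own normal CDFs
    # (Excel's NORMDIST with the sample mean/sd), keep WEIGHT as is, refit.
    tar_n = norm.cdf(tar, np.mean(tar), np.std(tar, ddof=1))
    nic_n = norm.cdf(nicotine, np.mean(nicotine), np.std(nicotine, ddof=1))
    X = sm.add_constant(np.column_stack([tar_n, nic_n, weight]))
    return sm.OLS(co, X).fit()    # .params mirrors the equation above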
