
15 Building Regression Models Part2

Date post: 06-Jul-2018
Upload: rama-dulce
  • 8/17/2019 15 Building Regression Models Part2


    BUILDING REGRESSION MODELS – PART 2

    Topics Outline

    • Include/Exclude Decisions

    • Variable Selection Procedures 

    Example 1

    Explaining spending amounts at HyTex

    HyTex is a direct marketer of stereo equipment, personal computers, and other electronic products. HyTex advertises entirely by mailing catalogs to its customers, and all of its orders are taken over the telephone. The company spends a great deal of money on its catalog mailings, and it wants to be sure that this is paying off in sales.

    The file Catalog_Marketing.xlsx contains data on 1000 customers who purchased mail-order products from the HyTex Company in the current year. For each customer there are data on the following variables:

    Age – age of the customer at the end of the current year

    Gender = 1 for males, 0 for females

    OwnHome = 1 if customer owns a home, 0 otherwise

    Married = 1 if customer is currently married, 0 otherwise

    Close = 1 if customer lives reasonably close to a shopping area that sells similar merchandise, 0 otherwise

    Salary – combined annual salary of customer and spouse (if any)

    Children – number of children living with customer

    PrevCust = 1 if customer purchased from HyTex during the previous year, 0 otherwise

    PrevSpent – total amount of purchases made from HyTex during the previous year

    Catalogs – number of catalogs sent to the customer this year

    AmountSpent – total amount of purchases made from HyTex this year

    Develop a multiple regression model that is useful for explaining current year spending amounts at HyTex.

    Solution:

    With this much data, 1000 observations, it is possible to set aside part of the data set for validation.

    Although any split can be used, let's base the regression on the first 750 observations and use the other 250 for validation. Therefore, you should select only the range through row 751 when defining the StatTools data set.
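The split described above can be sketched in a few lines of Python. This is only an illustration: the `customers` list is a stand-in for the 1000 rows of Catalog_Marketing.xlsx, and the function name `split_train_validation` is hypothetical rather than anything provided by StatTools.

```python
def split_train_validation(rows, n_train=750):
    """Split rows in file order: the first n_train for fitting, the rest for validation."""
    return rows[:n_train], rows[n_train:]


# Stand-in for the 1000 customer records (real rows would come from the workbook).
customers = [{"Customer": i} for i in range(1, 1001)]

train, validation = split_train_validation(customers)
print(len(train), len(validation))  # 750 250
```

Keeping the rows in file order mirrors the choice made in the text; a random split would serve equally well if the file happened to be sorted in some meaningful way.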

    (a) Regression 1

    First run a multiple regression with all explanatory variables. The goal is then to exclude variables that aren't necessary, based on their t-values and P-values. Here is the multiple regression output.


    It indicates a fairly good fit. The r² value is 74.7% and s_e is about $491. Given that the actual amounts spent in the current year vary from a low of under $50 to a high of over $5500, with a median of about $950, a typical prediction error of around $491 is decent but not great.

    (b) Which variable(s) would you exclude from the regression equation?

    From the P-value column, you can see that there are four variables, Age, Gender, OwnHome, and Married, that have P-values well above 0.05. These are the obvious candidates for exclusion from the equation. You could rerun the equation with all four of these variables excluded, but it is a better practice to exclude one variable at a time. It is possible that when one of these variables is excluded, another one of them will become significant.
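The one-at-a-time exclusion rule can be sketched in plain Python. This is a sketch under stated assumptions, not the StatTools procedure: the data are synthetic, the helper names (`invert`, `t_statistics`, `backward_eliminate`) are hypothetical, and a cutoff of |t| < 1.96 stands in for "P-value above 0.05", which is a reasonable approximation only for large samples. Within a single fit all coefficients share the same degrees of freedom, so the variable with the smallest |t| is also the one with the largest P-value.

```python
import math
import random


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def t_statistics(columns, y):
    """Least-squares fit of y on the named columns plus an intercept; returns {name: t}."""
    names = list(columns)
    n = len(y)
    X = [[1.0] + [columns[name][i] for name in names] for i in range(n)]
    k = len(names) + 1
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    sse = sum((yi - sum(c * v for c, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))
    s2 = sse / (n - k)  # estimated error variance
    return {name: beta[j + 1] / math.sqrt(s2 * inv[j + 1][j + 1])
            for j, name in enumerate(names)}


def backward_eliminate(columns, y, t_cutoff=1.96):
    """Refit repeatedly, dropping the single weakest variable until all pass the cutoff."""
    kept = dict(columns)
    while kept:
        t = t_statistics(kept, y)
        weakest = min(t, key=lambda name: abs(t[name]))
        if abs(t[weakest]) >= t_cutoff:
            break
        del kept[weakest]
    return list(kept)


# Synthetic data: y depends strongly on "signal" and not at all on the two noise columns.
rng = random.Random(0)
columns = {
    "signal": [rng.uniform(0, 10) for _ in range(200)],
    "noise_a": [rng.gauss(0, 1) for _ in range(200)],
    "noise_b": [rng.gauss(0, 1) for _ in range(200)],
}
y = [5.0 * s + rng.gauss(0, 1) for s in columns["signal"]]

kept = backward_eliminate(columns, y)
print(kept)
```

Because the model is refit after every removal, a variable that looked weak in the full equation gets a chance to become significant once a correlated variable has left, which is the reason the text recommends removing one variable at a time.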

    (c) Rerun the regression after excluding the variables with the largest P-values one at a time.

    Regression 2

    The variable Married has the largest P-value. The result from rerunning the regression without this variable shows that Age, Gender, and OwnHome still have large P-values.

    Regression 3

    The variable with the largest remaining P-value, Age, is excluded.

    Regression 4

    The variable with the largest remaining P-value, OwnHome, is excluded.

    Regression 5

    The variable with the largest remaining P-value, Gender, is excluded. Here is the resulting output.


    The r² and s_e values of 74.6% and $491 are almost the same as they were with all variables included, and all of the P-values are very small.

    (d) Interpret the coefficients of the final regression equation.

    The coefficient of Close implies that an average customer living close to stores with this type of merchandise spent about $416 less than an average customer living far from such stores.

    The coefficient of Salary implies that, on average, about 1.8 cents of every extra salary dollar was spent on HyTex merchandise.

    The coefficient of Children implies that about $161 less was spent for every extra child living at home.

    The PrevCust and PrevSpent terms are somewhat more difficult to interpret. First, both of these terms are zero for customers who didn't purchase from HyTex in the previous year. For those who did, the terms become

    –544 + 0.27 PrevSpent

    The coefficient 0.27 implies that each extra dollar spent the previous year can be expected to contribute an extra 27 cents in the current year. The –544 literally means that if you compare a customer who didn't purchase from HyTex last year to another customer who purchased only a tiny amount, the latter is expected to spend about $544 less than the former this year. However, none of the latter customers were in the data set. A look at the data shows that of all customers who purchased from HyTex last year, almost all spent at least $100 and most spent considerably more. In fact, the median amount spent by these customers last year was about $900 (the median of all positive values for the PrevSpent variable). If you substitute this median value into the expression –544 + 0.27 PrevSpent, you obtain –298. Therefore, this “median” spender from last year can be expected to spend about $298 less this year than the previous year nonspender.
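The substitution above is easy to reproduce. The sketch below uses the rounded coefficients quoted in the text (–544 and 0.27) and the rounded median of $900, so it gives roughly –301 rather than the –298 reported; the small gap is presumably because the reported figure was computed from unrounded values, which is an assumption on our part. The function name `prev_customer_effect` and the break-even calculation are illustrative additions.

```python
def prev_customer_effect(prev_spent, intercept_shift=-544.0, slope=0.27):
    """Combined PrevCust + PrevSpent contribution for a customer who bought last year."""
    return intercept_shift + slope * prev_spent


median_effect = prev_customer_effect(900)

# PrevSpent level at which the combined term turns from negative to positive.
break_even = 544.0 / 0.27

print(round(median_effect), round(break_even))  # -301 2015
```

With these rounded numbers, a previous-year customer would need to have spent about $2,000 last year before the combined term predicts higher spending than for a customer who bought nothing last year.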

    The coefficient of Catalogs implies that each extra catalog can be expected to generate about $44 in extra spending.


    (e) Do forward, backward, and stepwise procedures produce the same regression equation for the amount spent in the current year?

    Each of these options is found in the StatTools Regression dialog box shown to the right. It is just a matter of choosing the appropriate option from the Regression Type dropdown list.

    In each, specify AmountSpent as the dependent variable and select all of the other variables (besides Customer) as potential independent variables. Once you choose one of the regression types, the dialog box changes, as shown below, to include a Parameters section and an “advanced” option to Include Detailed Step Information.

    It turns out that each regression procedure (stepwise, forward, and backward) produces the same final equation that we obtained previously, with all variables except Age, Gender, OwnHome, and Married included. This often happens, but not always.

    The stepwise and forward procedures add the variables in the order Salary, Catalogs, Close, Children, PrevCust, and PrevSpent.

    The backward procedure, which starts with all variables in the equation, eliminates variables in the order Married, Age, OwnHome, and Gender. A sample of the stepwise output appears below.


    The variables that enter or exit the equation are listed at the bottom of the output. The usual regression output for the final equation also appears. Again, however, this final equation's output is exactly the same as when multiple regression is used with these particular variables.

    Notes:

    1. If you validate this final regression equation on the other 250 customers, you will find r² and s_e values of 73.2% and $486. These are very promising. They are very close to the values based on the original 750 customers.

    2. We haven't tried all possibilities. We haven't tried nonlinear or interaction variables, nor have we looked at different coding schemes (such as treating Catalogs as a categorical variable and using dummy variables to represent it).

    3. We haven't checked the regression assumptions. In particular, it turns out that the condition for constant error variance is violated, as can be seen from the fan shape of the scatterplot of AmountSpent versus Salary:


    As usual, when you see a fan shape, where the variability increases from left to right in a scatterplot, you can try a logarithmic transformation. The reason this often works is that the logarithmic transformation squeezes the large values closer together and pulls the small values farther apart. The scatterplot of the log of AmountSpent versus Salary is shown below.

    Clearly, the fan shape is gone. However, the logarithmic transformation appears to have introduced some curvature into the plot. So, perhaps some other nonlinear transformations are worth exploring in this example.
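A small numeric check makes the "squeezing" effect of the logarithm concrete. The four dollar amounts below are illustrative values chosen to span roughly the range of spending mentioned earlier; they are not rows from the data set.

```python
import math

amounts = [50, 100, 2750, 5500]  # illustrative dollar amounts, not rows from the data set
logs = [math.log(a) for a in amounts]

gap_small_raw = amounts[1] - amounts[0]  # gap between the two small values, in dollars
gap_large_raw = amounts[3] - amounts[2]  # gap between the two large values, in dollars
gap_small_log = logs[1] - logs[0]        # same gap on the log scale
gap_large_log = logs[3] - logs[2]

print(gap_small_raw, gap_large_raw)                        # 50 2750
print(round(gap_small_log, 4), round(gap_large_log, 4))    # 0.6931 0.6931
```

Both pairs differ by a factor of two, so on the log scale they are the same distance apart even though the dollar gaps differ by a factor of 55. This is why variability that grows in proportion to the level tends to look roughly constant after taking logs.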


    Example 2

    Possible gender discrimination in salary at Fifth National Bank of Springfield

    The Fifth National Bank of Springfield is facing a gender discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. The bank's employee data are listed in the file Bank_Salaries.xlsx.

    Employee  EducLev  JobGrade  YrsExper  Age  Gender  YrsPrior  PCJob  Salary
    1         3        1         3         26   Male    1         No     $32,000
    2         1        1         14        38   Female  1         No     $39,100
    ...       ...      ...       ...       ...  ...     ...       ...    ...
    207       5        6         35        59   Male    0         No     $94,000
    208       5        6         33        62   Female  0         No     $30,000

    For each of the 208 employees, the data set includes the following variables:

    EducLev – education level, a categorical variable with categories 1 (finished high school), 2 (finished some college courses), 3 (obtained a bachelor's degree), 4 (took some graduate courses), 5 (obtained a graduate degree)

    JobGrade – a categorical variable indicating the current job level, the possible levels being 1 through 6

    YrsExper – years of experience with this bank

    Age – employee's current age

    Gender – a categorical variable with values “Female” and “Male”

    YrsPrior – number of years of work experience at another bank prior to working at Fifth National

    PCJob – a categorical yes/no variable depending on whether the employee's current job is computer-related

    Salary – current annual salary

    Do these data provide evidence that the bank discriminates against females in terms of salary?

    A formal hypothesis test to compare the average female salary to the average male salary could be run. Using this method, you can check that the average of all salaries is $39,922, the female average is $37,210, the male average is $45,505, and the difference between the male and female averages is statistically significant at any reasonable level of significance.

    In short, the females definitely earn less. But perhaps there is a reason for this. They might have lower education levels, they might have been hired more recently, and so on. The question is whether the difference between female and male salaries is still evident after taking these other attributes into account.

    Solution: 


    (a) Create dummy variables for the various categorical variables.

    Using Excel

    Create a dummy variable Female based on Gender in column J by entering the formula

    =IF(F2="Female",1,0)

    in cell J2 and copying it down. Note that females are coded as 1s and males as 0s.

    Create a dummy variable HasPCJob based on PCJob in column K by entering the formula

    =IF(H2="Yes",1,0)

    in cell K2 and copying it down.

    Using StatTools

    StatTools's Dummy procedure is somewhat easier, especially when there are multiple categories. Here are the steps to create five dummies for the education levels.

    Data Utilities → Dummy → Select EducLev to base the dummies on → Create One Dummy Variable for Each Distinct Category → OK → Yes

    This creates five dummy columns with variable names EducLev = 1 through EducLev = 5. Follow the same procedure to create six dummies, JobGrade = 1 through JobGrade = 6.
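For readers working outside Excel and StatTools, the same dummy coding can be sketched in Python. The helper names `binary_dummy` and `one_dummy_per_category` are hypothetical, and the four sample rows are the first two and last two employees from the listing above.

```python
def binary_dummy(values, one_label):
    """Return 1 where the value equals one_label and 0 otherwise (like =IF(F2="Female",1,0))."""
    return [1 if v == one_label else 0 for v in values]


def one_dummy_per_category(values, name):
    """Return one 0/1 column per distinct category, keyed as 'name = category'."""
    categories = sorted(set(values))
    return {f"{name} = {c}": [1 if v == c else 0 for v in values] for c in categories}


# Employees 1, 2, 207, and 208 from the listing.
gender = ["Male", "Female", "Male", "Female"]
educ_lev = [3, 1, 5, 5]

female = binary_dummy(gender, "Female")
educ_dummies = one_dummy_per_category(educ_lev, "EducLev")

print(female)                # [0, 1, 0, 1]
print(sorted(educ_dummies))  # ['EducLev = 1', 'EducLev = 3', 'EducLev = 5']
```

Only the categories that actually occur in the input get a column, which is why this four-row sample yields three education dummies rather than five. On the full data set all five levels would appear.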

    (b) Regression 1

    Estimate a regression equation with only one explanatory variable, Female, and interpret it. The output appears below.

    The resulting equation is

    Predicted Salary = 45505 – 8296 Female 

    To interpret regression equations with dummy variables, it is useful to rewrite the equation for each category.


    If you substitute Female = 1 into the estimated regression equation, you obtain

    Predicted Salary = 45505 – 8296(1) = 37209

    Because Female = 1 corresponds to females, this equation simply indicates the average female salary.

    Similarly, if you substitute Female = 0 into the estimated equation, you obtain

    Predicted Salary = 45505 – 8296(0) = 45505

    Because Female = 0 corresponds to males, this equation indicates the average male salary. Therefore, the interpretation of the –8296 coefficient of the Female dummy variable is straightforward. It is the average female salary relative to the reference (male) category. In short, females get paid $8296 less on average than males.
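The fact that a regression on a single dummy reproduces the two group averages can be verified directly. The sketch below uses five made-up salaries (they are not values from Bank_Salaries.xlsx) and the textbook least-squares formulas for one explanatory variable.

```python
from statistics import mean

# Hypothetical salaries for illustration only; not values from Bank_Salaries.xlsx.
salary = [40000.0, 50000.0, 46000.0, 36000.0, 38000.0]
female = [0, 0, 0, 1, 1]

# Simple least squares: slope = Sxy / Sxx, intercept = y_bar - slope * x_bar.
x_bar, y_bar = mean(female), mean(salary)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(female, salary))
         / sum((x - x_bar) ** 2 for x in female))
intercept = y_bar - slope * x_bar

male_mean = mean(s for s, f in zip(salary, female) if f == 0)
female_mean = mean(s for s, f in zip(salary, female) if f == 1)

print(round(intercept, 2), round(male_mean, 2))              # intercept equals the male average
print(round(slope, 2), round(female_mean - male_mean, 2))    # slope equals the difference in averages
```

The same identity explains the numbers in the text: the intercept 45505 is the male average, and 45505 – 8296 = 37209 matches the female average of $37,210 up to rounding.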

    (c) Regression 2

    Expand the regression equation by adding the experience variables YrsExper and YrsPrior.

    Here is the output with the Female dummy variable and these two experience variables.

    The corresponding regression equation is

    Predicted Salary = 35492 + 988 YrsExper  + 131 YrsPrior  – 8080 Female 

    It is again useful to write this equation in two forms: one for females (substituting Female = 1) and one for males (substituting Female = 0). After doing the arithmetic, they become

    Females: Predicted Salary = 27412 + 988 YrsExper + 131 YrsPrior

    Males: Predicted Salary = 35492 + 988 YrsExper + 131 YrsPrior

    Except for the intercept term, these equations are identical. You can now interpret the coefficient –8080 of the Female dummy variable as the average salary disadvantage for females relative to males after controlling for job experience.

    Gender discrimination still appears to be a very plausible conclusion.

    Note that the r² value is only 49.2%. Perhaps there is still more to the story.


    (d) Regression 3

    Add education level to the equation by including any four of the five education level dummies, for example by including EducLev = 2 through EducLev = 5. (Reminder: You should always use one fewer dummy than the number of categories for any categorical variable.)

    Here is the resulting output.

    The estimated regression equation is now

    Predicted Salary = 26613 + 1033 YrsExper  + 362 YrsPrior  – 4501 Female + 160 EducLev=2 + 4765 EducLev=3 + 7320 EducLev=4 + 11770 EducLev=5 

    Now there are two categorical variables involved, gender and education level.

    However, you can still write a separate equation for each combination of categories by setting the dummies to appropriate values. For example, the equation for females at education level 5 is found by setting Female and EducLev=5 equal to 1, and setting the other education dummies equal to 0. After combining terms, this equation is

    Predicted Salary = 33882 + 1033 YrsExper  + 362 YrsPrior  

    This equation can be interpreted as follows. For either gender and any education level, the expected increase in salary for one extra year of experience with Fifth National is $1033; the expected increase in salary for one extra year of prior experience with another bank is $362.

    The coefficients of the education dummies indicate the average increase in salary an employee can expect relative to the reference (lowest) education level.

    For example, an employee with education level 4 can expect to earn $7320 more than an employee with education level 1, all else being equal.

    The key coefficient, –$4501 for females, indicates the average salary disadvantage for females relative to males, given that they have the same experience levels and the same education levels.

    Note that the r² value is now 64.5%, quite a bit larger than when the education dummies were not included. We appear to be getting closer to the truth. In particular, you can see that there appears to be gender discrimination in salaries, even after accounting for job experience and education level.
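Writing a separate equation for each combination of categories is mechanical enough to automate. The sketch below folds the dummy terms of the Regression 3 equation into the intercept; the coefficients are the ones quoted in the text, while the function names `collapse_dummies` and `education_dummies` are hypothetical.

```python
REGRESSION_3 = {
    "Intercept": 26613, "YrsExper": 1033, "YrsPrior": 362, "Female": -4501,
    "EducLev=2": 160, "EducLev=3": 4765, "EducLev=4": 7320, "EducLev=5": 11770,
}


def collapse_dummies(coefs, dummy_values):
    """Fold the given dummy settings into the intercept; return (intercept, remaining slopes)."""
    intercept = coefs["Intercept"] + sum(
        coefs[name] * value for name, value in dummy_values.items()
    )
    slopes = {k: v for k, v in coefs.items() if k != "Intercept" and k not in dummy_values}
    return intercept, slopes


def education_dummies(level):
    """Dummy settings for EducLev=2..5; level 1 is the reference (all zeros)."""
    return {f"EducLev={k}": int(k == level) for k in range(2, 6)}


intercept, slopes = collapse_dummies(REGRESSION_3, {"Female": 1, **education_dummies(5)})
print(intercept, slopes)  # 33882 {'YrsExper': 1033, 'YrsPrior': 362}
```

The result matches the equation given in the text for females at education level 5. Calling the same function with `Female` set to 0 and the same education level changes only the intercept, by exactly 4501.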


    (e) Regression 4

    Add the remaining explanatory variables to the model: JobGrade=2 through JobGrade=6 (the lowest job grade is used as the reference category), Age, and HasPCJob. The regression output for this equation with all variables appears below.

    The effect of age appears to be minimal, and there appears to be a “bonus” of close to $5000 for having a PC-related job.

    The r² value has now increased to 76.5%, and the penalty for being a female has decreased to $2555, still large but not as large as before.

    As expected, the coefficients of the job grade dummies are all positive, and they increase as the job grade increases: it pays to be in the higher job grades. Thus, the regression indicates that being in lower job grades implies lower salaries, but it doesn't explain why females are in the lower job grades in the first place.

    (f) Regression 5

    If you rerun the regression using the numerical explanatory variable YrsExper and the dummy variable Female, you obtain the equation

    Predicted Salary = 35824 + 981 YrsExper – 8012 Female

    The r² value for this equation is 49.1%.

    It is certainly plausible that the effect of YrsExper on Salary is different for males than for females. So, it makes good sense to test for an interaction between the YrsExper and Female variables.


    (g) Regression 6

    If an interaction variable between YrsExper and Female is added to this equation, what is its effect?

    You first need to form an interaction variable that is the product of YrsExper and Female.

    Using Excel

    Use an Excel formula that multiplies the two variables involved.

    Using StatTools

    Data Utilities → Interaction → Interaction Between: Two Numeric Variables → Select YrsExper and Female → OK

    Now you can run the regression. The multiple regression output appears below.

    Notice that the r² value with the interaction variable has increased from 49.1% to 63.9%. The interaction variable has definitely added to the explanatory power of the equation. The estimated regression equation is

    Predicted Salary = 30430 + 1528 YrsExper + 4098 Female – 1248 Interaction(YrsExper,Female)

    The negative interaction here means that females tend to get lower raises for each extra year of experience than the males get. To unravel the meaning of this negative interaction, it is useful to write the above equation as two separate equations, one for females and one for males.

    The female equation (Female = 1, so that Interaction(YrsExper,Female) = YrsExper) is

    Predicted Salary = (30430 + 4098) + (1528 – 1248) YrsExper = 34528 + 280 YrsExper

    and the male equation (Female = 0, so that Interaction(YrsExper,Female) = 0) is

    Predicted Salary = 30430 + 1528 YrsExper

    Graphically, these equations appear in the following figure.


    The y-intercept for the female line is slightly higher (females with no experience with Fifth National tend to start out slightly higher than males), but the slope of the female line is much smaller. That is, males tend to move up the salary ladder much more quickly than females. This provides another argument, although a somewhat different one, for gender discrimination against females.
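The two lines can be derived from the Regression 6 coefficients in a few lines of Python, which also shows where they cross. The coefficients are those quoted in the text; the helper `line_for` and the crossover calculation are illustrative additions.

```python
# Coefficients of the Regression 6 equation as quoted in the text.
B0, B_EXPER, B_FEMALE, B_INTERACTION = 30430, 1528, 4098, -1248


def line_for(female):
    """Return (intercept, slope in YrsExper) after substituting Female = 0 or 1."""
    intercept = B0 + B_FEMALE * female
    slope = B_EXPER + B_INTERACTION * female
    return intercept, slope


female_line = line_for(1)
male_line = line_for(0)

# Years of experience at which the predicted male salary catches up with the female one.
crossover_years = (female_line[0] - male_line[0]) / (male_line[1] - female_line[1])

print(female_line, male_line, round(crossover_years, 2))  # (34528, 280) (30430, 1528) 3.28
```

Under this fitted equation, the higher female starting point is offset after a little more than three years of experience, and beyond that the predicted male salary is higher and the gap widens by $1248 per year.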

    Notes:

    1. Interaction variables can make a regression quite difficult to interpret, and they are certainly not always necessary. However, without them, the effect of each x on y is independent of the values of the other x's. If you believe, as in this example, that the effect of years of experience on salary is different for males than it is for females, the only way to capture this behavior is to include an interaction variable between years of experience and gender.

    2. The product of any two variables, a numerical and a dummy variable, two dummy variables, or even two numerical variables, can be used to create an interaction term. The easiest way to interpret the results correctly is the way we have been doing it, by writing several separate equations and seeing how they differ.

    (h) Suppose you include the variables YrsExper, Female, and HighJob in the equation for Salary, along with interactions between Female and YrsExper and between Female and HighJob. Here, HighJob is a new dummy variable that is 1 for job grades 4 to 6 and is 0 for job grades 1 to 3. (It can be calculated as the sum of the dummies JobGrade = 4 through JobGrade = 6.) The resulting equation is

    Predicted Salary = 28168 + 1261 YrsExper + 9242 HighJob + 6601 Female – 1224 Interaction(YrsExper,Female) + 1564 Interaction(Female,HighJob)

    and the r² value is now 76.6%.

    Interpret the regression coefficients.


    The interpretation of this equation is quite a challenge because it is really composed of four separate equations, one for each combination of Female and HighJob. For females in the high job category, the equation becomes

    Predicted Salary = (28168 + 9242 + 6601 + 1564) + (1261 - 1224) YrsExper  = 45575 + 37 YrsExper  

    and for females in the low job category it is

    Predicted Salary = (28168 + 6601) + (1261 - 1224) YrsExper  = 34769 + 37 YrsExper  

    Similarly, for males in the high job category, the equation becomes

    Predicted Salary = (28168 + 9242) + 1261 YrsExper  = 37410 + 1261 YrsExper  

    and for males in the low job category it is

    Predicted Salary = 28168 + 1261 YrsExper  

    Putting this into words, the various coefficients can be interpreted as follows.

    The intercept 28168 is the average starting salary (that is, with no experience at Fifth National) for males in the low job category.

    The coefficient 1261 of YrsExper is the expected increase in salary per extra year of experience for males (in either job category).

    The coefficient 9242 of HighJob is the expected salary premium for males starting in the high job category instead of the low job category.

    The coefficient 6601 of Female is the expected starting salary premium for females relative to males, given that they start in the low job category.

    The coefficient –1224 of Interaction(YrsExper,Female) is the penalty per extra year of experience for females relative to males; that is, male salaries increase this much more than female salaries each year.

    The coefficient 1564 of Interaction(Female,HighJob) is the extra premium (in addition to the male premium) for females starting in the high job category instead of the low job category.
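The four equations behind these interpretations can be generated by substituting each combination of Female and HighJob into the fitted equation. The coefficients below are the ones quoted in the text; the dictionary keys and the function name `equation_for` are hypothetical labels chosen for this sketch.

```python
COEF = {
    "Intercept": 28168, "YrsExper": 1261, "HighJob": 9242, "Female": 6601,
    "YrsExper*Female": -1224, "Female*HighJob": 1564,
}


def equation_for(female, high_job):
    """Return (intercept, slope in YrsExper) for one Female/HighJob combination."""
    intercept = (COEF["Intercept"]
                 + COEF["HighJob"] * high_job
                 + COEF["Female"] * female
                 + COEF["Female*HighJob"] * female * high_job)
    slope = COEF["YrsExper"] + COEF["YrsExper*Female"] * female
    return intercept, slope


for female in (1, 0):
    for high_job in (1, 0):
        group = ("females" if female else "males") + (", high job" if high_job else ", low job")
        print(group, equation_for(female, high_job))
# females, high job (45575, 37)
# females, low job (34769, 37)
# males, high job (37410, 1261)
# males, low job (28168, 1261)
```

Laying the four (intercept, slope) pairs side by side shows the pattern directly: job category shifts only the intercept, while gender changes both the intercept and the slope.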

    (i) Regression 7

    A glance at the distribution of salaries of the 208 employees shows some skewness to the right: a few employees make substantially more than the majority of employees. Therefore, it might make more sense to use the natural logarithm of Salary as the dependent variable, not Salary. Run a regression with Log(Salary) as the dependent variable and YrsExper and Female as explanatory variables. How can you interpret the results?

    Here are the results obtained after creating the Log(Salary) variable and running the regression.


    The estimated regression equation is

    Predicted Log(Salary) = 10.4907 + 0.0188 YrsExper  – 0.1616 Female 

    The r² and s_e values are 42.4% and 0.1794.

    When this same equation was estimated with Salary as the dependent variable, r² and s_e were 49.1% and 8,070. However, these measures are not directly comparable because when the logarithm of y is used in the regression equation the units of the dependent variable are completely different.

    The two r² values are percentages explained of different dependent variables, Log(Salary) and Salary. The fact that one is smaller than the other (42.4% versus 49.1%) does not necessarily mean that it corresponds to a worse fit. They simply are not comparable.

    Each s_e is a measure of a typical residual, but the residuals in the Log(Salary) equation are in log dollars, whereas the residuals in the Salary equation are in dollars. These units are of totally different magnitudes. For example, the log of $1000 is only 6.91. Therefore, it is no surprise that s_e for the Log(Salary) equation is much smaller than s_e for the Salary equation.

    If you want comparable standard error measures for the two equations, you should take antilogs (using the EXP function in Excel) of fitted values from the Log(Salary) equation to convert them back to dollars, subtract these from the original Salary values, and take the standard deviation of these “residuals.” You can check that the resulting standard deviation is 7,774. This is somewhat smaller than s_e = 8,080 from the Salary equation, an indication of a slightly better fit.
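The conversion back to dollars can be sketched as follows. The three salaries and fitted log values are made up for illustration, the function name `dollar_scale_error` is hypothetical, and the use of the sample standard deviation (`statistics.stdev`, dividing by n − 1) is an assumption, since the text says only "standard deviation".

```python
import math
from statistics import stdev


def dollar_scale_error(actual_salaries, fitted_log_salaries):
    """Antilog the fitted log values, subtract from actual salaries, return the SD of the gaps."""
    residuals = [a - math.exp(f) for a, f in zip(actual_salaries, fitted_log_salaries)]
    return stdev(residuals)


# Made-up numbers: the fitted values correspond to 90, 210, and 300 dollars.
actual = [100.0, 200.0, 300.0]
fitted_logs = [math.log(90.0), math.log(210.0), math.log(300.0)]

print(round(dollar_scale_error(actual, fitted_logs), 6))  # 10.0
```

Here the dollar residuals are 10, –10, and 0, so their sample standard deviation is 10. Applied to all 208 employees with the fitted values from the Log(Salary) equation, the same calculation would give the dollar-scale figure that the text compares with s_e from the Salary equation.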

    To interpret the regression equation itself, recall that when the dependent variable is log(y) and a term on the right-hand side of the equation is of the form bx, then whenever x increases by one unit, the predicted value of y changes by a constant percentage, and this percentage is approximately equal to b (written as a percentage). Thus, the regression coefficient for YrsExper means that for each extra year of experience with Fifth National, an employee's salary can be expected to increase by about 1.88%.

    To interpret the Female coefficient, note that the only possible increase in Female is one unit (from 0 for male to 1 for female). When this occurs, the expected percentage decrease in salary is approximately 16.16%. In other words, the regression equation implies that females can expect to make about 16% less than men for comparable years of experience.
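The "approximately equal to b" rule comes from the fact that a one-unit increase in x multiplies the predicted y by exp(b), and exp(b) − 1 is close to b when b is small. The sketch below compares the approximation with the exact percentage for the two coefficients in the text; the function names are hypothetical.

```python
import math


def approx_pct_change(b):
    """Rule-of-thumb percentage change in y per unit increase in x: 100 * b."""
    return 100.0 * b


def exact_pct_change(b):
    """Exact percentage change implied by a log(y) equation: 100 * (exp(b) - 1)."""
    return 100.0 * (math.exp(b) - 1.0)


for name, b in [("YrsExper", 0.0188), ("Female", -0.1616)]:
    print(f"{name}: approx {approx_pct_change(b):.2f}%, exact {exact_pct_change(b):.2f}%")
# YrsExper: approx 1.88%, exact 1.90%
# Female: approx -16.16%, exact -14.92%
```

For the small YrsExper coefficient the two figures nearly coincide. For the larger Female coefficient the approximation is rougher: the exact implied gap is closer to 15% than 16%, which does not change the qualitative conclusion.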


    (j) In Regression 6 we regressed Salary versus the Female dummy, YrsExper, and the interaction between Female and YrsExper, Interaction(YrsExper,Female). The output appears below.

    This group of three explanatory variables,

    Block1 = Female, YrsExper, Interaction(YrsExper,Female),

    already explains 63.9% of the variation in Salary. Does including the following groups of explanatory variables add anything significant to what we already have?

    Block2 = EducLev dummies, EducLev=2 to EducLev=5

    Block3 = JobGrade dummies, JobGrade=2 to JobGrade=6

    Block4 = interactions between the Female dummy and the education dummies, Interaction(Female,EducLev=2) to Interaction(Female,EducLev=5)

    This question can be answered by performing several partial F tests. With StatTools, this analysis can be done in one step.

    Select the Block option from the Regression Type dropdown list. The dialog box then changes, as shown in the figure to the right.

    Number of blocks: 4

    Check which variables are in which blocks.

    Check Salary as dependent variable.

    Specify 0.05 as the P-Value to enter, which in this case indicates how significant the block as a whole must be to enter for the partial F test.

    OK

    The regression calculations are done in stages. At each stage, the partial F test checks whether a block is significant. If it is, the variables in this block enter and the procedure goes to the next stage. If it is not, the procedure ends; neither this block nor any later blocks enter.
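The text does not spell out the partial F statistic, so the sketch below uses the standard textbook form based on the r² values of the smaller and larger equations. As a worked case it uses the only pair of r² values quoted in these notes for nested equations with the same dependent variable: 49.1% for YrsExper and Female alone, and 63.9% after adding the single interaction term. The function name `partial_f` is hypothetical, and because the r² values are rounded the result is approximate.

```python
def partial_f(r2_reduced, r2_full, n_obs, n_added, n_full_predictors):
    """Partial F statistic for adding n_added variables to a nested regression.

    F = [(R2_full - R2_reduced) / n_added] / [(1 - R2_full) / (n_obs - n_full_predictors - 1)]
    """
    numerator = (r2_full - r2_reduced) / n_added
    denominator = (1.0 - r2_full) / (n_obs - n_full_predictors - 1)
    return numerator / denominator


# 208 employees; the full equation has 3 explanatory variables, 1 of them newly added.
f_stat = partial_f(0.491, 0.639, n_obs=208, n_added=1, n_full_predictors=3)
print(round(f_stat, 1))  # 83.6
```

This value would be compared with an F distribution with 1 and 204 degrees of freedom, whose 5% critical value is a little under 4, so the added term is clearly significant. The block procedure applies the same test with `n_added` equal to the number of variables in the candidate block.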

    The output from this procedure appears below.


    The middle part of the output shows the final regression equation.

    The output in rows 34 through 37 indicates summary measures after successive blocks have entered.

    Note that the final block, the interactions between Female and the education dummies, is not in the final equation.

    This block did not pass the partial F test at the 5% level.

    (k) Run the block procedure a second time, changing the order of the blocks:

    Block2 = JobGrade dummies, JobGrade=2 to JobGrade=6

    Block3 = EducLev dummies, EducLev=2 to EducLev=5

    Block4 = interactions between the Female dummy and the education dummies, Interaction(Female,EducLev=2) to Interaction(Female,EducLev=5)

    The regression output appears to the right.

    Note that neither of the last two blocks enters the equation this time. Once the job grade dummies are in the equation, the terms including education are no longer needed.

    The implication is that the order of the blocks can make a difference.
