
Lecture 26

• Model Building (Chapters 20.2-20.3)

• HW6 due Wednesday, April 23rd by 5 p.m.

• Problem 3(d): Use JMP to calculate the prediction interval rather than by hand.

Curvature: Midterm Problem 10

[Figure: Bivariate Fit of MPG City By Weight(lb) — scatterplot of MPG City (15–40) vs. Weight(lb) (1500–4000) showing a curved relationship, with a residual plot vs. Weight(lb) whose residuals still show curvature.]

Remedy I: Transformations

• Use Tukey’s Bulging Rule to choose a transformation.

[Figure: Bivariate Fit of 1/MPG City By Weight(lb) — scatterplot of 1/MPG City (0.03–0.07) vs. Weight(lb) (1500–4000), now roughly linear, with a residual plot vs. Weight(lb) showing no remaining curvature.]
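To make the remedy concrete, here is a minimal Python sketch (hypothetical stand-in data; the lecture uses JMP) that fits 1/MPG City on Weight, the transformation Tukey's Bulging Rule suggests for this shape, and computes the residuals checked in the plot above.

```python
import numpy as np

# Hypothetical stand-in for the lecture's car data.
rng = np.random.default_rng(0)
weight = rng.uniform(1500, 4000, size=100)                      # Weight(lb)
mpg_city = 1.0 / (0.000015 * weight + rng.normal(0, 0.002, size=100))

# Remedy I: fit the transformed response 1/y on x.
slope, intercept = np.polyfit(weight, 1.0 / mpg_city, deg=1)
residuals = 1.0 / mpg_city - (intercept + slope * weight)
print(slope, intercept)   # residuals should now show no curvature
```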

Remedy II: Polynomial Models

• Multiple regression model: y = β0 + β1x1 + β2x2 + … + βpxp + ε
• Polynomial model with one predictor: y = β0 + β1x + β2x² + … + βpx^p + ε

Quadratic Regression

[Figure: Bivariate Fit of MPG City By Weight(lb) — scatterplot of MPG City (15–40) vs. Weight(lb) (1500–4000) with the fitted quadratic curve.]

Parameter Estimates
  Term                    Estimate    Std Error  t Ratio  Prob>|t|
  Intercept               40.166608   0.902231   44.52    <.0001
  Weight(lb)              -0.006894   0.00032    -21.52   <.0001
  (Weight(lb)-2809.5)^2   0.000003    4.634e-7   6.38     <.0001

[Residual plot vs. Weight(lb): the residuals now scatter randomly around zero.]
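A hedged sketch of the same fit outside JMP, using statsmodels. JMP centers the squared term ((Weight(lb) − 2809.5)² is weight minus its sample mean), so the sketch centers it too; the data below are synthetic stand-ins, so the numbers will only resemble, not match, the table above.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for the car data (the real file is not reproduced here).
rng = np.random.default_rng(0)
weight = rng.uniform(1500, 4000, size=100)
mpg_city = (40 - 0.007 * weight + 3e-6 * (weight - 2800) ** 2
            + rng.normal(0, 1.5, size=100))

# JMP's parameterization: linear term plus a *centered* squared term.
X = sm.add_constant(np.column_stack([weight, (weight - weight.mean()) ** 2]))
fit = sm.OLS(mpg_city, X).fit()
print(fit.params)    # compare: Intercept, Weight(lb), (Weight(lb)-2809.5)^2
print(fit.pvalues)   # compare with the Prob>|t| column above
```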

Polynomial Models with One Predictor Variable

• First order model (p = 1): y = β0 + β1x + ε
• Second order model (p = 2): y = β0 + β1x + β2x² + ε
  (β2 < 0: the parabola opens downward; β2 > 0: it opens upward)


• Third order model (p = 3): y = β0 + β1x + β2x² + β3x³ + ε
  (β3 < 0 and β3 > 0 give the two mirror-image cubic shapes)

Interaction

• Two independent variables x1 and x2 interact if the effect of x1 on y is influenced by the value of x2.
• Interaction can be brought into the multiple linear regression model by including the independent variable x1*x2.
• Example: Incomê = 1000 + 200·Educ + 100·IQ + 10·IQ·Educ

Interaction Cont.

• y = β0 + β1x1 + β2x2 + β3x1x2 + ε
• "Slope" for x1: E(y | x1+1, x2) − E(y | x1, x2) = β1 + β3x2
• Is the expected income increase from an extra year of education higher for people with IQ 100 or with IQ 130 (or is it the same)?

  Incomê = 1000 + 200·Educ + 100·IQ + 10·IQ·Educ
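A worked check of the question, using the coefficients reconstructed above (treat the numbers as illustrative):

```python
# From Income-hat above, the "slope" for Educ is b1 + b3*IQ = 200 + 10*IQ.
def educ_slope(iq):
    return 200 + 10 * iq

print(educ_slope(100))  # 1200 per extra year of education at IQ 100
print(educ_slope(130))  # 1500 at IQ 130 -- larger, so Educ and IQ interact
```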

Polynomial Models with Two Predictor Variables

• First order model: y = β0 + β1x1 + β2x2 + ε
  The effect of one predictor variable on y is independent of the effect of the other predictor variable on y.
  [Graph: parallel lines in x1, one for each value of x2]
  X2 = 1: y = [β0 + β2(1)] + β1x1
  X2 = 2: y = [β0 + β2(2)] + β1x1
  X2 = 3: y = [β0 + β2(3)] + β1x1

• First order model, two predictors, and interaction: y = β0 + β1x1 + β2x2 + β3x1x2 + ε
  The two variables interact to affect the value of y.
  [Graph: lines in x1 with different slopes for each value of x2]
  X2 = 1: y = [β0 + β2(1)] + [β1 + β3(1)]x1
  X2 = 2: y = [β0 + β2(2)] + [β1 + β3(2)]x1
  X2 = 3: y = [β0 + β2(3)] + [β1 + β3(3)]x1

• Second order model: y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + ε
  [Graph: one curve in x1 for each value of x2]
  X2 = 1: y = [β0 + β2(1) + β4(1²)] + β1x1 + β3x1² + ε
  X2 = 2: y = [β0 + β2(2) + β4(2²)] + β1x1 + β3x1² + ε
  X2 = 3: y = [β0 + β2(3) + β4(3²)] + β1x1 + β3x1² + ε

• Second order model with interaction: y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
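In software this model is one formula away; a sketch with the statsmodels formula interface (the DataFrame and its columns y, x1, x2 are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data; any DataFrame with columns y, x1, x2 works.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = (1 + 2 * df.x1 - df.x2 + 0.5 * df.x1**2
           + 0.3 * df.x2**2 + 0.8 * df.x1 * df.x2 + rng.normal(size=50))

# Second order model with interaction: all five terms of the slide's equation.
fit = smf.ols("y ~ x1 + x2 + I(x1**2) + I(x2**2) + x1:x2", data=df).fit()
print(fit.params)
```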

Selecting a Model

• Several models have been introduced.
• How do we select the right model?
• Selecting a model:
  – Use your knowledge of the problem (variables involved and the nature of the relationship between them) to select a model.
  – Test the model using statistical techniques.

Selecting a Model; Example

• Example 20.1: The location of a new restaurant
  – A fast food restaurant chain tries to identify new locations that are likely to be profitable.
  – The primary market for such restaurants is middle-income adults and their children (between the ages of 5 and 12).
  – Which regression model should be proposed to predict the profitability of new locations?
  – Quadratic relationships between Revenue and each predictor variable should be observed. Why?
    • Members of middle-class families are more likely to visit a fast food restaurant than members of poor or wealthy families.

[Sketch: Revenue vs. Income — an inverted U, peaking at middle incomes (Low, Middle, High).]

    • Families with very young or older kids will not visit the restaurant as frequently as families with mid-range ages of kids.

[Sketch: Revenue vs. Age — an inverted U, peaking at mid-range ages (Low, Middle, High).]

Selecting a Model; Example

• Solution
  – The dependent variable will be Gross Revenue.
  – The quadratic regression model built is

    Sales = β0 + β1INCOME + β2AGE + β3INCOME² + β4AGE² + β5(INCOME)(AGE) + ε

  – Include the interaction term when in doubt, and test its relevance later.

SALES = annual gross sales
INCOME = median annual household income in the neighborhood
AGE = mean age of children in the neighborhood

• Example 20.2
  – To verify the validity of the model proposed in Example 20.1 for recommending the location of a new fast food restaurant, 25 areas with fast food restaurants were randomly selected.
  – Each area included one of the firm's restaurants and three competing restaurants.
  – Data collected included (Xm20-02.jmp):
    • Previous year's annual gross sales.
    • Mean annual household income.
    • Mean age of children.

Selecting a Model; Example

Xm20-02 — collected data (Revenue, Income, Age) and added data (Income sq, Age sq, (Income)(Age)):

  Revenue  Income  Age   |  Income sq  Age sq  (Income)(Age)
  1128     23.5    10.5  |  552.25     110.25  246.75
  1005     17.6    7.2   |  309.76     51.84   126.72
  1212     26.3    7.6   |  691.69     57.76   199.88
  ...
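The "added data" columns are deterministic functions of the collected ones; a pandas sketch using the three rows shown above (the original Xm20-02.jmp file itself is a JMP file and is not reproduced here):

```python
import pandas as pd

df = pd.DataFrame({"Revenue": [1128, 1005, 1212],
                   "Income":  [23.5, 17.6, 26.3],
                   "Age":     [10.5, 7.2, 7.6]})
df["Income sq"] = df["Income"] ** 2             # 552.25, 309.76, 691.69
df["Age sq"] = df["Age"] ** 2                   # 110.25, 51.84, 57.76
df["(Income)(Age)"] = df["Income"] * df["Age"]  # 246.75, 126.72, 199.88
print(df)
```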

Selecting a Model; Example

Quadratic Relationships – Graphical Illustration

[Figure: Bivariate Fit of Revenue By Income — Revenue (800–1300) vs. Income (15–35), an inverted-U pattern.]

[Figure: Bivariate Fit of Revenue By Age — Revenue (800–1300) vs. Age (2.5–15.0), an inverted-U pattern.]

Model Validation

Summary of Fit
  RSquare                     0.906535
  RSquare Adj                 0.881939
  Root Mean Square Error      44.69533
  Mean of Response            1085.56
  Observations (or Sum Wgts)  25

Analysis of Variance
  Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
  Model      5  368140.38       73628.1      36.8569  <.0001
  Error     19  37955.78        1997.7
  C. Total  24  406096.16

Parameter Estimates
  Term           Estimate   Std Error  t Ratio  Prob>|t|
  Intercept      -1133.981  320.0193   -3.54    0.0022
  Income         173.20317  28.20399   6.14     <.0001
  Age            23.549963  32.23447   0.73     0.4739
  Income sq      -3.726129  0.542156   -6.87    <.0001
  Age sq         -3.868707  1.179054   -3.28    0.0039
  (Income)(Age)  1.9672682  0.944082   2.08     0.0509

20.3 Nominal Independent Variables

• In many real-life situations one or more independent variables are nominal.
• Including nominal variables in a regression analysis model is done via indicator (or dummy) variables.
• An indicator variable (I) can assume one out of two values, "zero" or "one".

Examples:
  I = 1 if data were collected before 1980; 0 if data were collected after 1980
  I = 1 if the temperature was below 50°; 0 if the temperature was 50° or more
  I = 1 if a degree earned is in Finance; 0 if a degree earned is not in Finance
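A minimal sketch of coding such an indicator, using the Finance-degree example (the Series contents are hypothetical):

```python
import pandas as pd

degree = pd.Series(["Finance", "Marketing", "Finance", "Accounting"])
I = (degree == "Finance").astype(int)  # 1 if the degree is in Finance, else 0
print(I.tolist())                      # [1, 0, 1, 0]
```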

Nominal Independent Variables; Example: Auction Car Price (II)

• Example 18.2 - revised (Xm18-02a)
  – Recall: A car dealer wants to predict the auction price of a car.
  – The dealer believes now that odometer reading and the car color are variables that affect a car's price.
  – Three color categories are considered:
    • White
    • Silver
    • Other colors
  Note: Color is a nominal variable.

• Example 18.2 - revised (Xm18-02b)

  I1 = 1 if the color is white; 0 if the color is not white
  I2 = 1 if the color is silver; 0 if the color is not silver

  The category "Other colors" is defined by: I1 = 0; I2 = 0.

Nominal Independent Variables; Example: Auction Car Price (II)

How Many Indicator Variables?

• Note: To represent the situation of three possible colors we need only two indicator variables.
• Conclusion: To represent a nominal variable with m possible categories, we must create m-1 indicator variables.
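The m-1 rule is exactly what pandas' drop_first option implements; a sketch for the three color categories:

```python
import pandas as pd

color = pd.Series(["White", "Silver", "Other", "White", "Other"])
# Categories sort as Other, Silver, White; drop_first drops "Other",
# leaving m-1 = 2 indicators with "Other" as the all-zeros baseline.
dummies = pd.get_dummies(color, drop_first=True).astype(int)
print(dummies)   # columns: Silver, White
```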

• Solution
  – The proposed model is

    y = β0 + β1(Odometer) + β2I1 + β3I2 + ε

  – The data:

    Price   Odometer  I-1  I-2
    14636   37388     1    0    (white car)
    14122   44758     1    0    (white car)
    14016   45833     0    0    (other color)
    15590   30862     0    0    (other color)
    15568   31705     0    1    (silver car)
    14718   34010     0    1    (silver car)
    ...

Nominal Independent Variables; Example: Auction Car Price; The Regression Equation

From JMP (Xm18-02b) we get the regression equation:
PRICE = 16701 - .0555(Odometer) + 90.48(I-1) + 295.48(I-2)

[Graph: Price vs. Odometer — three parallel fitted lines, one per color category]

The equation for an "other color" car (I-1 = 0, I-2 = 0):
  Price = 16701 - .0555(Odometer) + 90.48(0) + 295.48(0) = 16701 - .0555(Odometer)

The equation for a white color car (I-1 = 1, I-2 = 0):
  Price = 16701 - .0555(Odometer) + 90.48(1) + 295.48(0) = 16791.48 - .0555(Odometer)

The equation for a silver color car (I-1 = 0, I-2 = 1):
  Price = 16701 - .0555(Odometer) + 90.48(0) + 295.48(1) = 16996.48 - .0555(Odometer)
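A sketch of the same fit outside JMP, via statsmodels (the six rows shown earlier stand in for the full 100-car file, so the coefficients will differ from 16701, -.0555, 90.48, 295.48):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "Price":    [14636, 14122, 14016, 15590, 15568, 14718],
    "Odometer": [37388, 44758, 45833, 30862, 31705, 34010],
    "I1":       [1, 1, 0, 0, 0, 0],   # white
    "I2":       [0, 0, 0, 0, 1, 1],   # silver
})
fit = smf.ols("Price ~ Odometer + I1 + I2", data=df).fit()
print(fit.params)  # the full data file gives 16701, -.0555, 90.48, 295.48
```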

Example: Auction Car Price; The Regression Equation

From JMP we get the regression equation:
PRICE = 16701 - .0555(Odometer) + 90.48(I-1) + 295.48(I-2)

• For one additional mile the auction price decreases by 5.55 cents.
• A white car sells, on the average, for $90.48 more than a car of the "Other color" category.
• A silver color car sells, on the average, for $295.48 more than a car of the "Other color" category.

Example: Auction Car Price; The Regression Equation (Xm18-02b)

SUMMARY OUTPUT

Regression Statistics
  Multiple R         0.8355
  R Square           0.6980
  Adjusted R Square  0.6886
  Standard Error     284.5
  Observations       100

ANOVA
  Source      df  SS        MS       F      Significance F
  Regression   3  17966997  5988999  73.97  0.0000
  Residual    96  7772564   80964
  Total       99  25739561

  Term       Coefficients  Standard Error  t Stat  P-value
  Intercept  16701         184.3330576     90.60   0.0000
  Odometer   -0.0555       0.0047          -11.72  0.0000
  I-1        90.48         68.17           1.33    0.1876
  I-2        295.48        76.37           3.87    0.0002

• There is insufficient evidence to infer that a white color car and a car of "other color" sell for a different auction price (I-1: p = 0.1876).
• There is sufficient evidence to infer that a silver color car sells for a larger price than a car of the "other color" category (I-2: p = 0.0002).

Nominal Independent Variables; Example: MBA Program Admission (II)

• Recall: The Dean wanted to evaluate applications for the MBA program by predicting future performance of the applicants.
• The following three predictors were suggested:
  – Undergraduate GPA
  – GMAT score
  – Years of work experience
• It is now believed that the type of undergraduate degree should be included in the model.

Note: The undergraduate degree is nominal data.

Nominal Independent Variables; Example: MBA Program Admission (II)

  I1 = 1 if B.A.; 0 otherwise
  I2 = 1 if B.B.A.; 0 otherwise
  I3 = 1 if B.Sc. or B.Eng.; 0 otherwise

  The category "Other group" is defined by: I1 = 0; I2 = 0; I3 = 0.

MBA Program Admission (II)

Analysis of Variance
  Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
  Model      6  54.751842       9.12531      17.1554  <.0001
  Error     82  43.617378       0.53192
  C. Total  88  98.369220

Parameter Estimates
  Term       Estimate   Std Error  t Ratio  Prob>|t|
  Intercept  0.2886998  1.396475   0.21     0.8367
  UnderGPA   -0.006059  0.113968   -0.05    0.9577
  GMAT       0.0127928  0.001356   9.43     <.0001
  Work       0.0981817  0.030323   3.24     0.0017
  Degree[1]  -0.443872  0.146288   -3.03    0.0032
  Degree[2]  0.6068391  0.160425   3.78     0.0003
  Degree[3]  -0.064081  0.138484   -0.46    0.6448

20.4 Applications in Human Resources Management: Pay-Equity

• Pay-equity can be handled in two different forms:
  – Equal pay for equal work
  – Equal pay for work of equal value
• Regression analysis is extensively employed in cases of equal pay for equal work.

Human Resources Management: Pay-Equity

• Example 20.3 (Xm20-03)
  – Is there sex discrimination against female managers in a large firm?
  – A random sample of 100 managers was selected and data were collected as follows:
    • Annual salary
    • Years of education
    • Years of experience
    • Gender

• Solution
  – Construct the following multiple regression model:

    y = β0 + β1Education + β2Experience + β3Gender + ε

  – Note the nature of the variables:
    • Education – Interval
    • Experience – Interval
    • Gender – Nominal (Gender = 1 if male; 0 otherwise)

Human Resources Management: Pay-Equity

• Solution – Continued (Xm20-03)

Human Resources Management: Pay-Equity

SUMMARY OUTPUT

Regression Statistics
  Multiple R         0.8326
  R Square           0.6932
  Adjusted R Square  0.6836
  Standard Error     16274
  Observations       100

ANOVA
  Source      df  SS           MS           F      Significance F
  Regression   3  57434095083  19144698361  72.29  0.0000
  Residual    96  25424794888  264841613.4
  Total       99  82858889971

  Term        Coefficients  Standard Error  t Stat  P-value
  Intercept   -5835.1       16082.8         -0.36   0.7175
  Education   2118.9        1018.5          2.08    0.0401
  Experience  4099.3        317.2           12.92   0.0000
  Gender      1851.0        3703.1          0.50    0.6183

Analysis and Interpretation
• The model fits the data quite well.
• The model is very useful.
• Experience is a variable strongly related to salary.
• There is no evidence of sex discrimination.


Analysis and Interpretation
• Further studying the data we find:
  – Average experience (years) for women is 12; average experience (years) for men is 17.
  – Average salary for a female manager is $76,189; average salary for a male manager is $97,832.
• The salary gap is thus consistent with the experience gap rather than with gender itself, in line with the non-significant Gender coefficient (p = 0.6183).

20.5 Stepwise Regression

• Purposes of stepwise regression:
  – Find strong predictors (stepwise forward)
  – Eliminate weak predictors (stepwise backward)
  – Prevent highly collinear groups of predictors from collectively entering the model (they degrade the p-values)

• The workings of stepwise regression (a minimal sketch of the forward rule follows below):
  – Predictors are entered/removed one at a time.
  – Stepwise forward: given a current model, enter the predictor that increases R² the most, provided its p-value < 0.25.
  – Stepwise backward: given a current model, remove the predictor that decreases R² the least, provided its p-value > 0.10.
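A minimal Python sketch of the forward rule just described (the function name and threshold default are my own; JMP's implementation differs in detail):

```python
import statsmodels.api as sm

def forward_stepwise(X, y, p_enter=0.25):
    """X: DataFrame of candidate predictors; y: response.
    Greedily adds the predictor that raises R^2 the most,
    as long as its p-value is below p_enter."""
    chosen, remaining = [], list(X.columns)
    while remaining:
        best = None
        for name in remaining:
            fit = sm.OLS(y, sm.add_constant(X[chosen + [name]])).fit()
            if fit.pvalues[name] < p_enter and (best is None or fit.rsquared > best[1]):
                best = (name, fit.rsquared)
        if best is None:          # no candidate passes the entry threshold
            break
        chosen.append(best[0])
        remaining.remove(best[0])
    return chosen
```

Usage would look like forward_stepwise(df[["Income", "Age", "Income sq"]], df["Revenue"]); the backward rule is the mirror image, dropping the weakest predictor while its p-value exceeds 0.10.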

Stepwise Regression in JMP

• "Analyze" → "Fit Model"
• response → Y, predictors → "add"
• pull-down menu top right: "Standard Least Squares" → "Stepwise"; "Run Model"
• Stepwise Fit window: updates automagically
  – manual stepwise: check boxes in "Entered" column to enter and remove predictors
  – stepwise forward/backward: "Step" to enter/remove one predictor, "Go" for automatic sequential selection
  – "Direction" pull-down for "forward" (default), "backward", "mixed" selection strategies

Comments on Stepwise Regression

• Stepwise regression might not find the best model; you might find better models with a manual search, where better means fewer predictors and larger R².
• Forward search stops when there is no predictor with p-value < 0.25 (can be changed in JMP).
• Backward search stops when there is no predictor with p-value > 0.10 (can be changed in JMP).
• Often one wants to search only models with certain predictors included. Use the "Lock" column in JMP.

Practice Problems

• 20.6, 20.8, 20.22, 20.24

