
1 Simple Linear Regression Chapter 16. 2 Introduction In this chapter we examine the relationship...

Date post: 21-Dec-2015
Page 1: 1 Simple Linear Regression Chapter 16. 2 Introduction In this chapter we examine the relationship among interval variables via a mathematical equation.

Simple Linear Regression

Chapter 16


Introduction

• In this chapter we examine the relationship among interval variables via a mathematical equation.

• The motivation for using the technique:
  – Forecast the value of a dependent variable (y) from the value of independent variables (x1, x2, …, xk).
  – Analyze the specific relationships between the independent variables and the dependent variable.


16.1 Simple Linear Regression Model

The model has a deterministic component and a probabilistic component.


The Deterministic Part of the Model

• Most lots sell for $25,000.
• Building a house costs about $75 per square foot.
• House cost = 25000 + 75(Size)

[Figure: house cost plotted against house size; the line starts at the lot price of $25,000.]


The Model

House cost = 25000 + 75(Size)

However, house cost may vary even among houses of the same size!


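The Model

Since cost behaves unpredictably, we add a random component \( \varepsilon \):

House cost = 25000 + 75(Size) + \( \varepsilon \)

[Figure: house cost plotted against house size; points scatter around the line, which starts at the lot price of $25,000.]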


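The Model

• The simple first-order (linear) regression model is:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

  y = dependent variable
  x = independent variable
  \( \beta_0 \) = y-intercept
  \( \beta_1 \) = slope of the line = Rise/Run
  \( \varepsilon \) = error variable

[Figure: a straight line on x-y axes with intercept \( \beta_0 \) and slope \( \beta_1 \) = Rise/Run.]

\( \beta_0 \) and \( \beta_1 \) are unknown population parameters, and are therefore estimated from the data.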
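To make the two components concrete, the following minimal sketch simulates house costs from the deterministic part 25000 + 75(Size) plus a normally distributed error. The error standard deviation of 5000, the seed, the house sizes, and the function name `simulate_house_costs` are assumptions chosen for illustration only; they do not come from the slides.

```python
import random

def simulate_house_costs(sizes, beta0=25000.0, beta1=75.0, sigma=5000.0, seed=0):
    """Return (deterministic, observed) cost lists for the given house sizes.

    sigma and seed are illustrative assumptions, not values from the slides.
    """
    rng = random.Random(seed)
    # Deterministic component: beta0 + beta1 * x
    deterministic = [beta0 + beta1 * s for s in sizes]
    # Probabilistic component: add a normal error with mean zero
    observed = [d + rng.gauss(0.0, sigma) for d in deterministic]
    return deterministic, observed

sizes = [1500, 1500, 2000, 2000, 2500]
deterministic, observed = simulate_house_costs(sizes)
for s, d, o in zip(sizes, deterministic, observed):
    print(f"size={s}  deterministic={d:,.0f}  observed={o:,.0f}")
```

Houses of the same size share one deterministic cost but receive different observed costs, which is the behavior the error variable is meant to capture.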


16.2 The Least Squares Method

• The estimates are determined by
  – drawing a sample from the population of interest,
  – calculating sample statistics,
  – producing a straight line that cuts through the data.

Question: What is the best line to describe the specific linear relationship?

[Figure: scatter diagram of y against x.]


The Least Squares (Regression) Line

A good line is one that minimizes the sum of squared errors.

[Figure: for a given X, the error is the vertical distance between the actual value of Y and the value of Y given by the equation.]


The Least Squares (Regression) Line

Here is a short comparison of two possible lines drawn over four data points, (1, 2), (2, 4), (3, 1.5) and (4, 3.2):
  1. Horizontal line (y = 2.5)
  2. Positively increasing line
Which line is better?


The Least Squares (Regression) Line

[Figure: the four data points (1, 2), (2, 4), (3, 1.5), (4, 3.2) with the increasing line and the horizontal line y = 2.5.]

Increasing line:
\[ \text{Sum of squared differences} = (2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = 7.89 \]

Horizontal line:
\[ \text{Sum of squared differences} = (2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = 3.99 \]

The smaller the sum of squared differences, the better the fit of the line to the data.
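The comparison of the two candidate lines can be checked with a few lines of code. This is a small sketch; the increasing line is taken to be y = x, which is inferred from the terms (2 - 1), (4 - 2), (1.5 - 3), (3.2 - 4) in the slide, and the helper name is illustrative.

```python
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_squared_differences(points, line):
    """Sum of squared vertical differences between the points and a line y = line(x)."""
    return sum((y - line(x)) ** 2 for x, y in points)

# Candidate 1: the increasing line, assumed here to be y = x
ssd_increasing = sum_squared_differences(points, lambda x: x)
# Candidate 2: the horizontal line y = 2.5
ssd_horizontal = sum_squared_differences(points, lambda x: 2.5)

print(round(ssd_increasing, 2), round(ssd_horizontal, 2))
```

The horizontal line has the smaller sum, so of these two candidates it fits the four points better.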


The Estimated Coefficients

To calculate the estimates of \( \beta_0 \) and \( \beta_1 \) that minimize the differences between the data points and the line, use the formulas shown below (alternative formulas are suggested later):

\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]

The regression equation that estimates the equation of the first-order linear model is:

\[ \hat{y} = b_0 + b_1 x \]
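As an illustration of these formulas, the sketch below applies them to the four data points used earlier in the least squares comparison. The function name `least_squares` is an illustrative choice; the arithmetic follows the formulas for b1 and b0 directly.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the b1 formula
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
b0, b1 = least_squares(xs, ys)
# Sum of squared differences for the fitted line
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(f"b0={b0:.3f}  b1={b1:.3f}  SSE={sse:.3f}")
```

The fitted line has a smaller sum of squared differences than the horizontal line y = 2.5, as expected of the least squares line.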


The Simple Linear Regression Line

• Example 2
  – A car dealer wants to find the relationship between the odometer reading and the selling price of 3-year-old Tauruses.
  – A random sample of 100 cars is selected, and the data recorded.
  – Find the regression line.

Car   Price ($1000s)   Odometer (1000s of miles)
1     14.6             37.4
2     14.1             44.8
3     14.0             45.8
4     15.6             30.9
5     15.6             31.7
6     14.7             34.0
...   ...              ...

Price is the dependent variable y; Odometer is the independent variable x.


The Simple Linear Regression Line

• To make it easier to use Excel results, we'll use another version of the formula for b1:

\[ b_1 = \frac{\operatorname{cov}(x, y)}{s_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]

cov(x, y), the covariance of x and y, is a measure of the common 'movement' of x and y (do both generally increase together, or do they move in opposite directions?). cov(x, y) is also denoted by \( s_{xy} \).


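The Simple Linear Regression Line

• Solution
  – Solving by hand: calculate the statistics (see data in Car Price), where n = 100.

\[ \bar{x} = 36.011; \qquad \bar{y} = 14.841; \qquad s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 43.509 \]

\[ \operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = -2.909 \]

Excel provides a population covariance of -2.879. The sample covariance is -2.879 · n/(n - 1) = -2.879(100/99) = -2.909.

\[ b_1 = \frac{\operatorname{cov}(X, Y)}{s_x^2} = \frac{-2.909}{43.509} = -.0669 \]

\[ b_0 = \bar{y} - b_1 \bar{x} = 14.841 - (-.0669)(36.011) \approx 17.248 \]

\[ \hat{y} = b_0 + b_1 x = 17.248 - .0669x \]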
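The hand calculation can be reproduced from the summary statistics alone. The sketch below uses only the values quoted in this example (n = 100, population covariance -2.879, means 36.011 and 14.841, and sample variance 43.509); the variable names are illustrative. Small differences in the last digit relative to the Excel output are expected because the inputs are rounded.

```python
n = 100
pop_cov = -2.879                 # population covariance reported by Excel
x_bar, y_bar = 36.011, 14.841    # sample means of odometer and price
s2_x = 43.509                    # sample variance of the odometer readings

# Convert the population covariance to the sample covariance
sample_cov = pop_cov * n / (n - 1)

# Covariance form of the least squares estimates
b1 = sample_cov / s2_x
b0 = y_bar - b1 * x_bar

print(f"cov={sample_cov:.3f}  b1={b1:.4f}  b0={b0:.3f}")
```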


The Simple Linear Regression Line

• Solution – continued
  – Using the computer (see data in Car Price):
    Tools > Data Analysis > Regression > [Shade the y range and the x range] > OK


The Simple Linear Regression Line

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.805168
R Square            0.648295
Adjusted R Square   0.644707
Standard Error      0.326489
Observations        100

ANOVA
             df   SS          MS          F             Significance F
Regression   1    19.255607   19.255607   180.6429887   5.75078E-24
Residual     98   10.446293   0.1065948
Total        99   29.7019

            Coefficients   Standard Error   t Stat      P-value       Lower 95%      Upper 95%
Intercept   17.24873       0.1820926        94.725045   3.57186E-98   16.88737056    17.61008
Odometer    -0.066861      0.0049746        -13.44035   5.75078E-24   -0.076732894   -0.056989

\[ \hat{y} = 17{,}248 - .0669x \]

where y is the selling price and x is the odometer reading.


Interpreting the Linear Regression Equation

[Figure: Odometer Line Fit Plot, Price against Odometer, with the fitted line \( \hat{y} = 17{,}248 - .0669x \); there are no data near an odometer reading of 0.]

• The intercept is b0 = $17,248.
  – The regression equation describes the linear relationship within the range covered by the sample only.
  – Thus, do not interpret the intercept as the "price of cars that have not been driven".

• The slope of the line is b1 = -.0669. For each additional mile on the odometer, the price decreases by an average of $0.0669.


Interpreting the Linear Regression Equation

• Remember: the regression equation pertains to the sample only!

• To generalize the results to the population, we next apply statistical inference techniques.


16.3 Model Assumptions

• The error \( \varepsilon \) is a critical part of the regression model.
• Four requirements involving \( \varepsilon \) must be satisfied.
  – The probability distribution of \( \varepsilon \) is normal.
  – The mean of \( \varepsilon \) is zero: \( E(\varepsilon) = 0 \).
  – The standard deviation of \( \varepsilon \) is \( \sigma_\varepsilon \) for all values of x.
  – The errors associated with different values of y are all independent.


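The Normality of \( \varepsilon \)

Recall: \( y = \beta_0 + \beta_1 x + \varepsilon \). Since \( \beta_0 + \beta_1 x \) is deterministic and \( \varepsilon \) is normally distributed, y is also normally distributed.

The mean value of y for a given value of x is \( E(y|x) = \beta_0 + \beta_1 x + E(\varepsilon) = \beta_0 + \beta_1 x \), since \( E(\varepsilon) = 0 \).

The standard deviation of y is \( \sigma_\varepsilon \) for all values of x. The standard deviation remains constant, but the mean value changes with x.

[Figure: normal curves of equal spread centered at \( E(y|x_1) = \beta_0 + \beta_1 x_1 \), \( E(y|x_2) = \beta_0 + \beta_1 x_2 \) and \( E(y|x_3) = \beta_0 + \beta_1 x_3 \).]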


The Independence of the Errors

Here is a case where linear regression is not the right model to apply.

[Figure: scatter of points following a curved pattern around a straight line.]

Notice that for small values of y and for large values of y the errors are mostly negative, while for midrange values of y the errors are positive. The errors are not independent. Consequently, linear regression is not the correct model to work with here. One can also question the assumption that the mean error is zero.


16.4 Assessing the Model

• The least squares method produces a regression line whether or not there is a linear relationship between x and y.

• Consequently, it is important to assess how well the linear model fits the data (how strong the linear relationship is).

• Several methods are used to assess the model. All are based on the sum of squares for errors, SSE.


Sum of Squares for Errors

• SSE is the sum of squared vertical differences between the points and the regression line. It is the function we minimized when constructing the regression equation.

• It can serve as a measure of how well the line fits the data. SSE is defined by

\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

• A shortcut formula:

\[ SSE = (n - 1)\left[ s_y^2 - \frac{[\operatorname{cov}(X, Y)]^2}{s_x^2} \right] \]

[Figure: a data point \( (x_i, y_i) \) and the corresponding value \( \hat{y}_i \) on the regression line.]


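Standard Error of Estimate

• If \( \sigma_\varepsilon \) is small, the errors tend to be close to their mean, and the model fits the data well.
• Since the mean error is equal to zero, the model fits the data well when \( \sigma_\varepsilon \) is close to zero.
• Therefore, we can use \( \sigma_\varepsilon \) as a measure of the suitability of using a linear model.
• To do this we need to estimate \( \sigma_\varepsilon \). An unbiased estimator of \( \sigma_\varepsilon^2 \) is given by \( s_\varepsilon^2 \):

\[ s_\varepsilon^2 = \frac{SSE}{n - 2} \]

The standard error of estimate is defined by

\[ s_\varepsilon = \sqrt{\frac{SSE}{n - 2}} \]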
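As a rough check of the shortcut formula for SSE and of the standard error of estimate, the sketch below plugs in the summary statistics quoted elsewhere in this deck for the car example (n = 100, sample variances 43.509 and .3000, covariance -2.909). The variable names are illustrative, and the results agree with the Excel output only up to rounding of the inputs.

```python
import math

n = 100
s2_x = 43.509     # sample variance of x (odometer)
s2_y = 0.3000     # sample variance of y (price)
cov_xy = -2.909   # sample covariance of x and y

# Shortcut formula: SSE = (n - 1) * [s_y^2 - cov(x, y)^2 / s_x^2]
sse = (n - 1) * (s2_y - cov_xy ** 2 / s2_x)

# Standard error of estimate: s_eps = sqrt(SSE / (n - 2))
s_eps = math.sqrt(sse / (n - 2))

print(f"SSE={sse:.3f}  s_eps={s_eps:.4f}")
```

Both values are close to the Residual SS (10.446) and the Standard Error (0.3265) reported in the Excel summary output.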


Testing the Slope

• The slope is not equal to zero: different inputs (x) yield different outputs (y). Linear relationship.
• The slope is equal to zero: different inputs (x) yield the same output (y). No linear relationship.


Testing the Slope

• We can draw inference about \( \beta_1 \) from b1 by testing
    H0: \( \beta_1 = 0 \)
    H1: \( \beta_1 \neq 0 \) (or < 0, or > 0)
  – The test statistic is

\[ t = \frac{b_1 - \beta_1}{s_{b_1}}, \qquad \text{where} \quad s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n - 1) s_x^2}} \]

    is the standard error of b1.
  – If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n - 2.


Testing the Slope, Example

• Example 4
  – Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses in Example 2. Use \( \alpha \) = 5%.
  – Solution. The alternative hypothesis here (H1) is of the form 'not equal to zero', because we try to show that there is a linear relationship, which can be supported by either a positive or a negative value for \( \beta_1 \).


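Testing the Slope, Example

• Solving by hand
  – To compute t we need the values of b1 and \( s_{b_1} \).

\[ b_1 = -.0669 \]

\[ s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n - 1) s_x^2}} = \frac{.3265}{\sqrt{(99)(43.509)}} = .00497 \]

\[ t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-.0669 - 0}{.00497} \approx -13.44 \]

    (\( s_x^2 \) is obtained from Descriptive Statistics in Excel.)

  – For the two-tail rejection region, t > t.025 or t < -t.025 with \( \nu \) = n - 2 = 98. From the t-table, t.025 = 1.984 approximately. Since -13.44 < -1.984, the null hypothesis is rejected.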
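The hand calculation of the test statistic can be sketched as follows, using the values quoted in this example (b1 = -.0669, standard error of estimate .3265, sample variance 43.509, n = 100) and the approximate critical value 1.984 from the t-table. The variable names are illustrative; the critical value is hard-coded rather than computed, to keep the sketch free of external libraries.

```python
import math

n = 100
b1 = -0.0669       # estimated slope
s_eps = 0.3265     # standard error of estimate
s2_x = 43.509      # sample variance of x
t_crit = 1.984     # approximate t.025 with 98 degrees of freedom (from the t-table)

# Standard error of b1 and the test statistic under H0: beta1 = 0
s_b1 = s_eps / math.sqrt((n - 1) * s2_x)
t_stat = (b1 - 0) / s_b1

# Two-tail rejection rule
reject_h0 = abs(t_stat) > t_crit

print(f"s_b1={s_b1:.5f}  t={t_stat:.2f}  reject H0: {reject_h0}")
```

Because the inputs are rounded, the computed t differs slightly from the -13.44 shown in the Excel output, but the conclusion is the same.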


Testing the Slope, Example

• Using the computer (see data in Car Price)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.805168
R Square            0.648295
Adjusted R Square   0.644707
Standard Error      0.326489
Observations        100

ANOVA
             df   SS           MS           F             Significance F
Regression   1    19.2556074   19.2556074   180.6429887   5.75078E-24
Residual     98   10.4462926   0.10659482
Total        99   29.7019

            Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept   17.24873       0.18209257       94.7250453    3.57186E-98   16.88737056    17.61008
Odometer    -0.066861      0.00497464       -13.4403493   5.75078E-24   -0.076732894   -0.056989

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.


Coefficient of Determination \( R^2 \)

– To measure the strength of the linear relationship we use the coefficient of determination.

[Figure: two scatter diagrams. In the first, the line explains all the variation among the different y values. In the second, the line explains only some of the variation among the y values (there are errors).]


Coefficient of determination

• The mathematical formulation of \( R^2 \) is based on the following characteristic (stated without proof):

  Overall variation in y = the variation of the regression model around the mean + the variation of the actual points around the line

  That is: SST = SSR + SSE


Coefficient of Determination \( R^2 \)

Coefficient of determination = [Variation in y explained by the linear relationship] / [Total variation in y]

\[ R^2 = \frac{SSR}{SST}, \qquad \text{where} \quad SST = \sum (y_i - \bar{y})^2 \quad \text{and} \quad SSR = \sum (\hat{y}_i - \bar{y})^2 \]


Coefficient of determination - insight

SST = SSR + SSE

• When the relationship between x and y is perfectly linear, there are no deviations of the actual points from the line, so SSE = 0. Therefore, SSR = SST and \( R^2 \) = SSR/SST = 1.

• When the relationship between x and y is not perfectly linear, there are deviations of the actual points from the regression line, so SSE > 0. Therefore, SSR < SST, and \( R^2 \) = SSR/SST < 1.

• When no linear relationship exists between x and y, none of the total variation among the actual points is explained by a linear relationship, so SST = SSE. Therefore, \( R^2 \) = 0.


Coefficient of Determination, Example

• Example 3
  – Find the coefficient of determination for Example 2; what does this statistic tell you about the model?
• Solution
  – Solving by hand: we use an alternative form of the \( R^2 \) formula that makes it easier to use Excel.

\[ R^2 = \frac{[\operatorname{cov}(X, Y)]^2}{s_x^2 s_y^2} = \frac{[-2.909]^2}{(43.509)(.3000)} = .6483 \]
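The two routes to the coefficient of determination can be compared numerically. The sketch below computes it once from the covariance form used in this example and once as SSR/SST from the ANOVA figures in the Excel output for the same data; the variable names are illustrative.

```python
# Covariance form, with the summary statistics of the car example
cov_xy = -2.909
s2_x = 43.509
s2_y = 0.3000
r2_from_cov = cov_xy ** 2 / (s2_x * s2_y)

# ANOVA form, with the sums of squares from the Excel summary output
ssr = 19.2556074   # Regression SS
sse = 10.4462926   # Residual SS
sst = 29.7019      # Total SS
r2_from_anova = ssr / sst

# The decomposition SST = SSR + SSE should hold for these figures
decomposition_gap = abs(ssr + sse - sst)

print(f"R2 (cov form)={r2_from_cov:.4f}  R2 (SSR/SST)={r2_from_anova:.4f}")
```

The two values agree to about four decimal places; the small gap comes from rounding in the summary statistics.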


Coefficient of Determination, Example

  – Using the computer: from the regression output we have

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.805168
R Square            0.648295
Adjusted R Square   0.644707
Standard Error      0.326489
Observations        100

ANOVA
             df   SS           MS           F             Significance F
Regression   1    19.2556074   19.2556074   180.6429887   5.75078E-24
Residual     98   10.4462926   0.10659482
Total        99   29.7019

            Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept   17.24873       0.18209257       94.7250453    3.57186E-98   16.88737056    17.61008
Odometer    -0.066861      0.00497464       -13.4403493   5.75078E-24   -0.076732894   -0.056989

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.


Coefficient of Correlation

• The coefficient of correlation is used to measure the strength of association between two variables.

• The coefficient values range between -1 and 1.
  – If r = -1 (negative association) or r = +1 (positive association), every point falls on the regression line.
  – If r = 0, there is no linear pattern.

• The coefficient can be used to test for linear relationship between two variables.
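For the car example, the coefficient of correlation can be sketched from the same summary statistics. The slides do not state a formula for r; the standard definition r = cov(x, y) / (s_x s_y) is assumed here, and the variable names are illustrative.

```python
import math

cov_xy = -2.909   # sample covariance of odometer and price
s2_x = 43.509     # sample variance of odometer
s2_y = 0.3000     # sample variance of price

# Standard sample correlation (assumed definition, not given in the slides)
r = cov_xy / math.sqrt(s2_x * s2_y)

print(f"r={r:.3f}  r^2={r * r:.4f}")
```

The magnitude of r matches the 'Multiple R' of 0.805 in the Excel output, its sign matches the negative slope, and its square reproduces the coefficient of determination.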


16.6 Using the Regression Equation

• Before using the regression model, we need to assess how well it fits the data.

• If we are satisfied with how well the model fits the data, we can use it to predict the values of y.

• To make a prediction we use
  – point prediction, and
  – interval prediction.


Point Prediction

• Example 5
  – Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 2).

\[ \hat{y} = 17067 - .0623x = 17067 - .0623(40{,}000) = 14{,}575 \qquad \text{(a point prediction)} \]

  – It is predicted that a car with 40,000 miles on the odometer would sell for $14,575.
  – How close is this prediction to the real price?


Interval Estimates

• Two intervals can be used to discover how closely the predicted value will match the true value of y.
  – Prediction interval: predicts y for a given value of x.
  – Confidence interval: estimates the average y for a given x.

– The confidence interval

\[ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1) s_x^2}} \]

– The prediction interval

\[ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1) s_x^2}} \]


Interval Estimates, Example

• Example 5 – continued
  – Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
  – Two types of predictions are required:
    • A prediction for a specific car
    • A prediction for the mean price per car


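Interval Estimates, Example

• Solution
  – A prediction interval provides the price estimate for a single car:

\[ \hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1) s_x^2}} \]

\[ = [17.250 - .0669(40)] \pm 1.984(.3265) \sqrt{1 + \frac{1}{100} + \frac{(40 - 36.011)^2}{(100 - 1)(43.509)}} = 14.574 \pm .652 \]

Here b0 = 17.250, b1 = -.0669, \( x_g \) = 40, \( \bar{x} \) = 36.011, \( s_x^2 \) = 43.509, \( s_\varepsilon \) = .3265, and t.025,98 = 1.984 approximately.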


Interval Estimates, Example

• Solution – continued
  – A confidence interval provides the estimate of the mean price per car for a Ford Taurus with a 40,000-mile reading on the odometer.

• The confidence interval (95%) =

\[ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 14.574 \pm 1.984(.3265) \sqrt{\frac{1}{100} + \frac{(40 - 36.011)^2}{(100 - 1)(43.509)}} = 14.574 \pm .076 \]
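Both intervals of Example 5 can be reproduced with a short sketch. The inputs are the summary statistics quoted in the example; the intercept is taken as 17.250 (an assumption, chosen so that the point prediction equals the 14.574 used in the interval calculations), the critical value 1.984 is the approximate table value, and the function name is illustrative.

```python
import math

n = 100
x_bar = 36.011            # mean odometer reading (1000s of miles)
s2_x = 43.509             # sample variance of odometer readings
s_eps = 0.3265            # standard error of estimate
t_crit = 1.984            # approximate t.025 with 98 degrees of freedom
b0, b1 = 17.250, -0.0669  # intercept assumed so that y_hat(40) = 14.574

def interval_half_widths(x_g):
    """Return (prediction, confidence) half-widths at the given x value."""
    # Term shared by both intervals: 1/n + (x_g - x_bar)^2 / ((n - 1) * s_x^2)
    shared = 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s2_x)
    prediction = t_crit * s_eps * math.sqrt(1 + shared)
    confidence = t_crit * s_eps * math.sqrt(shared)
    return prediction, confidence

x_g = 40
y_hat = b0 + b1 * x_g
pred_hw, conf_hw = interval_half_widths(x_g)

print(f"prediction interval: {y_hat:.3f} +/- {pred_hw:.3f}")
print(f"confidence interval: {y_hat:.3f} +/- {conf_hw:.3f}")
```

The prediction interval for a single car is much wider than the confidence interval for the mean price, because it also has to absorb the variation of an individual error around the line.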

