Simple LR Lecture.ppt

Simple Linear Regression

Simple Linear Regression

• Regression is a statistical method that represents the relationship between two variables by approximating it with a straight line
– Since not all relationships are linear (straight line), simple LR only works well for bivariate data that have a linear relationship
– Regression analysis develops a linear equation showing how the two variables are related

• Requires a Y-intercept (b0) and a slope (b1)

Computing Regression Line

[Figures: two scatter plots of data versus time, both axes running from -1 to 6]

Computing Regression Line

[Figure: scatter plot of data versus time, both axes running from -1 to 6, with a fitted line and the residuals e1, e2, e3, e4 marked]

Computing Regression Line: Least Squares Line

• The least squares line is the straight line that best passes through the points of a scatter diagram

• The least squares line is the line through the data that minimizes the sum of the squared differences between the observations and the line (these differences are commonly called residuals):

$$\sum e_i^2 = e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2$$

Sum of Squares of Error

• How can we determine the sum of squares of error?
– By using the following formulas; remember, the regression line minimizes this value (SSE)

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$\hat{y}_i = b_0 + b_1 x_i$$

ŷ (Y-hat) is the y value from the regression line

y is the actual observed y value

Least Squares Formula

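$$b_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$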
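The least squares formulas can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: the function name `fit_line` is hypothetical, and the six (experience, salary) pairs are the values read from the salary example used later in this deck.

```python
# Experience (years) and salary ($ thousand), read from the deck's salary example.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]


def fit_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # S_xy and S_xx in their "deviation from the mean" form.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(x, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}")  # prints: b0 = 15.32, b1 = 1.673
```

The result agrees with the equation quoted in the salary example, Salary = 15.32 + 1.673 Experience.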

Straight-Line Relationship

• Y = b0 + b1X

• b0 represents the Y-intercept, which is the value of Y if X = 0.

• b1 is the slope of the line, which is the amount of change in Y for a unit change in X.

Assumptions of Simple LR

[Figure: PDFs of Y at x = 1, 2, 3, and 4 (Y/X=1 through Y/X=4)]

Note

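Mean response: $E(Y \mid x) = \beta_0 + \beta_1 x$ (population), $\hat{y} = b_0 + b_1 x$ (sample)

Individual response: $y = \beta_0 + \beta_1 x + \varepsilon$ (population), $y = b_0 + b_1 x + e$ (sample)

The line is a deterministic model for the mean of Y given X, but not for an actual y value. In any case, the derived LR equation, as long as it is proven stable and reliable, could be used to predict either the average y value or the actual y value.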

Assumptions of Simple LR

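• $E(\varepsilon \mid X) = 0$: zero mean
• $\mathrm{Var}(\varepsilon \mid X) = \sigma^2$: constant (homogeneous/homoscedastic) variance
• $\varepsilon$ is normally distributed
• The value of $\varepsilon$ associated with any particular value of Y is independent of the $\varepsilon$ associated with any other value of Y, as if the errors come from a random sample:

$$Y_1 = \beta_0 + \beta_1 x_1 + \varepsilon_1 \quad \text{and} \quad Y_2 = \beta_0 + \beta_1 x_2 + \varepsilon_2, \qquad \varepsilon_1 \text{ and } \varepsilon_2 \text{ are independent}$$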

Another Note

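$$\varepsilon \mid X \sim N(0, \sigma^2), \qquad Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$$

An unbiased estimate of $\sigma^2$ is:

$$s^2 = \frac{SSE}{n-2} = \frac{S_{yy} - b_1 S_{xy}}{n-2}, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$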

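Inferences on Regression Coefficients

On $\beta_1$ using b1:

– Using confidence interval:

$$b_1 \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$

– Using hypothesis testing:

$$t = \frac{b_1 - \beta_1}{s/\sqrt{S_{xx}}}$$

On $\beta_0$ using b0:

– Using confidence interval:

$$b_0 \pm t_{\alpha/2}\, s\sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n S_{xx}}}$$

– Using hypothesis testing:

$$t = \frac{b_0 - \beta_0}{s\sqrt{\dfrac{\sum_{i=1}^{n} x_i^2}{n S_{xx}}}}$$

In the test statistics, $\beta_1$ and $\beta_0$ are the values stated under H0. v = n - 2 degrees of freedom for all inferences.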
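As a rough sketch of the confidence-interval formulas for the slope and intercept, the snippet below applies them to the six salary and experience pairs from this deck's example. The 95% confidence level and the tabled value `T_CRIT = 2.776` (t with 4 degrees of freedom, upper 2.5% point) are assumptions made for illustration; the variable names are hypothetical.

```python
import math

# Experience (years) and salary ($ thousand), read from the deck's salary example.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt((s_yy - b1 * s_xy) / (n - 2))  # error standard deviation

# Assumption: 95% confidence, so t_{0.025} with v = n - 2 = 4 degrees of
# freedom, taken from a standard t table.
T_CRIT = 2.776

se_b1 = s / math.sqrt(s_xx)
se_b0 = s * math.sqrt(sum(xi ** 2 for xi in x) / (n * s_xx))

ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)
ci_b0 = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)

print("95% CI for the slope:", ci_b1)
print("95% CI for the intercept:", ci_b0)
```

With these data the slope interval excludes 0 while the intercept interval includes 0.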

Hypothesis Test on the Slope of the Regression Line

• How do we know that a significant linear relationship exists using regression?
– The slope, b1, will give us an indication.
• If the slope of the least squares line is zero, there is no linear relationship. However, if the slope of the least squares line is significantly greater than 0 or significantly less than 0, then we can conclude that a linear relationship exists
• Therefore, we want to test the following hypothesis:

$H_0: \beta_1 = 0$ (X provides no information)
$H_a: \beta_1 \neq 0$ (X does provide information)

The test statistic in every case is

$$t^* = \frac{b_1}{s_{b_1}}$$

– Two-tailed: $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$. Reject H0 if $|t^*| > t_{\alpha/2,\,n-2}$
– Right-tailed: $H_0: \beta_1 \le 0$, $H_a: \beta_1 > 0$. Reject H0 if $t^* > t_{\alpha,\,n-2}$
– Left-tailed: $H_0: \beta_1 \ge 0$, $H_a: \beta_1 < 0$. Reject H0 if $t^* < -t_{\alpha,\,n-2}$

Note: df = n - 2 and $s_{b_1} = s/\sqrt{S_{xx}}$

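Prediction

Confidence interval on the mean response at x0:

$$\hat{y}_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

Prediction interval on a single response y0:

$$\hat{y}_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$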
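The two interval formulas differ only by the extra "1 +" under the square root. A minimal sketch, assuming the six salary and experience pairs from this deck's example, a hypothetical query point `x0 = 20`, and a 95% level with the tabled value 2.776 for 4 degrees of freedom:

```python
import math

# Experience (years) and salary ($ thousand), read from the deck's salary example.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt((s_yy - b1 * s_xy) / (n - 2))

T_CRIT = 2.776  # assumption: t_{0.025} with n - 2 = 4 df, from a t table
x0 = 20         # assumption: illustrative query point (20 years of experience)

y0_hat = b0 + b1 * x0
leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx

# Half-width of the confidence interval on the mean response at x0.
half_mean = T_CRIT * s * math.sqrt(leverage)
# Half-width of the prediction interval on a single new response at x0.
half_pred = T_CRIT * s * math.sqrt(1 + leverage)

print(f"mean response: {y0_hat:.2f} +/- {half_mean:.2f}")
print(f"single response: {y0_hat:.2f} +/- {half_pred:.2f}")
```

The prediction interval is always the wider of the two, because a single response carries its own error variance on top of the uncertainty in the fitted line.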

Example: Salary and Experience

• Salary vs. Years Experience
– For n = 6 employees
– Linear (straight line) relationship
– Increasing relationship
• higher salary generally goes with higher experience
– Correlation r = 0.8667

Experience (years)   Salary ($ thousand)
15                   30
10                   35
20                   55
5                    22
15                   40
5                    27

[Figure: scatter plot of Salary ($ thousand) versus Experience. Mary earns $55,000 per year and has 20 years of experience.]

Example: Salary and Experience

• Summarizes bivariate data: predicts Y from X
– with smallest errors (in the vertical direction, for the Y axis)
– Intercept is 15.32 salary (at 0 years of experience)
– Slope is 1.673 salary (for each additional year of experience, on average)

[Figure: scatter plot of Salary (Y) versus Experience (X) with the fitted line]

Salary = 15.32 + 1.673 Experience

Y = b0 + b1X

Predicted Values and Residuals

• Predicted value comes from the prediction equation Ŷ = b0 + b1X = 15.32 + 1.673X
• For example, Mary (with 20 years of experience) has a predicted salary = 15.32 + 1.673(20) = 48.8
– So does anyone with 20 years of experience
• Residual is the actual Y minus predicted Y (Y – Ŷ)
– Mary’s residual is 55 – 48.8 = 6.2
• She earns about $6,200 more than the predicted salary for a person with 20 years of experience
• A person who earns less than predicted will have a negative residual
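The predicted values and residuals for all six employees can be computed the same way. This is a small sketch using unrounded coefficients, so Mary's numbers come out as about 48.77 and 6.23 rather than the rounded 48.8 and 6.2 on the slide; the list names are hypothetical.

```python
# Experience (years) and salary ($ thousand), read from the deck's salary example.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]  # actual minus predicted

for xi, yi, yhat, e in zip(x, y, predicted, residuals):
    print(f"x = {xi:2d}  y = {yi:2d}  y-hat = {yhat:6.2f}  residual = {e:6.2f}")
```

Positive residuals mark employees earning more than predicted, negative ones less, and the residuals of a least squares line with an intercept sum to zero (up to rounding).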

Predicted and Residual (continued)

[Figure: Salary versus Experience with the fitted line. Mary earns 55 thousand; Mary’s predicted value is 48.8; Mary’s residual is 6.2 (55 – 48.8).]

Simple Linear Regression Model

• When we use a straight line to predict Y from X, we use a statistical model of the form:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

– $\beta_0 + \beta_1 X$: assumed line about which all values of X and Y will fall
– $\varepsilon$: error, which contains all other variability not explained by the independent variable (X)

Note: $\beta_0$ and $\beta_1$ refer to the straight line for the population; we will be using sample data and will use b0 and b1 to refer to the straight line for the sample

Error Variance

• The measures most commonly used to assess how well a line fits a set of points are the error variance and the error standard deviation

$$s^2 = \frac{SSE}{n-2}, \qquad s = \sqrt{\frac{SSE}{n-2}}$$

• What is s?
– Measure of the variation of the Y values around the least squares line
– Average distance of prediction from actual
– Average size of residuals
– Standard deviation of residuals
• Interpretation: similar to standard deviation
• Can move the least-squares line up and down by s
– About 68% of the data are within one “standard error of estimate” of the least-squares line (for a bivariate normal distribution)

[Figure: Salary versus Experience showing the least-squares line together with the parallel lines (least-squares line) + S and (least-squares line) – S]

Error Variance

• Regression and Prediction Error
– Predicting Y as Ȳ (not using regression)
• Errors are approximately SY = 11.686
– Predicting Y as b0 + b1X (using regression)
• Errors are approximately S = 6.52
• Errors are smaller when regression is used!
– This is often the true payoff for using regression
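The two error sizes quoted for the salary example can be reproduced from the data. A minimal sketch, assuming the six pairs read from the example and the shortcut SSE = Syy - b1 Sxy:

```python
import math

# Experience (years) and salary ($ thousand), read from the deck's salary example.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx

# Typical error when every Y is predicted by the mean (no regression).
s_y = math.sqrt(s_yy / (n - 1))

# Typical error when Y is predicted from the regression line.
sse = s_yy - b1 * s_xy
s = math.sqrt(sse / (n - 2))

print(f"S_Y = {s_y:.3f}, s = {s:.2f}")  # prints: S_Y = 11.686, s = 6.52
```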

Example: Salary and Experience

Measuring the Strength of the Model

• Another item of interest is to determine how well the regression model fits the data

• To determine this, we use the coefficient of determination (r²), which gives the fraction (often quoted as a percentage) of explained variation in the dependent variable using the model.

Coefficient of Determination

$$r^2 = 1 - \frac{SSE}{S_{YY}}$$

r² = coefficient of determination = fraction of explained variation in the dependent variable using the simple linear regression model

Taking the square root of the coefficient of determination gives the magnitude of the correlation coefficient, r.

Correlation Coefficient

• The sample correlation coefficient, r, measures the strength of the linear relationship that exists within a sample of n bivariate data.

$$r = b_1\sqrt{\frac{S_{xx}}{S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

Note: When one computes r by taking the square root of r², affix the sign of b1 to the final value.
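A short sketch of both routes to r, using the six salary and experience pairs from this deck's example: the direct formula, and the square root of r² with the sign of b1 attached. Variable names are hypothetical.

```python
import math

# Experience (years) and salary ($ thousand), read from the deck's salary example.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
sse = s_yy - b1 * s_xy

# Coefficient of determination.
r_squared = 1 - sse / s_yy

# Route 1: square root of r^2, carrying the sign of the slope b1.
r_from_r2 = math.copysign(math.sqrt(r_squared), b1)

# Route 2: direct formula S_xy / sqrt(S_xx * S_yy).
r_direct = s_xy / math.sqrt(s_xx * s_yy)

print(f"r^2 = {r_squared:.4f}, r = {r_direct:.4f}")
```

Both routes give r of about 0.8667, the correlation quoted in the salary example.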

Interpreting the Correlation Coefficient

• If r = 1 then X and Y have a perfect positive linear relationship.

• If r= -1 then X and Y have a perfect negative linear relationship.

• If r = 0 then X and Y have no linear relationship.

• If 0 < r < 1 then X and Y are positively related. The closer to 1 the stronger the linear relationship.

• If 0 > r > -1 then X and Y are negatively related. The closer to -1 the stronger the linear relationship.

Correlation Coefficient Summary

• r ranges from -1.0 to 1.0.
• The larger | r | is, the stronger the linear relationship.
• r near zero indicates that there is little or no linear relationship. X and Y are uncorrelated
• The sign of r tells you whether the relationship between X and Y is a positive or a negative relationship.
• The value of r tells you very little about the slope of the line, except that if r is positive the slope of the line is positive, and if r is negative the slope is negative.

Examples: Interpreting Correlation

• rxy = 1
– A perfect straight line tilting up to the right
• rxy = 0
– No overall tilt
– No linear relationship?
• rxy = –1
– A perfect straight line tilting down to the right

[Figures: scatter plots of Y versus X illustrating each case]

Various Values of rxy

Significance test for the Correlation

• Due to sampling error, the value of r may not reflect the true relationship in the entire population, especially if the sample is quite small
• Therefore, a formal hypothesis test may be needed
• Hypothesis tested:
– $H_0: \rho = 0$ (no correlation)
– $H_a: \rho \neq 0$ (correlation exists)

Note: $\rho$ = population correlation coefficient

Hypothesis Test

$$t^* = \frac{b_1}{s_{b_1}}$$

$H_0: \beta_1 = 0$ (no linear relationship exists)
$H_a: \beta_1 \neq 0$ (a linear relationship exists)

Reject H0 if $|t^*| > t_{\alpha/2,\,n-2}$

Another way to test whether a linear relationship exists between the two variables of interest is to use the relationship between b1 and r (they are closely related):

$$t^* = \frac{r}{\sqrt{\dfrac{1 - r^2}{n-2}}}$$

This gives exactly the same value for t* as b1 / s_b1.
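The claim that the slope-based and correlation-based statistics coincide can be checked numerically. A minimal sketch with the six salary and experience pairs from this deck's example; the two-tailed 5% critical value 2.776 (4 degrees of freedom) is taken from a t table and is an assumption of the illustration.

```python
import math

# Experience (years) and salary ($ thousand), read from the deck's salary example.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
s = math.sqrt((s_yy - b1 * s_xy) / (n - 2))
r = s_xy / math.sqrt(s_xx * s_yy)

# t* from the slope: b1 divided by its standard error s / sqrt(S_xx).
t_from_slope = b1 / (s / math.sqrt(s_xx))

# t* from the correlation coefficient.
t_from_r = r / math.sqrt((1 - r ** 2) / (n - 2))

T_CRIT = 2.776  # assumption: t_{0.025} with 4 df, from a t table
reject_h0 = abs(t_from_slope) > T_CRIT

print(f"t* (slope) = {t_from_slope:.4f}, t* (r) = {t_from_r:.4f}, reject H0: {reject_h0}")
```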

Hypothesis Test Continued

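• If one desires to carry out the general test:

$H_0: \rho = \rho_0$
$H_a: \rho \neq \rho_0$, or $\rho > \rho_0$, or $\rho < \rho_0$

One can use:

$$z = \frac{\sqrt{n-3}}{2}\,\ln\left[\frac{(1+r)(1-\rho_0)}{(1-r)(1+\rho_0)}\right]$$

This works on the assumption that X and Y jointly follow the bivariate normal distribution.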
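The z statistic for the general test on the correlation is easy to evaluate directly. In the sketch below the function name `fisher_z_statistic` and the hypothesized value `rho0 = 0.5` are hypothetical choices for illustration; r = 0.8667 and n = 6 come from the salary example.

```python
import math


def fisher_z_statistic(r, n, rho0):
    """z = (sqrt(n - 3) / 2) * ln[(1 + r)(1 - rho0) / ((1 - r)(1 + rho0))]."""
    ratio = ((1 + r) * (1 - rho0)) / ((1 - r) * (1 + rho0))
    return math.sqrt(n - 3) / 2 * math.log(ratio)


r, n = 0.8667, 6   # sample correlation and sample size from the salary example
rho0 = 0.5         # assumption: hypothesized population correlation

z = fisher_z_statistic(r, n, rho0)
print(f"z = {z:.3f}")

# For a two-tailed test at the 5% level, compare |z| with 1.96.
reject_h0 = abs(z) > 1.96
print("reject H0:", reject_h0)
```

The same quantity can be written as sqrt(n - 3) times the difference of the inverse hyperbolic tangents of r and rho0, which is a convenient cross-check.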

Exercise Problems

• Problem #5, p. 359 (manual)
• Problem #6, p. 359 (excel)
• Problem #7, p. 359 (excel)
• Problem #7, p. 371
• Problems #1 and 2, p. 396
• Problem #5, p. 380

Check for Model Significance and Adequacy

• The ANOVA approach:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$SST = SSR + SSE$$

$$S_{yy} = b_1 S_{xy} + SSE$$

Similar to testing:
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$

The ANOVA table:

Source of Variation   Sum of Squares   Dof   Mean Square       Comp. F
Regression            SSR              1     SSR               SSR/s2
Error                 SSE              n-2   s2 = SSE/(n-2)
Total                 SST              n-1

Reject H0 if computed F > $F_{\alpha}(1, n-2)$
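The ANOVA quantities can be assembled from the same sums of squares. This is a sketch on the six salary and experience pairs from this deck's example; the 5% critical value F(1, 4) = 7.71 is taken from an F table and is an assumption of the illustration.

```python
import math

# Experience (years) and salary ($ thousand), read from the deck's salary example.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx

sst = s_yy            # total sum of squares, n - 1 df
ssr = b1 * s_xy       # regression sum of squares, 1 df
sse = sst - ssr       # error sum of squares, n - 2 df
s2 = sse / (n - 2)    # error mean square

f_comp = ssr / s2
F_CRIT = 7.71         # assumption: F_{0.05}(1, 4) from an F table

# The two-tailed t statistic for the slope, for comparison with F.
t_star = b1 / (math.sqrt(s2) / math.sqrt(s_xx))

print(f"SSR = {ssr:.2f}, SSE = {sse:.2f}, SST = {sst:.2f}")
print(f"F = {f_comp:.2f}, reject H0: {f_comp > F_CRIT}")
```

In simple linear regression the computed F equals the square of t*, which is why the ANOVA test is "similar to" the two-tailed test on the slope.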

Check for Model Significance and Adequacy

• If repeated observations are made at several X values, the SSE term shown previously can be further divided into Error-Lack of Fit and Error-Pure Experimental.
• Computational formula:

$y_{ij}$ = the jth value of the random variable $Y_i$

$$T_{i.} = \sum_{j=1}^{n_i} y_{ij}, \qquad \bar{y}_{i.} = \frac{T_{i.}}{n_i}$$

$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n_i - 1}$$

$$s^2 = \frac{\sum_{i=1}^{k} (n_i - 1)\, s_i^2}{n - k} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n - k}$$

$n_i$ = no. of observations at $x_i$

k = no. of distinct values of x

$s^2$ = pure experimental error mean square; its numerator is SSE(pure)

SS(Lack of Fit) = SSE - SSE(pure)


The ANOVA table becomes:

Source of Variation   Sum of Squares    Dof   Mean Square (MS)           Comp. F
Regression            SSR               1     SSR                        SSR/s2
Error                 SSE               n-2
  Lack of Fit         SSE - SSE(pure)   k-2   [SSE - SSE(pure)]/(k-2)    MS(lack of fit)/s2
  Pure Error          SSE(pure)         n-k   s2
Total                 SST               n-1

Model significant if F(regression) > F(1, n-k)

Model adequate if F(lack of fit) < F(k-2, n-k)
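The lack-of-fit split needs repeated observations at some x values. The salary data happen to have repeats at 5 and 15 years of experience, so they can serve as a small illustration, although with only n - k = 2 pure-error degrees of freedom the test has little power. The critical value F(2, 2) = 19.00 at the 5% level is taken from an F table; all names are hypothetical.

```python
from collections import defaultdict

# Experience (years) and salary ($ thousand), read from the deck's salary example.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Group the responses by distinct x value.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
k = len(groups)  # number of distinct x values

# Pure error: variation of the responses around their own group mean.
sse_pure = 0.0
for ys in groups.values():
    group_mean = sum(ys) / len(ys)
    sse_pure += sum((yij - group_mean) ** 2 for yij in ys)

ss_lack_of_fit = sse - sse_pure
s2_pure = sse_pure / (n - k)             # pure error mean square
ms_lack_of_fit = ss_lack_of_fit / (k - 2)
f_lack_of_fit = ms_lack_of_fit / s2_pure

F_CRIT = 19.00  # assumption: F_{0.05}(k - 2, n - k) = F_{0.05}(2, 2) from an F table
print(f"SSE(pure) = {sse_pure:.2f}, SS(lack of fit) = {ss_lack_of_fit:.2f}")
print(f"F(lack of fit) = {f_lack_of_fit:.3f}, adequate: {f_lack_of_fit < F_CRIT}")
```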

Model Adequacy

• Significant lack of fit means that there is considerable variation being caused by higher-order terms, that is, terms in x other than the linear (first-order) term.

• Illustration given in Figures 11.11 and 11.12, pp. 378-379 of the book

Checking Model Assumptions

1. The errors are normally distributed with a mean of zero.
• Construct a normal probability plot (plot of residuals)
– If the resulting graph is linear, the normality assumption is supported
• Conduct a goodness-of-fit test: chi-square, KS, Shapiro-Wilk
• Statistical tests on kurtosis and skewness

2. The variance of the error component is the same for each value of X.
• Plot the residuals against the independent variable, X
– If no pattern exists, this assumption holds

3. The errors are independent of each other.
• Look for autocorrelation
• Plot sample residuals by time (time series analysis)

Normal Probability Plot

• Compares the cumulative distribution of actual data values with the cumulative distribution of a normal distribution. (If the data are normal, the points should fall around the diagonal straight line.)

Deviations from Normality

Statistical Checking for Normality

• Deviations from Normality
– Kurtosis: refers to the “peakedness” or “flatness” of the distribution. If normal, (excess) kurtosis is zero.
– Skewness: deals with the symmetry of the distribution; a skewed variable is a variable whose mean is not in the center of the distribution. If normal, skewness is zero.

Checking for Normality

• Statistical Test for Kurtosis

$$z_{\text{kurtosis}} = \frac{\text{kurtosis}}{\sqrt{24/N}}$$

• Statistical Test for Skewness

$$z_{\text{skewness}} = \frac{\text{skewness}}{\sqrt{6/N}}$$
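A minimal sketch of the two z statistics follows. The slides do not say which sample definitions of skewness and kurtosis are intended, so the moment-based skewness and excess kurtosis used here are assumptions, as are the function name `normality_z_scores` and the made-up residuals.

```python
import math


def normality_z_scores(values):
    """Return (z_skewness, z_kurtosis) = (skew / sqrt(6/N), kurt / sqrt(24/N))."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5        # assumption: moment-based skewness
    kurtosis = m4 / m2 ** 2 - 3.0    # assumption: excess kurtosis (0 if normal)
    return skewness / math.sqrt(6 / n), kurtosis / math.sqrt(24 / n)


residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical residuals for illustration
z_skew, z_kurt = normality_z_scores(residuals)
print(f"z_skewness = {z_skew:.3f}, z_kurtosis = {z_kurt:.3f}")
```

Each z value is compared with a standard normal critical value, for example 1.96 for a two-tailed test at the 5% level.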

Checking for Normality

• Other Tools
– Histogram (good for numerous data points)
– Goodness-of-fit tests (good for many data points, 30 or more, but overly sensitive for very large samples, 1000 or more)

Checking Model Assumptions

[Figure: Equal Variance check. Panel A: assumption holds. Panel B: assumption violated.]

Checking Model Assumptions

[Figure: autocorrelation exists, a violation of the assumption that the errors are independent of each other]

Importance of Assumptions

• Normality: needed for the t-test and F-test.
• Homoscedasticity: needed for the t-test and F-test, to ensure that the variance used in explanation and prediction is distributed across the range of independent variable values
• Absence of Correlated Errors: gives confidence that prediction errors are independent of the levels at which one is trying to predict, and assurance that no other systematic variable is affecting the results and left out of the analysis

On Violation of Assumptions

• One violation can be the result of another. Example: non-normality is linked to, or can be the result of, non-constant variance.
• A remedy applied to one can solve another.
• Remedy available: data or variable transformation

Notes on Transformation

• Two Purposes:
1. Correct violations of statistical assumptions.
2. Improve correlation between variables.
• How to Choose?
1. Theoretical basis – nature of data (e.g. sqrt transform works well with frequency count data, arcsin transform for proportion data)
2. Trial and Error
• Not a magic cure for all violations. Will not eliminate all violations but could lead to very significant improvements.

Suggested Transforms for Non-normality

Note: Inverse transform usually works well with “flat” distributions.

Suggested Transforms for Heteroscedasticity

• If cone opens to the right – try inverse transform

• If cone opens to the left – try square root transform

Some General Guidelines on Transformation

• For noticeable effect of transformation, ratio of variable mean to std. dev < 4.

• If transformation can be performed on two variables (in non-linearity) , select variable with smallest average to s ratio.

• Transformation should be applied to IVs except for cases of heteroscedasticity. If relationship is heteroscedastic and non-linear there might be a need to transform both IV and DV.

• Transformation may change the interpretation of variables. Be careful!

Suggested Transforms for Non-linearity

In any of the given illustrations, transformation could be carried out on either the independent or dependent variable. When multiple transformation possibilities are shown, start with the top method in each quadrant and move downward until linearity is achieved.

Simple Non-linear Regression by Linearization

Function                                    Proper Transformation            Form of Simple Linear Regression
Exponential: $y = \alpha e^{\beta x}$       $y^* = \ln y$                    Regress y* against x
Power: $y = \alpha x^{\beta}$               $y^* = \log y$, $x^* = \log x$   Regress y* against x*
Reciprocal: $y = \alpha + \beta (1/x)$      $x^* = 1/x$                      Regress y against x*
Hyperbolic: $y = x / (\alpha + \beta x)$    $y^* = 1/y$, $x^* = 1/x$         Regress y* against x*
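As a sketch of how linearization works for the exponential case, the snippet below generates noise-free data from hypothetical parameters (alpha = 2, beta = 0.5), regresses ln y on x with the least squares formulas, and maps the coefficients back: the intercept estimates ln(alpha) and the slope estimates beta. Real data would carry noise, and the notes that follow on error structure would then apply.

```python
import math

# Assumption: illustrative parameters for y = alpha * exp(beta * x).
ALPHA_TRUE, BETA_TRUE = 2.0, 0.5

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [ALPHA_TRUE * math.exp(BETA_TRUE * xi) for xi in x]

# Transform the response: y* = ln y, so that y* = ln(alpha) + beta * x.
y_star = [math.log(yi) for yi in y]

# Ordinary least squares of y* on x.
n = len(x)
x_bar = sum(x) / n
y_star_bar = sum(y_star) / n
s_xy = sum((xi - x_bar) * (yi - y_star_bar) for xi, yi in zip(x, y_star))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_star_bar - b1 * x_bar

# Back-transform the intercept to recover alpha; the slope is beta directly.
alpha_hat = math.exp(b0)
beta_hat = b1

print(f"alpha-hat = {alpha_hat:.4f}, beta-hat = {beta_hat:.4f}")
```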

Notes on Non-Linear Regression by Linearization

• Model in the transformed variables that has a proper additive error structure is a result of a model in the natural variables with a different type of error structure.

• Performance criteria (s2 and R2) for the transformed model should be based on values of the residuals in the metric of the untransformed response.

Sample Problem to be given in class.

Obtaining the Regression Output

• To fit a linear regression using Excel
– Choose Data Analysis, then Regression
– Choose the two data columns for which the regression is to be calculated
– The Y variable will be on the vertical axis
– Click residual plot and normal probability plot to check assumptions
• Statistica could also be used.

Caveats in Simple LR

• Linear Model May Be Wrong
– Nonlinear? Unequal variability? Clustering?
• Predicting Intervention from Experience is Hard
– Relationship may become different if you intervene
• Intercept May Not Be Meaningful
– if there are no data near X = 0
• Explaining Y from X vs. Explaining X from Y
– Use care in selecting the Y variable to be predicted
• Is there a hidden “Third Factor”?
– Use it to predict better with multiple regression

Date post:	07-Jul-2016
Category:	Documents
Upload:	earl-kristof-li-liao