LECTURE 3: Introduction to Linear Regression and Correlation Analysis
1 Simple Linear Regression
2 Regression Analysis
3 Regression Model Validity
Goals
After this, you should be able to:
- Interpret the simple linear regression equation for a set of data
- Use descriptive statistics to describe the relationship between X and Y
- Determine whether a regression model is significant
Goals
After this, you should be able to:
- Interpret confidence intervals for the regression coefficients
- Interpret confidence intervals for a predicted value of Y
- Check whether regression assumptions are satisfied
- Check whether the data contain unusual values
(continued)
Introduction to Regression Analysis
Regression analysis is used to:
- Predict the value of a dependent variable based on the value of at least one independent variable
- Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
Simple Linear Regression Model
Only one independent variable, x
Relationship between x and y is described by a linear function
Changes in y are assumed to be caused by changes in x
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Population Linear Regression
The population regression model:

    y = β0 + β1x + ε

where
- y = dependent variable
- x = independent variable
- β0 = population y-intercept
- β1 = population slope coefficient
- ε = random error term (residual)
β0 + β1x is the linear component; ε is the random error component.
Linear Regression Assumptions
- The underlying relationship between the x variable and the y variable is linear
- The distribution of the errors has constant variability
- Error values are normally distributed
- Error values are independent (over time)
Population Linear Regression (continued)
[Scatter diagram: the population regression line y = β0 + β1x with intercept β0 and slope β1; for an observed xi, the observed value of y differs from the value predicted by the line by the random error εi.]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

    ŷi = b0 + b1xi

where
- ŷ = estimated (or predicted) y value
- b0 = estimate of the regression intercept
- b1 = estimate of the regression slope
- x = independent variable
Interpretation of the Slope and the Intercept
- b0 is the estimated average value of y when the value of x is zero
- b1 is the estimated change in the average value of y as a result of a one-unit change in x
Finding the Least Squares Equation
The coefficients b0 and b1 will be found using computer software, such as Excel’s data analysis add-in or MegaStat
Other regression measures will also be computed as part of computer-based regression analysis
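The formulas those packages apply internally can be sketched in a few lines of Python (a minimal illustration; the function name is our own, not a library API):

```python
# Minimal sketch of the least-squares formulas that Excel or MegaStat
# applies internally (function name is illustrative, not a library API).
def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)     # sum of squared x-deviations
    s_xy = sum((xi - x_bar) * (yi - y_bar)        # cross-product of deviations
               for xi, yi in zip(x, y))
    b1 = s_xy / s_xx                  # slope
    b0 = y_bar - b1 * x_bar           # intercept: line passes through (x-bar, y-bar)
    return b0, b1
```

Applied to the house-price sample that follows, this reproduces the software's b0 ≈ 98.248 and b1 ≈ 0.10977.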
Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
A random sample of 10 houses is selected
Dependent variable (y) = house price in $1000
Independent variable (x) = square feet
Sample Data for House Price Model
House Price in $1000s(y)
Square Feet (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Regression output from Excel (Data – Data Analysis) or MegaStat (Correlation / Regression)
MegaStat Output
The regression equation is:

    Predicted house price = 98.24833 + 0.10977 (square feet)
Regression Analysis
    r²          0.581     n          10
    r           0.762     k          1
    Std. Error  41.330    Dep. Var.  Price($000)

ANOVA table
    Source      SS           df   MS           F       p-value
    Regression  18,934.9348  1    18,934.9348  11.08    .0104
    Residual    13,665.5652  8     1,708.1957
    Total       32,600.5000  9

Regression output                                              confidence interval
    variables    coefficients  std. error  t (df=8)  p-value   95% lower  95% upper
    Intercept    98.2483
    Square feet  0.1098        0.0330      3.329      .0104    0.0337     0.1858
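The headline numbers in this output can be verified directly from the ten observations. A pure-Python sketch (variable names are our own):

```python
# Recompute the MegaStat summary statistics from the raw house-price sample.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)                                    # 10
x_bar = sum(sqft) / n
y_bar = sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
s_yy = sum((y - y_bar) ** 2 for y in price)      # total SS = 32,600.5

b1 = s_xy / s_xx                                 # 0.10977
b0 = y_bar - b1 * x_bar                          # 98.24833
ss_regression = b1 * s_xy                        # 18,934.93
ss_residual = s_yy - ss_regression               # 13,665.57
r_squared = ss_regression / s_yy                 # 0.581
std_error = (ss_residual / (n - 2)) ** 0.5       # 41.330
```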
Graphical Presentation
House price model: scatter plot and regression line
[Scatter plot: House Price ($1000s), 0 to 450, versus Square Feet, 0 to 3000, with the fitted regression line.]
    house price = 98.24833 + 0.10977 (square feet)

Slope = 0.10977
Intercept = 98.248
Interpretation of the Intercept, b0
b0 is the estimated average value of Y when the value of X is zero
(if x = 0 is in the range of observed x values)
Here, houses with 0 square feet do not occur, so b0 = 98.24833 just
indicates the height of the line.
    house price = 98.24833 + 0.10977 (square feet)
Interpretation of the Slope Coefficient, b1
b1 measures the estimated change in Y as a result of a one-unit
increase in X
    house price = 98.24833 + 0.10977 (square feet)
Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000s) = $109.77 for each additional square foot of size
Least Squares Regression Properties
The simple regression line always passes through the mean of the y variable and the mean of the x variable
The least squares coefficients are unbiased estimates of β0 and β1
Coefficient of Determination, R²
The percentage of variability in Y that can be explained by variability in X.
Note: In the single independent variable case, the coefficient of determination is

    R² = r²

where
R² = coefficient of determination
r = simple correlation coefficient
Examples of R² Values
[Two scatter plots of y versus x: one with correlation = +1, one with correlation = −1; in both cases R² = 1.]
Perfect linear relationship between x and y: 100% of the variation in y is explained by variation in x
Examples of Approximate R² Values
[Two scatter plots of y versus x with 0 < R² < 1: one with negative correlation, one with positive correlation.]
Weaker linear relationship between x and y: some but not all of the variation in y is explained by variation in x
Examples of Approximate R² Values
[Scatter plot of y versus x with no pattern; R² = 0.]
No linear relationship between x and y: the value of y does not depend on x (none of the variation in y is explained by variation in x)
Excel Output
    Regression Analysis
    r²          0.581
    r           0.762
    Std. Error  41.330

58.08% of the variation in house prices is explained by variation in square feet.
The correlation of 0.762 shows a fairly strong direct relationship.
The typical error in predicting price is 41.33 ($000) = $41,330.
Inference about the Slope: t Test
t test for a population slope: is there a linear relationship between x and y?
Null and alternative hypotheses:
    H0: β1 = 0 (no linear relationship)
    Ha: β1 ≠ 0 (linear relationship does exist)
Obtain the p-value from the ANOVA table or from the row for the slope coefficient (they are the same in simple regression).
Inference about the Slope: t Test (continued)
Estimated regression equation:

    house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098.
Does square footage of the house affect its sales price?
Inferences about the Slope: t Test Example
    H0: β1 = 0
    Ha: β1 ≠ 0

From Excel output:
                 Coefficients  Standard Error  t Stat   P-value
    Intercept    98.24833      58.03348        1.69296  0.12892
    Square Feet  0.10977       0.03297         3.32938  0.01039

Decision: the p-value for the slope is 0.01039 < 0.05, so reject H0.
Conclusion: we can be 98.96% confident that square feet is related to house price.
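The t statistic in that output is just the slope divided by its standard error. A sketch using the same data (the 2.306 critical value is t.025 with 8 degrees of freedom, taken from a t table):

```python
# t test for the slope of the house-price regression (df = n - 2 = 8).
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
x_bar, y_bar = sum(sqft) / n, sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
s_yy = sum((y - y_bar) ** 2 for y in price)

b1 = s_xy / s_xx                                    # 0.10977
std_error = ((s_yy - b1 * s_xy) / (n - 2)) ** 0.5   # 41.330
se_b1 = std_error / s_xx ** 0.5                     # 0.03297
t_stat = b1 / se_b1                                 # 3.329

T_CRIT = 2.306                 # t(.025, df = 8), from a t table
reject_h0 = abs(t_stat) > T_CRIT                    # True: reject H0
```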
Regression Analysis for Description
Confidence Interval Estimate of the Slope:
Excel printout for house prices:

                 Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
    Intercept    98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
    Square Feet  0.10977       0.03297         3.32938  0.01039   0.03374     0.18580

We can be 95% confident that house prices increase by between $33.74 and $185.80 for a one square foot increase.
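The interval reported for the slope is b1 ± t.025 · se(b1). Recomputing it from the data (2.306 is the tabled t value for df = 8):

```python
# 95% confidence interval for the slope, house-price sample.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
x_bar, y_bar = sum(sqft) / n, sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
s_yy = sum((y - y_bar) ** 2 for y in price)

b1 = s_xy / s_xx
std_error = ((s_yy - b1 * s_xy) / (n - 2)) ** 0.5
se_b1 = std_error / s_xx ** 0.5

T_CRIT = 2.306                    # t(.025, df = 8)
ci_lower = b1 - T_CRIT * se_b1    # 0.0337
ci_upper = b1 + T_CRIT * se_b1    # 0.1858
```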
Estimates of Expected y for Different Values of x
[Diagram: the fitted line ŷ = b0 + b1x over the scatter of (x, y) points; for a chosen value xp, the height of the line gives the point estimate ŷp of the expected value of y.]
Interval Estimates for Different Values of x
[Diagram: the fitted line ŷ = b0 + b1x with a prediction interval for an individual y at a given xp; the interval widens as xp moves away from x̄.]
The farther xp is from x̄, the less accurate the prediction.
Estimated regression equation:

    house price = 98.25 + 0.1098 (sq. ft.)
Example: House Prices
Predict the price for a house with 2000 square feet:

    house price = 98.25 + 0.1098 (2000) = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850.
Estimation of Individual Values: Example
Find the 95% confidence interval for an individual house with 2,000 square feet
Predicted price ŷ = 317.85 ($1,000s) = $317,850
MegaStat will give both the predicted value as well as the lower and upper limits
Prediction Interval Estimate for y|xp

Predicted values for: Price($000)
                              95% Confidence Interval    95% Prediction Interval
    Square feet  Predicted    lower      upper           lower      upper
    2,000        317.784      280.664    354.903         215.503    420.065

The prediction interval endpoints are from $215,503 to $420,065. We can be 95% confident that the price of a 2000 ft² home will fall within those limits.
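MegaStat's prediction-interval endpoints follow the textbook formula ŷ ± t.025 · s · √(1 + 1/n + (xp − x̄)²/Sxx). A sketch that reproduces them:

```python
# 95% prediction interval for an individual house with 2,000 square feet.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
x_bar, y_bar = sum(sqft) / n, sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
s_yy = sum((y - y_bar) ** 2 for y in price)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
std_error = ((s_yy - b1 * s_xy) / (n - 2)) ** 0.5    # 41.330

x_p = 2000
y_hat = b0 + b1 * x_p                                # 317.78
T_CRIT = 2.306                                       # t(.025, df = 8)
margin = T_CRIT * std_error * (1 + 1 / n + (x_p - x_bar) ** 2 / s_xx) ** 0.5
pi_lower, pi_upper = y_hat - margin, y_hat + margin  # 215.50 to 420.07
```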
Residual Analysis Purposes
- Check the linearity assumption
- Check the constant-variability assumption for all levels of predicted Y
- Check the normal-residuals assumption
- Check for independence over time

Graphical Analysis of Residuals
- Can plot residuals vs. x and predicted Y
- Can create a normal probability plot (NPP) of residuals to check for normality (or use skewness/kurtosis)
- Can check the D-W statistic to confirm independence
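Two of those checks are easy to compute by hand: least-squares residuals always sum to (essentially) zero, and the Durbin-Watson statistic is a ratio of squared successive differences to squared residuals. A sketch on the house-price fit (that sample is cross-sectional, not time-ordered, so the D-W value is shown only to illustrate the computation):

```python
# Residual checks for the house-price regression.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
x_bar, y_bar = sum(sqft) / n, sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Least-squares residuals always sum to (essentially) zero.
residuals = [y - (b0 + b1 * x) for x, y in zip(sqft, price)]

# Durbin-Watson statistic: always falls between 0 and 4, near 2 when
# successive residuals are independent.
dw = (sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, n))
      / sum(e ** 2 for e in residuals))
```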
Residual Analysis for Linearity
[Two pairs of plots: when the relationship is not linear, the residuals versus x show a curved pattern; when it is linear, the residuals scatter randomly around zero.]
Residual Analysis for Constant Variance
[Two pairs of plots: with non-constant variance, the spread of the residuals versus Ŷ fans out; with constant variance, the residuals show a uniform spread.]
Residual Analysis for Normality
- Can create an NPP of residuals to check for normality: if you see an approximate straight line, the residuals are acceptably normal.
- You can also use skewness/kurtosis: if both are within ±1, the residuals are acceptably normal.

Residual Analysis for Independence
- Can check the D-W statistic to confirm independence: if the D-W statistic is greater than 1.3, the residuals are acceptably independent.
- Needed only if the data are collected over time.
Checking Unusual Data Points Check for outliers from the predicted values
(studentized and studentized deleted residuals do this; MegaStat highlights in blue)
Check for outliers on the X-axis; they are indicated by large leverage values; more than twice as large as the average leverage. MegaStat highlights in blue.
Check Cook’s Distance which measures the harmful influence of a data point on the equation by looking at residuals and leverage together. Cook’s D > 1 suggests potentially harmful data points and those points should be checked for data entry error. MegaStat highlights in blue based on F distribution values.
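In simple regression both diagnostics have closed forms: leverage hi = 1/n + (xi − x̄)²/Sxx (leverages always sum to k + 1 = 2), and Cook's Di = ei² · hi / ((k + 1) · s² · (1 − hi)²). A sketch against the house-price data:

```python
# Leverage and Cook's distance for the house-price regression (k = 1 predictor).
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
x_bar, y_bar = sum(sqft) / n, sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
s_yy = sum((y - y_bar) ** 2 for y in price)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
mse = (s_yy - b1 * s_xy) / (n - 2)             # s², the residual mean square

# Leverage: h_i = 1/n + (x_i - x_bar)^2 / S_xx; leverages sum to k + 1 = 2.
leverage = [1 / n + (x - x_bar) ** 2 / s_xx for x in sqft]
avg_leverage = sum(leverage) / n               # (k + 1)/n = 0.2 here
high_leverage = [x for x, h in zip(sqft, leverage) if h > 2 * avg_leverage]

# Cook's D: D_i = e_i^2 * h_i / ((k + 1) * s^2 * (1 - h_i)^2); D > 1 is suspect.
residuals = [y - (b0 + b1 * x) for x, y in zip(sqft, price)]
cooks_d = [e ** 2 * h / (2 * mse * (1 - h) ** 2)
           for e, h in zip(residuals, leverage)]
```

Here only the 2,450 ft² house exceeds twice the average leverage, and every Cook's D is below 1, so no point looks harmfully influential.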
Patterns of Outliers
a) Outlier is extreme in both X and Y but not in the overall pattern. The point is unlikely to alter the regression line.
b) Outlier is extreme in both X and Y as well as in the overall pattern. This point will strongly influence the regression line.
c) Outlier is extreme for X, nearly average for Y. The farther it is from the pattern, the more it will change the regression.
d) Outlier is extreme in Y, not in X. The farther it is from the pattern, the more it will change the regression.
e) Outlier is extreme relative to the pattern, but not in X or Y. The slope may not change much, but the intercept will be higher with this point included.
Summary
- Introduced simple linear regression analysis
- Calculated the coefficients for the simple linear regression equation and measures of strength (r, R², and se)
- Described inference about the slope
- Addressed prediction of individual values
- Discussed residual analysis to address the assumptions of regression and correlation
- Discussed checks for unusual data points