Date post: | 07-Jul-2016 |
Upload: | earl-kristof-li-liao |
Simple Linear Regression
• Regression is a statistical method that represents the relationship between two variables by approximating it with a straight line
– Since not all relationships are linear (straight line), simple LR only works well for bivariate data that have a linear relationship
– Regression analysis develops a linear equation showing how the two variables are related
• Requires a Y-intercept (b0) and a slope (b1)
Computing Regression Line
[Figure: scatter plots of data versus time, both axes running from -1 to 6; the last panel marks the residuals e1, e2, e3 and e4]
Computing Regression Line: Least Squares Line
• The least squares line is the straight line that best passes through the points of a scatter diagram
• The least squares line is the line through the data that minimizes the sum of the squared differences between the observations and the line (these differences are commonly called residuals):
$$\sum e^2 = e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2$$
Sum of Squares of Error
• How can we determine the sum of squares of error?
– By using the following formulas; remember, the regression line minimizes this value (SSE)
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$\hat{y}_i = b_0 + b_1 x_i$$
$\hat{y}$ is the y value from the regression line
$y$ is the actual observed y value
Least Squares Formula
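$$b_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$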
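The two forms of the slope formula can be checked against each other with a minimal sketch. The function names and the five data points are hypothetical, chosen to lie exactly on y = 1 + 2x so the result is easy to verify by hand.

```python
def fit_line(x, y):
    """Return (b0, b1) using the deviation form b1 = Sxy / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def slope_from_raw_sums(x, y):
    """Return b1 using the raw-sum form of the same formula."""
    n = len(x)
    numerator = n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)
    denominator = n * sum(xi ** 2 for xi in x) - sum(x) ** 2
    return numerator / denominator


# Hypothetical data lying exactly on the line y = 1 + 2x
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]

b0, b1 = fit_line(x, y)
b1_raw = slope_from_raw_sums(x, y)
print(b0, b1, b1_raw)
```

Both forms should agree up to floating-point rounding; the deviation form is usually the numerically safer choice when the x values are large.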
Straight-Line Relationship
• Y = b0 + b1X
• b0 represents the Y-intercept, which is the value of Y if X = 0.
• b1 is the slope of the line, which is the amount of change in Y for a unit change in X.
Assumptions of Simple LR
[Figure: PDF of Y at x = 1, 2, 3 and 4, labeled Y/X=1, Y/X=2, Y/X=3 and Y/X=4]
Note
$$E(Y \mid x) = \beta_0 + \beta_1 x \qquad \hat{y} = b_0 + b_1 x$$
$$y = \beta_0 + \beta_1 x + \varepsilon \qquad y = \hat{y} + e = b_0 + b_1 x + e$$
The first pair is a deterministic model representation for the mean of Y given X, but not for the actual y value. In any case, the derived LR equation, as long as it is proven stable and reliable, could be used to predict either the average y value or the actual y value.
Assumptions of Simple LR
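$E(\varepsilon \mid X) = 0$: zero mean
$Var(\varepsilon \mid X) = \sigma^2$: constant, homogeneous/homoscedastic variance
$\varepsilon$ is normally distributed
The value of $\varepsilon$ associated with any particular value of Y is independent of the $\varepsilon$ associated with any other value of Y, as if the errors come from a random sample:
$$Y_1 = \beta_0 + \beta_1 x_1 + \varepsilon_1 \quad \text{and} \quad Y_2 = \beta_0 + \beta_1 x_2 + \varepsilon_2, \quad \varepsilon_1 \text{ and } \varepsilon_2 \text{ are independent}$$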
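Another Note
$$\varepsilon \mid X \sim N(0, \sigma^2) \quad \Rightarrow \quad Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$$
An unbiased estimate of $\sigma^2$ is:
$$s^2 = \frac{SSE}{n-2} = \frac{S_{yy} - b_1 S_{xy}}{n-2}, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$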
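Inferences on Regression Coefficients
On $\beta_1$ using $b_1$:
Using confidence interval:
$$b_1 \pm t_{\alpha/2} \frac{s}{\sqrt{S_{xx}}}$$
Using hypothesis testing (with $\beta_1$ set to the value stated in Ho):
$$t = \frac{b_1 - \beta_1}{s/\sqrt{S_{xx}}}$$
On $\beta_0$ using $b_0$:
Using confidence interval:
$$b_0 \pm t_{\alpha/2}\, s \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n S_{xx}}}$$
Using hypothesis testing (with $\beta_0$ set to the value stated in Ho):
$$t = \frac{b_0 - \beta_0}{s \sqrt{\sum_{i=1}^{n} x_i^2 / (n S_{xx})}}$$
v = n-2 for all inferences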
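As a rough illustration of the two confidence intervals, the sketch below uses the six (experience, salary) pairs read from the salary example that appears later in these slides. The 95% confidence level and the tabled value t(0.025, v = 4) = 2.776 are assumptions made for the illustration; the variable names are hypothetical.

```python
import math

# (experience in years, salary in $ thousand) from the salary example table
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt((s_yy - b1 * s_xy) / (n - 2))  # s^2 = SSE / (n - 2)

T_CRIT = 2.776  # assumed: t table value for alpha/2 = 0.025, v = n - 2 = 4

se_b1 = s / math.sqrt(s_xx)
se_b0 = s * math.sqrt(sum(xi ** 2 for xi in x) / (n * s_xx))

ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)
ci_b0 = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)
print("slope CI:", ci_b1)
print("intercept CI:", ci_b0)
```

With only six observations the intervals are wide; in this data the slope interval stays above zero while the intercept interval includes zero.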
Hypothesis Test on the Slope of the Regression Line
• How do we know that a significant linear relationship exists using regression?
– The slope, b1, will give us an indication.
• Therefore, if the slope of the least squares line is zero, there is no linear relationship. However, if the slope of the least squares line is significantly greater than 0 or significantly less than 0, then we can conclude that a linear relationship exists
• Therefore, we want to test the following hypothesis:
Ho: $\beta_1 = 0$ (X provides no information)
Ha: $\beta_1 \neq 0$ (X does provide information)
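Hypothesis Test on the Slope of the Regression Line (continued)
Test statistic in all three cases:
$$t^* = \frac{b_1}{s_{b_1}}$$
Ho: $\beta_1 = 0$, Ha: $\beta_1 \neq 0$: Reject Ho if $|t^*| > t_{\alpha/2,\, n-2}$
Ho: $\beta_1 \le 0$, Ha: $\beta_1 > 0$: Reject Ho if $t^* > t_{\alpha,\, n-2}$
Ho: $\beta_1 \ge 0$, Ha: $\beta_1 < 0$: Reject Ho if $t^* < -t_{\alpha,\, n-2}$
Note: df = n-2 and $s_{b_1} = s/\sqrt{S_{xx}}$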
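A minimal sketch of the two-sided slope test, using the (experience, salary) pairs from the salary example later in these slides. The significance level of 0.05 and the tabled critical value 2.776 (v = 4) are assumptions for illustration.

```python
import math

# (experience in years, salary in $ thousand) from the salary example table
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
s = math.sqrt((s_yy - b1 * s_xy) / (n - 2))

s_b1 = s / math.sqrt(s_xx)  # standard error of the slope
t_star = b1 / s_b1          # test statistic for Ho: beta1 = 0

T_CRIT = 2.776  # assumed: t table value for alpha/2 = 0.025, v = n - 2 = 4
reject_h0 = abs(t_star) > T_CRIT
print(round(t_star, 3), reject_h0)
```

For a one-sided test, the comparison would use the one-tail table value at alpha instead of alpha/2.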
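Prediction
Confidence interval on the mean response $\mu_{Y \mid x_0}$:
$$\hat{y}_0 \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
Prediction interval on a single response $y_0$:
$$\hat{y}_0 \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$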
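The difference between the two intervals can be seen in a short sketch evaluated at x0 = 20 years, using the (experience, salary) pairs from the salary example later in these slides. The 95% level and the tabled value 2.776 (v = 4) are assumptions for illustration.

```python
import math

# (experience in years, salary in $ thousand) from the salary example table
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt((s_yy - b1 * s_xy) / (n - 2))

T_CRIT = 2.776  # assumed: t table value for alpha/2 = 0.025, v = n - 2 = 4
x0 = 20
y0_hat = b0 + b1 * x0

leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
ci_half = T_CRIT * s * math.sqrt(leverage)      # mean response
pi_half = T_CRIT * s * math.sqrt(1 + leverage)  # single response

print("mean response:", (y0_hat - ci_half, y0_hat + ci_half))
print("single response:", (y0_hat - pi_half, y0_hat + pi_half))
```

The prediction interval is always the wider of the two, because a single response carries its own error variance in addition to the uncertainty in the fitted line.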
Example: Salary and Experience
• Salary vs. Years Experience
– For n = 6 employees
– Linear (straight line) relationship
– Increasing relationship
• higher salary generally goes with higher experience
– Correlation r = 0.8667
Experience (years) | Salary ($ thousand)
15 | 30
10 | 35
20 | 55
5 | 22
15 | 40
5 | 27
[Figure: scatter plot of Salary ($ thousand) versus Experience. Annotation: Mary earns $55,000 per year, and has 20 years of experience]
• Summarizes bivariate data: Predicts Y from X
– with smallest errors (in vertical direction, for Y axis)
– Intercept is 15.32 salary (at 0 years of experience)
– Slope is 1.673 salary (for each additional year of experience, on average)
[Figure: scatter plot of Salary (Y) versus Experience (X) with the least squares line]
Salary = 15.32 + 1.673 Experience
Y = b0 + b1X
Example: Salary and Experience
Predicted Values and Residuals
• Predicted value comes from the prediction equation
Ŷ = b0 + b1X = 15.32 + 1.673X
• For example, Mary (with 20 years of experience) has a predicted salary = 15.32 + 1.673(20) = 48.8
• So does anyone with 20 years of experience
• Residual is the actual Y minus predicted Y (Y – Ŷ)
– Mary’s residual is 55 – 48.8 = 6.2
• She earns about $6,200 more than the predicted salary for a person with 20 years of experience
• A person who earns less than predicted will have a negative residual
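The arithmetic above can be reproduced with a few lines. The coefficients are the rounded values quoted in the example; the function names and the second employee (20 years, 40 thousand) are hypothetical, added only to show a negative residual.

```python
B0 = 15.32  # intercept quoted in the example ($ thousand)
B1 = 1.673  # slope quoted in the example ($ thousand per year)


def predict_salary(experience_years):
    """Predicted salary from the fitted line, in $ thousand."""
    return B0 + B1 * experience_years


def residual(actual_salary, experience_years):
    """Actual minus predicted salary, in $ thousand."""
    return actual_salary - predict_salary(experience_years)


mary_predicted = predict_salary(20)
mary_residual = residual(55, 20)
# Hypothetical employee with 20 years of experience earning 40 thousand
other_residual = residual(40, 20)

print(round(mary_predicted, 2), round(mary_residual, 2), round(other_residual, 2))
```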
Predicted and Residual (continued)
[Figure: scatter plot of Salary versus Experience with the least squares line. Annotations: Mary earns 55 thousand; Mary’s predicted value is 48.8; Mary’s residual is 6.2 (55 – 48.8)]
Simple Linear Regression Model
• When we use a straight line to make predictions, we use a statistical model of the form:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
– $\beta_0 + \beta_1 X$: assumed line about which all values of X and Y will fall
– $\varepsilon$: error; contains all other variability not explained by the independent variable (X)
Note: $\beta_0$ and $\beta_1$ refer to the straight line for the population; we will be using sample data and will use b0 and b1 to refer to the straight line for the sample
Error Variance
• The measures most commonly used to describe how well a line fits through a set of points are the error variance and the error standard deviation
$$s^2 = \frac{SSE}{n-2} \qquad s = \sqrt{\frac{SSE}{n-2}}$$
• What is s?
• Measure of the variation of the Y values around the least squares line
• Average distance of prediction from actual
• Average size of residuals
• Standard deviation of residuals
• Interpretation: similar to standard deviation
• Can move least-squares line up and down by s
– About 68% of the data are within one “standard error of estimate” of the least-squares line
• (For a bivariate normal distribution)
[Figure: scatter plot of Salary versus Experience showing the band between (least-squares line) + s and (least-squares line) – s]
Error Variance
Example: Salary and Experience
• Regression and Prediction Error
– Predicting Y as Ȳ (not using regression)
• Errors are approximately S_Y = 11.686
– Predicting Y as b0 + b1X (using regression)
• Errors are approximately s = 6.52
• Errors are smaller when regression is used!
– This is often the true payoff for using regression
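The two error sizes quoted for the salary example can be recomputed from the six (experience, salary) pairs read from the example's data table. This is a sketch with hypothetical variable names; S_Y is taken to be the ordinary sample standard deviation of the salaries (divisor n - 1), which reproduces the quoted 11.686.

```python
import math

# (experience in years, salary in $ thousand) from the salary example table
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
sse = s_yy - b1 * s_xy

# Typical error when predicting every salary as the mean salary
s_y = math.sqrt(s_yy / (n - 1))
# Typical error when predicting from the regression line
s = math.sqrt(sse / (n - 2))

print(round(s_y, 3), round(s, 2))
```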
Measuring the Strength of the Model
• Another item of interest is to determine how well the regression model fits the data
• To determine this, we use the coefficient of determination (r²), which gives the percentage of explained variation in the dependent variable using the model.
Coefficient of Determination
$$r^2 = 1 - \frac{SSE}{S_{YY}}$$
r² = coefficient of determination = percentage of explained variation in the dependent variable using the simple linear regression model
Taking the square root of the coefficient of determination gives the correlation coefficient, r.
Correlation Coefficient
• The sample correlation coefficient, r, measures the strength of the linear relationship that exists within a sample of n bivariate data.
$$r = b_1 \sqrt{\frac{S_{xx}}{S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
Note: When one computes r by taking the square root of r², affix the sign of b1 to the final value.
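As a check on these formulas, the sketch below computes r² and r three ways for the six (experience, salary) pairs read from the salary example table. Variable names are hypothetical; `math.copysign` is used to affix the sign of b1 to the square root of r².

```python
import math

# (experience in years, salary in $ thousand) from the salary example table
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
sse = s_yy - b1 * s_xy

r_squared = 1 - sse / s_yy
r_from_sums = s_xy / math.sqrt(s_xx * s_yy)
r_from_slope = b1 * math.sqrt(s_xx / s_yy)
# Square root of r^2 with the sign of b1 affixed
r_from_r_squared = math.copysign(math.sqrt(r_squared), b1)

print(round(r_squared, 4), round(r_from_sums, 4))
```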
Interpreting the Correlation Coefficient
• If r = 1 then X and Y have a perfect positive linear relationship.
• If r = -1 then X and Y have a perfect negative linear relationship.
• If r = 0 then X and Y have no linear relationship.
• If 0 < r < 1 then X and Y are positively related. The closer to 1, the stronger the linear relationship.
• If 0 > r > -1 then X and Y are negatively related. The closer to -1, the stronger the linear relationship.
Correlation Coefficient Summary
• r ranges from -1.0 to 1.0.
• The larger | r | is, the stronger the linear relationship.
• r near zero indicates that there is no linear relationship: X and Y are uncorrelated.
• The sign of r tells you whether the relationship between X and Y is a positive or a negative relationship.
• The value of r tells you very little about the slope of the line, except that if r is positive the slope of the line is positive, and if r is negative the slope is negative.
Examples: Interpreting Correlation
• rxy = 1
• A perfect straight line tilting up to the right
• rxy = 0
• No overall tilt
• No linear relationship?
• rxy = – 1
• A perfect straight line tilting down to the right
[Figure: scatter plots of Y versus X illustrating each case and various values of rxy]
Significance Test for the Correlation
• Due to sampling error, the value of r may not reflect the true relationship in the entire population, especially if the sample is quite small
• Therefore, a formal hypothesis test may be needed
• Hypothesis tested:
– Ho: $\rho = 0$ (no correlation)
– Ha: $\rho \neq 0$ (correlation exists)
Note: $\rho$ = population correlation coefficient
Hypothesis Test
Ho: $\beta_1 = 0$ (no linear relationship exists)
Ha: $\beta_1 \neq 0$ (a linear relationship exists)
$$t^* = \frac{b_1}{s_{b_1}}$$
Reject Ho if $|t^*| > t_{\alpha/2,\, n-2}$
Another way to test whether a linear relationship exists between the two variables of interest is to use the relationship between b1 and r (they are closely related):
$$t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
This gives exactly the same value for t* as $b_1 / s_{b_1}$.
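The equivalence of the two test statistics can be confirmed numerically. The sketch below uses the six (experience, salary) pairs read from the salary example table; variable names are hypothetical.

```python
import math

# (experience in years, salary in $ thousand) from the salary example table
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
s = math.sqrt((s_yy - b1 * s_xy) / (n - 2))
r = s_xy / math.sqrt(s_xx * s_yy)

t_from_slope = b1 / (s / math.sqrt(s_xx))
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(t_from_slope, 4), round(t_from_r, 4))
```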
Hypothesis Test Continued
• If one desires to carry out the general test:
Ho: $\rho = \rho_0$
Ha: $\rho \neq \rho_0$ / $\rho > \rho_0$ / $\rho < \rho_0$
One can use:
$$z = \frac{\sqrt{n-3}}{2} \ln\left[\frac{(1+r)(1-\rho_0)}{(1-r)(1+\rho_0)}\right]$$
This works on the assumption that X and Y follow a bivariate normal distribution.
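A minimal sketch of this z statistic, evaluated with the r = 0.8667 and n = 6 quoted in the salary example. The function name, the choice rho0 = 0, and the two-sided 5% critical value 1.96 are assumptions for illustration; with n this small the normal approximation is rough.

```python
import math


def correlation_z(r, n, rho0=0.0):
    """z statistic for Ho: rho = rho0, from the log-ratio formula above."""
    ratio = ((1 + r) * (1 - rho0)) / ((1 - r) * (1 + rho0))
    return (math.sqrt(n - 3) / 2) * math.log(ratio)


z = correlation_z(0.8667, 6)        # r and n quoted in the salary example
z_null = correlation_z(0.5, 10, 0.5)  # r equal to rho0 gives z = 0

Z_CRIT = 1.96  # assumed: two-sided critical value at alpha = 0.05
print(round(z, 3), abs(z) > Z_CRIT)
```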
Exercise Problems
• Problem #5, p. 359 (manual)
• Problem #6, p. 359 (excel)
• Problem #7, p. 359 (excel)
• Problem #7, p. 371
• Problems #1 and 2, p. 396
• Problem #5, p. 380
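Check for Model Significance and Adequacy
• The ANOVA approach:
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$SST = SSR + SSE$$
$$S_{yy} = b_1 S_{xy} + SSE$$
Similar to testing:
Ho: $\beta_1 = 0$
Ha: $\beta_1 \neq 0$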
Check for Model Significance and Adequacy (continued)
The ANOVA table:
Sources of Variation | Sum of Squares | Dof | Mean Square | Comp. F
Regression | SSR | 1 | SSR | SSR/$s^2$
Error | SSE | n-2 | $s^2$ = SSE/(n-2) |
Total | SST | n-1 | |
Reject Ho if computed F > $F_{\alpha}(1, n-2)$
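The entries of the ANOVA table can be computed in a short sketch, here for the six (experience, salary) pairs read from the salary example table. The 5% level and the tabled critical value F(0.05; 1, 4) = 7.71 are assumptions for illustration; variable names are hypothetical.

```python
# (experience in years, salary in $ thousand) from the salary example table
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx

sst = s_yy            # total sum of squares, n - 1 dof
ssr = b1 * s_xy       # regression sum of squares, 1 dof
sse = sst - ssr       # error sum of squares, n - 2 dof
mse = sse / (n - 2)   # s^2
f_computed = ssr / mse

F_CRIT = 7.71  # assumed: F table value for alpha = 0.05 with (1, 4) dof
print(round(f_computed, 2), f_computed > F_CRIT)
```

In simple linear regression the computed F equals the square of the t* statistic for the slope, so the two tests lead to the same conclusion.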
Check for Model Significance and Adequacy (continued)
• If repeated observations are made at several X values, the SSE term shown previously can be further divided into Error-Lack of Fit and Error-Pure Experimental.
• Computational formula:
$Y_{ij}$ = the jth value of the random variable $Y_i$
$$Y_{i.} = T_{i.} = \sum_{j=1}^{n_i} y_{ij} \qquad \bar{Y}_{i.} = \frac{T_{i.}}{n_i}$$
$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n_i - 1}$$
$$s^2 = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{n-k} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n-k}$$
$n_i$ = no. of observations at $x_i$
k = no. of distinct values of x
$s^2$ = pure experimental error mean square = SSE(pure)/(n-k)
SS(Lack of Fit) = SSE - SSE(pure)
Check for Model Significance and Adequacy (continued)
The ANOVA table becomes:
Sources of Variation | Sum of Squares | Dof | Mean Square (MS) | Comp. F
Regression | SSR | 1 | SSR | SSR/$s^2$
Error | SSE | n-2 | |
Lack of Fit | SSE - SSE(pure) | k-2 | [SSE - SSE(pure)]/(k-2) | MS(lack of fit)/$s^2$
Pure Error | SSE(pure) | n-k | $s^2$ |
Total | SST | n-1 | |
Model significant if $F_{reg} > F(1, n-k)$
Model adequate if $F_{lack\ of\ fit} < F(k-2, n-k)$
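The split of SSE into pure error and lack of fit can be sketched with the six (experience, salary) pairs read from the salary example table, which happen to contain repeated observations at 5 and 15 years. Treating that example as a lack-of-fit exercise is an assumption made here for illustration, as is the 5% level with the tabled value F(0.05; 2, 2) = 19.00; variable names are hypothetical.

```python
from collections import defaultdict

# (experience in years, salary in $ thousand) from the salary example table
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sse = s_yy - (s_xy / s_xx) * s_xy

# Group the responses by distinct x value
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
k = len(groups)

# Pure error: squared deviations of each response from its own group mean
sse_pure = 0.0
for responses in groups.values():
    group_mean = sum(responses) / len(responses)
    sse_pure += sum((yij - group_mean) ** 2 for yij in responses)

s2_pure = sse_pure / (n - k)         # pure error mean square
ss_lack_of_fit = sse - sse_pure
ms_lack_of_fit = ss_lack_of_fit / (k - 2)
f_lack_of_fit = ms_lack_of_fit / s2_pure

F_CRIT = 19.00  # assumed: F table value for alpha = 0.05 with (2, 2) dof
print(round(f_lack_of_fit, 3), f_lack_of_fit < F_CRIT)
```

With only two degrees of freedom for pure error the test has little power, so a non-significant lack of fit here is weak evidence of adequacy.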
Model Adequacy
• Significant lack of fit means that considerable variation is being caused by higher-order terms; these are terms in x other than the linear or first-order terms.
• Illustration given in Figures 11.11 and 11.12, pp. 378-379 of book
Checking Model Assumptions
1. The errors are normally distributed with a mean of zero.
• Construct a normal probability plot (plot of residuals)
• If the resulting graph is linear, the normality assumption is verified
• Conduct a goodness-of-fit test: chi-square, Kolmogorov-Smirnov (KS), Shapiro-Wilk
• Statistical test on kurtosis and skewness
2. The variance of the error component is the same for each value of X.
• Plot residuals against the independent variable, X
• If no pattern exists, this assumption holds
3. The errors are independent of each other.
• Look for autocorrelation
• Plot sample residuals by time (time series analysis)
Normal Probability Plot
• Compares the cumulative distribution of actual data values with the cumulative distribution of a normal distribution. (If the data are normal, the points should fall around the diagonal straight line.)
[Figure: deviations from normality]
Statistical Checking for Normality
• Deviations from Normality
Kurtosis: refers to the “peakedness” or “flatness” of the distribution. If normal, (excess) kurtosis is zero.
Skewness: deals with the symmetry of the distribution; a skewed variable is a variable whose mean is not in the center of the distribution. If normal, skewness is zero.
Checking for Normality
• Statistical Test for Kurtosis
$$z = \frac{\text{kurtosis}}{\sqrt{24/N}}$$
• Statistical Test for Skewness
$$z = \frac{\text{skewness}}{\sqrt{6/N}}$$
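A small sketch of both z statistics. The slides do not say how skewness and kurtosis are estimated, so the simple moment estimators below (with 3 subtracted to give excess kurtosis) are an assumption, as are the function name and the five residual values.

```python
import math


def normality_z_scores(values):
    """Return (z_skewness, z_kurtosis) using moment-based estimators."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0  # excess kurtosis: zero for a normal
    z_skewness = skewness / math.sqrt(6 / n)
    z_kurtosis = kurtosis / math.sqrt(24 / n)
    return z_skewness, z_kurtosis


# Hypothetical residuals, symmetric about zero
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
z_skew, z_kurt = normality_z_scores(residuals)
print(round(z_skew, 3), round(z_kurt, 3))
```

Each z value would then be compared with a standard normal critical value (for example 1.96 at the 5% level); statistical packages may use slightly different small-sample estimators.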
Checking for Normality (continued)
• Other Tools
Histogram (good for numerous data points)
Goodness of Fit Tests (good for many data points, 30 or more, but overly sensitive for very large samples, 1000 or more)
Checking Model Assumptions
[Figure: Equal Variance. Residual plot A: assumption holds; residual plot B: assumption violated]
Checking Model Assumptions (continued)
[Figure: autocorrelation. Autocorrelation exists, a violation of the errors being independent of each other]
Importance of Assumptions
• Normality: needed for the t-test and F-test.
• Homoscedasticity: needed for the t-test and F-test, and to ensure that the variance used in explanation and prediction is distributed across the range of independent variable values
• Absence of Correlated Errors: gives confidence that prediction errors are independent of the levels at which one is trying to predict, and assurance that no other systematic variable is affecting the results while being left out of the analysis
On Violation of Assumptions
• One violation can be the result of another. Example: non-normality is linked to, or can be the result of, non-constant variance.
• A remedy applied to one can solve another.
• Remedy available: data or variable transformation
Notes on Transformation
• Two Purposes:
1. Correct violations of statistical assumptions.
2. Improve correlation between variables.
• How to Choose?
1. Theoretical basis: nature of data (e.g. sqrt transform works well with frequency count data, arcsin transform for proportion data)
2. Trial and Error
• Not a magic cure for all violations. Will not eliminate all violations but could lead to very significant improvements.
Suggested Transforms for Non-normality
Note: Inverse transform usually works well with “flat” distributions.
Suggested Transforms for Heteroscedasticity
• If cone opens to the right – try inverse transform
• If cone opens to the left – try square root transform
Some General Guidelines on Transformation
• For noticeable effect of transformation, ratio of variable mean to std. dev < 4.
• If transformation can be performed on two variables (in non-linearity) , select variable with smallest average to s ratio.
• Transformation should be applied to IVs except for cases of heteroscedasticity. If relationship is heteroscedastic and non-linear there might be a need to transform both IV and DV.
• Transformation may change the interpretation of variables. Be careful!
Suggested Transforms for Non-linearity
In any of the given illustrations, transformation could be carried out on either the independent or dependent variable. When multiple transformation possibilities are shown, start with the top method in each quadrant and move downward until linearity is achieved.
Simple Non-linear Regression by Linearization
Function | Proper Transformation | Form of Simple Linear Regression
Exponential: $Y = \alpha e^{\beta x}$ | Y* = ln y | Regress y* against x
Power: $Y = \alpha x^{\beta}$ | Y* = log y, X* = log x | Regress y* against x*
Reciprocal: $Y = \alpha + \beta(1/X)$ | X* = 1/X | Regress y against x*
Hyperbolic: $Y = x/(\alpha + \beta x)$ | Y* = 1/Y, X* = 1/X | Regress y* against x*
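The exponential row of the table can be illustrated with a minimal sketch. The data are generated (not observed) from the hypothetical values alpha = 2 and beta = 0.5 with no error term, so regressing ln y against x should recover them; the function name is also hypothetical.

```python
import math


def fit_line(x, y):
    """Return (b0, b1) of the least squares line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Hypothetical noise-free data from Y = alpha * exp(beta * x)
ALPHA, BETA = 2.0, 0.5
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [ALPHA * math.exp(BETA * xi) for xi in x]

# Transform: y* = ln y, so that y* = ln(alpha) + beta * x
y_star = [math.log(yi) for yi in y]
b0, b1 = fit_line(x, y_star)

alpha_hat = math.exp(b0)  # back-transform the intercept
beta_hat = b1
print(round(alpha_hat, 6), round(beta_hat, 6))
```

With real, noisy data the back-transformed fit minimizes error on the log scale rather than on the original scale, so residuals are best examined in the original units of the response.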
Notes on Non-Linear Regression by Linearization
• Model in the transformed variables that has a proper additive error structure is a result of a model in the natural variables with a different type of error structure.
• Performance criteria (s2 and R2) for the transformed model should be based on values of the residuals in the metric of the untransformed response.
Sample Problem to be given in class.
Obtaining the Regression Output
• To fit a linear regression using Excel
– Choose Data Analysis, then Regression
– Choose the two data columns for which the regression is to be calculated
– The Y variable will be on the vertical axis
– Click residual plot and normal probability plot to check assumptions
• Statistica could also be used.
Caveats in Simple LR
• Linear Model May Be Wrong
– Nonlinear? Unequal variability? Clustering?
• Predicting Intervention from Experience is Hard
– Relationship may become different if you intervene
• Intercept May Not Be Meaningful
– if there are no data near X = 0
• Explaining Y from X vs. Explaining X from Y
– Use care in selecting the Y variable to be predicted
• Is there a hidden “Third Factor”?
– Use it to predict better with multiple regression