Regression In Excel
Contents
Regression Model
Regression Analysis in Excel
Simple Linear Regression
Correlation
How To Do A Regression in Excel
Slope
Intercept
ANOVA
References
Regression Model
A multiple regression model is:
y = β1 + β2x2 + β3x3 + u
such that:
• y is the dependent variable
• x2 and x3 are independent variables
• β1 is the constant (intercept)
• β2 and β3 are regression coefficients
It is assumed that the error u is independent with constant variance.
We wish to estimate the regression line:
y = b1 + b2x2 + b3x3
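As a cross-check outside Excel, the least squares estimates b1, b2, b3 can be computed with NumPy; a minimal sketch using made-up data (the x2, x3, y values below are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical data generated exactly as y = 2 + 3*x2 + 0.5*x3,
# so least squares should recover b1 = 2, b2 = 3, b3 = 0.5.
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x3 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 2 + 3 * x2 + 0.5 * x3

# Design matrix with a column of ones for the constant b1
X = np.column_stack([np.ones_like(x2), x2, x3])

# Least squares estimates b = (b1, b2, b3)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```

This is the same least squares fit the Data Analysis Regression tool performs, just without the formatted output tables.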
Regression Analysis in Excel
We do this using the Data Analysis Add-in and its Regression tool.
Example:
The regression output has three components:
• the regression statistics table
• the ANOVA table
• the regression coefficients table
Interpreting the Regression Statistics Table
The standard error here refers to the estimated standard deviation of the error term u.
It is sometimes called the standard error of the regression. It equals sqrt(SSE/(n-k)).
It is not to be confused with the standard error of y itself (from descriptive statistics) or with the standard errors of the regression coefficients given below.
R2 = 0.8025 means that 80.25% of the variation of yi around its mean is explained by the regressors x2i and x3i.
Contd…..
The regression output of most interest is the following table of coefficients and associated output:
Contd….
Let βj denote the population coefficient of the jth regressor (intercept, HH SIZE and CUBED HH SIZE). Then
Column "Coefficient" gives the least squares estimates of βj.
Column "Standard error" gives the standard errors (i.e., the estimated standard deviations) of the least squares estimates bj of βj.
Column "t Stat" gives the computed t-statistic for H0: βj = 0 against Ha: βj ≠ 0. This is the coefficient divided by the standard error. It is compared to a t with (n-k) degrees of freedom where here n = 5 and k = 3.
Column "P-value" gives the p-value for the test of H0: βj = 0 against Ha: βj ≠ 0. This equals Pr{|t| > t-Stat}, where t is a t-distributed random variable with n-k degrees of freedom and t-Stat is the computed value of the t-statistic given in the previous column. Note that this p-value is for a two-sided test. For a one-sided test, divide this p-value by 2 (also checking the sign of the t-Stat).
Columns "Lower 95%" and "Upper 95%" define a 95% confidence interval for βj.
Contd……
A simple summary of the previous output is that the fitted line is:
y = 0.8966 + 0.3365x2 + 0.0021x3 (where x2 = HH SIZE and x3 = CUBED HH SIZE)
Regression and Correlation
These are techniques used to establish whether there is a mathematical relationship between two or more variables, so that the behavior of one variable can be used to predict the behavior of the others. They are applicable to "variables" (continuous) data only.
• "Regression" provides a functional relationship (Y = f(X)) between the variables; the function represents the "average" relationship.
• "Correlation" tells us the direction and the strength of the relationship.
The analysis starts with a scatter plot of Y vs. X.
Simple Linear Regression
What is it? It determines whether Y depends on X and provides a mathematical equation for the relationship (continuous data).
Examples:
• process conditions and product properties
• sales and advertising budget
(Figure: scatter of y vs. x) Does Y depend on X? Which line is correct?
Simple Linear Regression
A simple linear relationship can be described mathematically by
Y = mX + b
where:
• b = Y intercept = the Y value at the point where the line intersects the Y axis
• m = slope = rise / run
Simple Linear Regression
(Figure: a line through the points (4, 3) and (10, 6))
slope = rise / run = (6 − 3) / (10 − 4) = 1/2
intercept = 1
Y = 0.5X + 1
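The same computation as a sketch in Python, assuming (as the figure suggests) that the line passes through the points (4, 3) and (10, 6):

```python
# Slope and intercept from two points on the line (assumed from the figure)
x1, y1 = 4.0, 3.0
x2, y2 = 10.0, 6.0

slope = (y2 - y1) / (x2 - x1)   # rise / run = (6 - 3) / (10 - 4) = 0.5
intercept = y1 - slope * x1     # b = 3 - 0.5 * 4 = 1.0

print(f"Y = {slope}X + {intercept}")  # Y = 0.5X + 1.0
```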
Simple Regression Example
An agent for a residential real estate company in a large city would like to predict the monthly rental cost of apartments based on apartment size, as measured by square footage. A sample of 25 apartments in a particular residential neighborhood was selected to gather the information.
Size Rent
850 950
1450 1600
1085 1200
1232 1500
718 950
1485 1700
1136 1650
726 935
700 875
956 1150
1100 1400
1285 1650
1985 2300
1369 1800
1175 1400
1225 1450
1245 1100
1259 1700
1150 1200
896 1150
1361 1600
1040 1650
755 1200
1000 800
1200 1750
The data on size and rent for the 25 apartments will be analyzed in EXCEL.
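Equivalently, the least squares fit can be reproduced outside Excel; a plain-Python sketch using the 25 size/rent pairs above:

```python
# Simple linear regression of Rent on Size, using the textbook formulas
sizes = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956, 1100, 1285,
         1985, 1369, 1175, 1225, 1245, 1259, 1150, 896, 1361, 1040, 755,
         1000, 1200]
rents = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150, 1400, 1650,
         2300, 1800, 1400, 1450, 1100, 1700, 1200, 1150, 1600, 1650, 1200,
         800, 1750]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(rents) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, rents))
sxx = sum((x - mean_x) ** 2 for x in sizes)
syy = sum((y - mean_y) ** 2 for y in rents)

b1 = sxy / sxx                 # slope
b0 = mean_y - b1 * mean_x      # intercept
r = sxy / (sxx * syy) ** 0.5   # correlation coefficient

print(f"Rent = {b0:.3f} + {b1:.3f}*Size, r = {r:.2f}, r^2 = {r*r:.2f}")
```

This should match the Excel output discussed on the following slides (slope about 1.065, intercept about 177.121, r about 0.85).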
Scatter Plot
(Scatter plot of Rent vs. Size)
Scatter plot suggests that there is a ‘linear’ relationship between Rent and Size
Interpreting EXCEL output
Regression Equation
Rent = 177.121+1.065*Size
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.85
R Square 0.72
Adjusted R Square 0.71
Standard Error 194.60
Observations 25
ANOVA
df SS MS F Significance F
Regression 1 2268776.545 2268776.545 59.91376452 7.51833E-08
Residual 23 870949.4547 37867.3676
Total 24 3139726
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 177.121 161.004 1.100 0.282669853 -155.942 510.184
Size 1.065 0.138 7.740 7.51833E-08 0.780 1.350
Interpretation of the Regression Coefficient
What does the coefficient of Size mean?
For every additional square foot, Rent goes up by $1.065 on average.
Using Regression for Prediction
Predict monthly rent when apartment size is 1000 square feet:
Regression Equation:
Rent = 177.121+1.065*Size
Thus, when Size=1000
Rent=177.121+1.065*1000=$1242 (rounded)
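The same arithmetic as a one-line sketch in Python:

```python
# Predicting rent from the fitted equation Rent = 177.121 + 1.065*Size
b0, b1 = 177.121, 1.065
size = 1000
rent = b0 + b1 * size
print(round(rent))  # 1242
```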
Using Regression for Prediction – Caution!
The regression equation is valid only over the range of X values used to estimate it: we should interpolate, not extrapolate.
Do not use the equation to predict Y for X values outside the range of the data used to develop the equation; extrapolation can be risky.
Thus, we should not use the equation to predict rent for an apartment of 500 square feet, since this value is outside the range of sizes used to create the regression equation.
Why Extrapolation is Risky
(Figure: sample data, the true relationship, and the extrapolated relationship)
In this figure, we fit our regression model using sample data, but the linear relation implicit in our regression model does not hold outside our sample. By extrapolating, we make erroneous estimates.
Correlation (r)
“Correlation coefficient”, r, is a measure of the strength and the direction of the relationship between two variables. Values of r range from +1 (very strong direct relationship), through “0” (no relationship), to –1 (very strong inverse relationship). It measures the degree of scatter of the points around the “Least Squares” regression line.
Coefficient of Correlation from EXCEL
The sign of r is the same as that of the coefficient of X (Size) in the regression equation (in our case the sign is positive). Also, if you look at the scatter plot, you will note that the sign should be positive.
r = 0.85 suggests a fairly 'strong' correlation between size and rent.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.85
R Square 0.72
Adjusted R Square 0.71
Standard Error 194.60
Observations 25
ANOVA
df SS MS F Significance F
Regression 1 2268776.545 2268776.545 59.91376452 7.51833E-08
Residual 23 870949.4547 37867.3676
Total 24 3139726
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 177.121 161.004 1.100 0.282669853 -155.942 510.184
Size 1.065 0.138 7.740 7.51833E-08 0.780 1.350
Coefficient of Determination (r2)
The "Coefficient of Determination", r-squared (sometimes R-squared), gives the proportion of the variation in Y that is attributable to variation in X.
Getting r2 from EXCEL
It is important to remember that r-squared is always positive. It is the square of the coefficient of correlation r. In our case, r² = 0.72 suggests that 72% of the variation in Rent is explained by the variation in Size. The higher the value of r², the better the simple regression model.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.85
R Square 0.72
Adjusted R Square 0.71
Standard Error 194.60
Observations 25
ANOVA
df SS MS F Significance F
Regression 1 2268776.545 2268776.545 59.91376452 7.51833E-08
Residual 23 870949.4547 37867.3676
Total 24 3139726
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 177.121 161.004 1.100 0.282669853 -155.942 510.184
Size 1.065 0.138 7.740 7.51833E-08 0.780 1.350
Standard Error (SE)
Standard error measures the variability or scatter of the observed values around the regression line.
(Scatter plot of Rent ($) vs. Size (square feet), showing the scatter of the observations around the regression line)
Getting the Standard Error (SE) from EXCEL
In our example, the standard error associated with estimating rent is $194.60.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.85
R Square 0.72
Adjusted R Square 0.71
Standard Error 194.60
Observations 25
ANOVA
df SS MS F Significance F
Regression 1 2268776.545 2268776.545 59.91376452 7.51833E-08
Residual 23 870949.4547 37867.3676
Total 24 3139726
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 177.121 161.004 1.100 0.282669853 -155.942 510.184
Size 1.065 0.138 7.740 7.51833E-08 0.780 1.350
Is the Simple Regression Model Statistically Valid?
It is important to test whether the regression model developed from sample data is statistically valid.
For simple regression, we can use two approaches to test whether the coefficient of X is equal to zero:
1. a t-test
2. ANOVA
Is the coefficient of X equal to zero?
In both cases, the hypothesis we test is:
H0: Slope = 0
Ha: Slope ≠ 0
What could we say about the linear relationship between X and Y if the slope were zero?
Using coefficient information for testing if slope=0
t-Stat = 7.740 and P-value = 7.52E-08. The P-value is very small. If it is smaller than our α level, we reject the null hypothesis; otherwise we do not. If α = 0.05, we reject the null and conclude that the slope is not zero. The same result holds at α = 0.01 because the P-value is smaller than 0.01. Thus, at the 0.05 (or 0.01) level, we conclude that the slope is NOT zero, implying that our model is statistically valid.
P-value = 7.52E-08 = 7.52 × 10^-8 = 0.0000000752
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.85
R Square 0.72
Adjusted R Square 0.71
Standard Error 194.60
Observations 25
ANOVA
df SS MS F Significance F
Regression 1 2268776.545 2268776.545 59.91376452 7.51833E-08
Residual 23 870949.4547 37867.3676
Total 24 3139726
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 177.121 161.004 1.100 0.282669853 -155.942 510.184
Size 1.065 0.138 7.740 7.51833E-08 0.780 1.350
Using ANOVA for testing if slope=0 in EXCEL
F = 59.91376 and P-value = 7.51833E-08. The P-value is again very small: if it is smaller than our α level, we reject the null hypothesis; otherwise we do not. Thus, at the 0.05 (or 0.01) level, the slope is NOT zero, implying that our model is statistically valid. This is the same conclusion we reached using the t-test.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.85
R Square 0.72
Adjusted R Square 0.71
Standard Error 194.60
Observations 25
ANOVA
df SS MS F Significance F
Regression 1 2268776.545 2268776.545 59.91376452 7.51833E-08
Residual 23 870949.4547 37867.3676
Total 24 3139726
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 177.121 161.004 1.100 0.282669853 -155.942 510.184
Size 1.065 0.138 7.740 7.51833E-08 0.780 1.350
Confidence Interval for the Slope of Size
The 95% CI tells us that for every additional square foot of apartment Size, Rent increases by between $0.78 and $1.35 on average.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.85
R Square 0.72
Adjusted R Square 0.71
Standard Error 194.60
Observations 25
ANOVA
df SS MS F Significance F
Regression 1 2268776.545 2268776.545 59.91376452 7.51833E-08
Residual 23 870949.4547 37867.3676
Total 24 3139726
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 177.121 161.004 1.100 0.282669853 -155.942 510.184
Size 1.065 0.138 7.740 7.51833E-08 0.780 1.350
How To Do A Regression In Excel
1. Enter the data into a spreadsheet.
2. Tools / Data Analysis / Regression.
3. Enter the dependent variable in the "y" column and the independent variable (or variables) in the "x" columns.
4. Indicate where the output should go (the first cell in the range works).
5. The basic regression is done. (You may need to widen columns.)
1. Construct a scatterplot to see if the data looks linear: click Chart Wizard.
2. Select Scatter and this chart sub-type.
3. Highlight only the cells that contain the x and y values.
4. Enter a chart title and label the x and y axes.
5. Store the chart on a new worksheet and name the worksheet.
6. Click on the grey background -- Delete. Click on any horizontal line -- Delete. Click on the legend -- Delete.
7. Right mouse click on any data point and select "Add Trendline". Select Linear from the trendline options.
8. Looks linear. Return to the original worksheet.
9. Go to the Tools menu and select Data Analysis.
10. Select Regression.
11. Highlight the cells of the y-variable, then the cells of the x-variable. Check Labels (if the first row has labels). Check Confidence Level for intervals other than 95% (and change the %). Click New Worksheet Ply and give it a name.
The output is annotated as follows: a low p-value means the linear model is OK; the fitted line is y = 46486.49 + 52.56757x; the Regression Statistics block shows r, r², adjusted r², s, and n; the ANOVA block shows SSR, SSE, and SSTOTAL; and the coefficients block gives 95% (and 99%) confidence intervals for β1.
Slope
Returns the slope of the linear regression line through data points in known_y's and known_x's. The slope is the vertical distance divided by the horizontal distance between any two points on the line, which is the rate of change along the regression line.
Syntax
SLOPE(known_y's,known_x's)
Known_y's is an array or cell range of numeric dependent data points.
Known_x's is the set of independent data points.
INTERCEPT
Calculates the point at which a line will intersect the y-axis by using existing x-values and y-values. The intercept point is based on a best-fit regression line plotted through the known x-values and known y-values. Use the INTERCEPT function when you want to determine the value of the dependent variable when the independent variable is 0 (zero). For example, you can use the INTERCEPT function to predict a metal's electrical resistance at 0°C when your data points were taken at room temperature and higher.
Syntax
INTERCEPT(known_y's,known_x's)
Known_y's is the dependent set of observations or data.
Known_x's is the independent set of observations or data.
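These two worksheet functions can be mirrored with the textbook least squares formulas; a sketch in Python (the function names here echo Excel's but are illustrative re-implementations, not calls to Excel):

```python
def slope(known_ys, known_xs):
    # Least squares slope, as computed by Excel's SLOPE function
    n = len(known_xs)
    mx = sum(known_xs) / n
    my = sum(known_ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - mx) ** 2 for x in known_xs)
    return sxy / sxx

def intercept(known_ys, known_xs):
    # Value of y where the best-fit line crosses the y-axis (x = 0),
    # as computed by Excel's INTERCEPT function
    n = len(known_xs)
    my = sum(known_ys) / n
    mx = sum(known_xs) / n
    return my - slope(known_ys, known_xs) * mx

# Example: data lying exactly on y = 2x + 1
ys = [3, 5, 7, 9]
xs = [1, 2, 3, 4]
print(slope(ys, xs), intercept(ys, xs))  # 2.0 1.0
```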
The basic ANOVA situation
Two variables: one categorical, one quantitative.
Main question: does the mean of the quantitative variable depend on which group (given by the categorical variable) the individual is in?
If the categorical variable has only 2 values, use a 2-sample t-test; ANOVA allows for 3 or more groups.
An example ANOVA situation
Subjects: 25 patients with blisters.
Treatments: Treatment A, Treatment B, Placebo.
Measurement: number of days until blisters heal.
Data [and means]:
• A: 5, 6, 6, 7, 7, 8, 9, 10 [7.25]
• B: 7, 7, 8, 9, 9, 10, 10, 11 [8.875]
• P: 7, 9, 9, 10, 10, 10, 11, 12, 13 [10.11]
Are these differences significant?
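The one-way ANOVA F statistic for these data can be computed directly; a sketch in Python using the group data listed above:

```python
# One-way ANOVA by hand for the blister-healing data
groups = {
    "A": [5, 6, 6, 7, 7, 8, 9, 10],
    "B": [7, 7, 8, 9, 9, 10, 10, 11],
    "P": [7, 9, 9, 10, 10, 10, 11, 12, 13],
}

all_values = [v for g in groups.values() for v in g]
n = len(all_values)
I = len(groups)
grand_mean = sum(all_values) / n

# Between-group sum of squares: group sizes times squared mean differences
ssg = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
# Within-group sum of squares: squared deviations from each group mean
sse = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)

msg = ssg / (I - 1)   # mean square between groups
mse = sse / (n - I)   # mean square within groups (error)
f_stat = msg / mse
print(round(ssg, 2), round(sse, 2), round(f_stat, 2))  # 34.74 59.26 6.45
```

These totals match the ANOVA output shown later in the section (SS = 34.74 and 59.26, F = 6.45).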
Informal Investigation
Graphical investigation:
• side-by-side box plots
• multiple histograms
Whether the differences between the groups are significant depends on:
• the difference in the means
• the standard deviations of each group
• the sample sizes
ANOVA determines the P-value from the F statistic.
Side by Side Boxplots
(Boxplots of days to heal by treatment: A, B, P; days range from about 5 to 13)
What does ANOVA do?
At its simplest (there are extensions) ANOVA tests the following hypotheses:
H0: The means of all the groups are equal.
Ha: Not all of the means are equal.
Note that Ha doesn't say how or which means differ; we can follow up with "multiple comparisons".
Note: we usually refer to the sub-populations as “groups” when doing ANOVA.
Assumptions of ANOVA
Each group is approximately normal:
• check this by looking at histograms and/or normal quantile plots, or rely on assumptions
• ANOVA can handle some non-normality, but not severe outliers
Standard deviations of each group are approximately equal:
• rule of thumb: the ratio of the largest to the smallest sample standard deviation must be less than 2:1
Normality Check
We should check for normality using:
• assumptions about the population
• histograms for each group
• a normal quantile plot for each group
With such small data sets, there really isn't a good way to check normality from the data, but we make the common assumption that physical measurements of people tend to be normally distributed.
Standard Deviation Check
Compare the largest and smallest standard deviations: largest = 1.764, smallest = 1.458. Since 1.458 × 2 = 2.916 > 1.764, the 2:1 rule of thumb is satisfied.
Note: a variance ratio of 4:1 is equivalent.
Variable treatment N Mean Median StDev
days A 8 7.250 7.000 1.669
B 8 8.875 9.000 1.458
P 9 10.111 10.000 1.764
Notation for ANOVA
• n = number of individuals all together
• I = number of groups
• x̄ = mean for the entire data set
Group i has:
• ni = # of individuals in group i
• xij = value for individual j in group i
• x̄i = mean for group i
• si = standard deviation for group i
How ANOVA works (outline)
ANOVA measures two sources of variation in the data and compares their relative sizes:
• variation BETWEEN groups: for each data value, look at the difference between its group mean and the overall mean, (x̄i − x̄)²
• variation WITHIN groups: for each data value, look at the difference between that value and the mean of its group, (xij − x̄i)²
The ANOVA F-statistic is the ratio of the between-group variation to the within-group variation:
F = Between / Within = MSG / MSE
A large F is evidence against H0, since it indicates that there is more difference between groups than within groups.
How are These Computations Made?
We want to measure the amount of variation due to BETWEEN-group variation and WITHIN-group variation. For each data value, we calculate its contribution to:
• BETWEEN-group variation: (x̄i − x̄)²
• WITHIN-group variation: (xij − x̄i)²
An Even Smaller Example
Suppose we have three groups:
• Group 1: 5.3, 6.0, 6.7
• Group 2: 5.5, 6.2, 6.4, 5.7
• Group 3: 7.5, 7.2, 7.9
We get the following statistics:
SUMMARY
Groups Count Sum Average Variance
Column 1 3 18 6 0.49
Column 2 4 23.8 5.95 0.176667
Column 3 3 22.6 7.533333 0.123333
Excel ANOVA Output
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 5.127333 2 2.563667 10.21575 0.008394 4.737416
Within Groups 1.756667 7 0.250952
Total 6.884 9
Between Groups df = 1 less than the number of groups. Within Groups df = number of data values − number of groups (equals the df for each group added together). Total df = 1 less than the number of individuals (just like other situations).
Computing the ANOVA F statistic (overall mean = 6.44):

                         WITHIN:                  BETWEEN:
                         data − group mean        group mean − overall mean
data  group  group mean  plain     squared        plain     squared
5.3   1      6.00        -0.70     0.490          -0.44     0.194
6.0   1      6.00         0.00     0.000          -0.44     0.194
6.7   1      6.00         0.70     0.490          -0.44     0.194
5.5   2      5.95        -0.45     0.203          -0.49     0.240
6.2   2      5.95         0.25     0.063          -0.49     0.240
6.4   2      5.95         0.45     0.203          -0.49     0.240
5.7   2      5.95        -0.25     0.063          -0.49     0.240
7.5   3      7.53        -0.03     0.001           1.09     1.188
7.2   3      7.53        -0.33     0.109           1.09     1.188
7.9   3      7.53         0.37     0.137           1.09     1.188
TOTAL                              1.757                    5.106
TOTAL/df                           0.251                    2.553

F = MSG / MSE = 2.553 / 0.251 ≈ 10.2 (Excel's exact value is 10.21575; the small difference comes from rounding the group means and deviations above)
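Redoing the computation without rounding the group means reproduces Excel's output exactly; a Python sketch:

```python
# Exact ANOVA computation for the three small groups (no rounding)
groups = [[5.3, 6.0, 6.7], [5.5, 6.2, 6.4, 5.7], [7.5, 7.2, 7.9]]
values = [v for g in groups for v in g]
n, I = len(values), len(groups)
grand = sum(values) / n  # overall mean, 6.44

# Between- and within-group sums of squares
ssg = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
sse = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

f_stat = (ssg / (I - 1)) / (sse / (n - I))
print(round(ssg, 6), round(sse, 6), round(f_stat, 5))
```

The sums of squares come out as 5.127333 and 1.756667, and F as 10.21575, matching the Excel ANOVA table above.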
ANOVA Output
treatment df = 1 less than the # of groups. Error df = # of data values − # of groups (equals the df for each group added together). Total df = 1 less than the # of individuals (just like other situations).
Analysis of Variance for days
Source DF SS MS F P
treatment 2 34.74 17.37 6.45 0.006
Error 22 59.26 2.69
Total 24 94.00
ANOVA Output
Analysis of Variance for days
Source DF SS MS F P
treatment 2 34.74 17.37 6.45 0.006
Error 22 59.26 2.69
Total 24 94.00
SS stands for sum of squares. ANOVA splits the total sum of squares into parts:
• Total: SST = Σobs (xij − x̄)²
• Between groups: SSG = Σobs (x̄i − x̄)²
• Within groups (Error): SSE = Σobs (xij − x̄i)²
ANOVA Output
MSG = SSG / DFG, MSE = SSE / DFE
Analysis of Variance for days
Source DF SS MS F P
treatment 2 34.74 17.37 6.45 0.006
Error 22 59.26 2.69
Total 24 94.00
F = MSG / MSE
P-value comes from F(DFG,DFE)
(P-values for the F statistic are in Table E)
So How Big is F?
Since F is
Mean Square Between / Mean Square Within = MSG / MSE
A large value of F indicates relatively more difference between groups than within groups (evidence against H0)
To get the P-value, we compare to F(I-1,n-I)-distribution • I-1 degrees of freedom in numerator (# groups -1) • n - I degrees of freedom in denominator (rest of df)
Connections between SST, MST, and Standard Deviation
If we ignore the groups for a moment and just compute the standard deviation of the entire data set, we see
s² = Σ (xij − x̄)² / (n − 1) = SST / DFT = MST
So SST = (n − 1)s², and MST = s². That is, SST and MST measure the TOTAL variation in the data set.
Connections between SSE, MSE, and Standard Deviation
Remember:
si² = Σj (xij − x̄i)² / (ni − 1) = SS[Within Group i] / dfi
So SS[Within Group i] = (si²)(dfi). This means that we can compute SSE from the standard deviations and sizes (df) of each group:
SSE = SS[Within] = Σ SS[Within Group i] = Σ (si²)(ni − 1) = Σ (si²)(dfi)
Pooled Estimate for Standard Deviation
One of the ANOVA assumptions is that all groups have the same standard deviation. We can estimate this with a weighted average, the pooled variance:
sp² = [(n1 − 1)s1² + (n2 − 1)s2² + … + (nI − 1)sI²] / (n − I)
    = [(df1)s1² + (df2)s2² + … + (dfI)sI²] / (df1 + df2 + … + dfI)
    = SSE / DFE = MSE
so MSE is the pooled estimate of variance.
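A quick check of this identity on the blister data; a sketch in Python (`statistics.variance` gives the sample variance si²):

```python
# Pooled variance for the blister example, computed two equivalent ways
from statistics import variance

groups = [
    [5, 6, 6, 7, 7, 8, 9, 10],          # A
    [7, 7, 8, 9, 9, 10, 10, 11],        # B
    [7, 9, 9, 10, 10, 10, 11, 12, 13],  # P
]
n = sum(len(g) for g in groups)
I = len(groups)

# Weighted average of the group variances...
sp2 = sum((len(g) - 1) * variance(g) for g in groups) / (n - I)

# ...equals SSE / DFE = MSE
sse = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
mse = sse / (n - I)
print(round(sp2, 2), round(mse, 2))  # 2.69 2.69
```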
In Summary
SST = Σobs (xij − x̄)² = s²(DFT)
SSE = Σobs (xij − x̄i)² = Σgroups si²(dfi)
SSG = Σobs (x̄i − x̄)² = Σgroups ni(x̄i − x̄)²
SSE + SSG = SST;  MS = SS / DF;  F = MSG / MSE
R2 Statistic
R² = SSG / SST = SS[Between] / SS[Total]
R² gives the percent of variance due to between-group variation.
We will see R2 again when we study regression.
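Using the SS values from the blister ANOVA table earlier, the computation is a one-liner:

```python
# R^2 for the blister ANOVA: between-group SS over total SS
ssg, sse = 34.74, 59.26
sst = ssg + sse  # 94.00
r2 = ssg / sst
print(round(r2, 2))  # 0.37
```

So about 37% of the variation in healing days is due to differences between the treatment groups.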
Where’s the Difference?
Analysis of Variance for days
Source DF SS MS F P
treatment 2 34.74 17.37 6.45 0.006
Error 22 59.26 2.69
Total 24 94.00
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ----------+---------+---------+------
A 8 7.250 1.669 (-------*-------)
B 8 8.875 1.458 (-------*-------)
P 9 10.111 1.764 (------*-------)
----------+---------+---------+------
Pooled StDev = 1.641 7.5 9.0 10.5
Once ANOVA indicates that the groups do not all appear to have the same means, what do we do?
Clearest difference: P is worse than A (CI’s don’t overlap)
Multiple Comparisons
Once ANOVA indicates that the groups do not all have the same means, we can compare them two by two using the 2-sample t test
• We need to adjust our p-value threshold because we are doing multiple tests with the same data.
• There are several methods for doing this.
• If we really just want to test the difference between one pair of treatments, we should set the study up that way.
Tukey's Pairwise Comparisons
Family error rate = 0.0500
Individual error rate = 0.0199
Critical value = 3.55
Intervals for (column level mean) − (row level mean):

     A                   B
B    (-3.685, 0.435)
P    (-4.863, -0.859)    (-3.238, 0.766)
95% confidence
Use alpha = 0.0199 for each test.
These give 98.01% CI’s for each pairwise difference. Only P vs A is significant (both values have same sign)
The 98% CI for A−P is (-4.86, -0.86)
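A sketch of how the intervals in the table can be reproduced, assuming the critical value (3.55) and pooled StDev (1.641) reported in the output above:

```python
# Tukey pairwise confidence intervals from summary statistics
from math import sqrt

sp = 1.641    # pooled standard deviation (from the output above)
crit = 3.55   # critical value from the studentized range table
means = {"A": 7.250, "B": 8.875, "P": 10.111}
sizes = {"A": 8, "B": 8, "P": 9}

def tukey_ci(g1, g2):
    # CI for mean(g1) - mean(g2); half-width uses crit / sqrt(2)
    diff = means[g1] - means[g2]
    hw = (crit / sqrt(2)) * sp * sqrt(1 / sizes[g1] + 1 / sizes[g2])
    return (diff - hw, diff + hw)

lo, hi = tukey_ci("A", "P")
print(round(lo, 2), round(hi, 2))  # -4.86 -0.86
```

Since this interval excludes zero, the A vs. P difference is the one significant comparison.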
Tukey’s Method in R
Tukey multiple comparisons of means 95% family-wise confidence level
diff lwr upr
B-A 1.6250 -0.43650 3.6865
P-A 2.8611 0.85769 4.8645
P-B 1.2361 -0.76731 3.2395
Forecasting: Basic Time Series Decomposition in Excel
Forecast method 1 – Guess
Forecast method 2 – Linear Regression
Forecast method 3 – Time Series Decomposition (TSD)
References
http://www.wikihow.com/Run-Regression-Analysis-in-Microsoft-Excel
http://office.microsoft.com/en-001/excel-help/slope-HP005209264.aspx
http://office.microsoft.com/en-in/excel-help/intercept-HP005209143.aspx
http://capacitas.wordpress.com/2013/01/14/forecasting-basic-time-series-decomposition-in-excel/
THANK YOU