Multiple Regression: Fitting Models for Multiple Independent Variables
By Ellen Ludlow
If you wanted to predict someone's weight based on their height, you would collect data by recording heights and weights and fit a model. Let's say our population is males ages 16-25, and this is a table of the collected data...
height (ins):  60   63   65   66   67   68   68   69   70   70   71   72   72   73   75
weight (lbs):  120  135  130  143  137  149  144  150  156  152  154  162  169  163  168
Next, we graph the data...

[Figure: scatterplot "Weight vs. Height" with Height (ins) on the horizontal axis and Weight (lbs) on the vertical axis]
Because the data looks linear, we fit a least-squares regression (LSR) line.
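As an illustrative sketch (this is not part of the original slides, which use Minitab), the LSR line can be fit with NumPy's polyfit, using the fifteen height/weight pairs from the table:

```python
import numpy as np

# The 15 (height, weight) observations from the table
height = np.array([60, 63, 65, 66, 67, 68, 68, 69, 70, 70, 71, 72, 72, 73, 75], float)
weight = np.array([120, 135, 130, 143, 137, 149, 144, 150, 156, 152, 154, 162, 169, 163, 168], float)

# Fit the least-squares regression line: weight = intercept + slope * height
slope, intercept = np.polyfit(height, weight, deg=1)
print(f"weight-hat = {intercept:.2f} + {slope:.2f} * height")
```

For these data the slope comes out to roughly 3.4 pounds of weight per inch of height.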
But weight isn't the only factor that has an impact on someone's height. The height of someone's parents may be another predictor. With multiple regression you may have more than one independent variable, so you could use someone's weight and his parents' height to predict his own height.
Our new table, with the data plus the average height of each subject's parents, looks like this:
height (ins):           60   63   65   66   67   68   68   69   70   70   71   72   72   73   75
weight (lbs):           120  135  130  143  137  149  144  150  156  152  154  162  169  163  168
parent's height (ins):  59   67   62   59   71   66   71   67   69   73   69   75   72   69   73
This data can't be graphed like simple linear regression, because there are two independent variables. There is software, however, such as Minitab, that can analyze data with multiple independent variables. Let's take a look at a Minitab output for our data.
What does all this mean?

Predictor    Coef      Stdev     t-ratio   p
Constant     25.028    4.326     5.79      0.000
weight       0.24020   0.03140   7.65      0.000
parenth      0.11493   0.09035   1.27      0.227

s = 1.165    R-sq = 92.6%    R-sq(adj) = 91.4%

Analysis of Variance

SOURCE       DF    SS       MS       F       p
Regression   2     205.31   102.65   75.62   0.000
Error        12    16.29    1.36
Total        14    221.60
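As a sketch of how this output arises (using NumPy rather than Minitab, so this is not from the original slides), the coefficients can be reproduced with an ordinary least-squares fit of height on weight and parents' height:

```python
import numpy as np

# Data from the tables: n = 15 subjects
height  = np.array([60, 63, 65, 66, 67, 68, 68, 69, 70, 70, 71, 72, 72, 73, 75], float)
weight  = np.array([120, 135, 130, 143, 137, 149, 144, 150, 156, 152, 154, 162, 169, 163, 168], float)
parenth = np.array([59, 67, 62, 59, 71, 66, 71, 67, 69, 73, 69, 75, 72, 69, 73], float)

# Design matrix with an intercept column: height = b0 + b1*weight + b2*parenth
X = np.column_stack([np.ones_like(height), weight, parenth])
b0, b1, b2 = np.linalg.lstsq(X, height, rcond=None)[0]

# These should land close to Minitab's Coef column: 25.028, 0.24020, 0.11493
print(f"Constant {b0:.3f}, weight {b1:.5f}, parenth {b2:.5f}")
```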
First, let's look at the multiple regression model. The general model for multiple regression is similar to the model for simple linear regression.

Simple linear regression model:  y = β0 + β1x + ε

Multiple regression model:  y = β0 + β1x1 + β2x2 + ... + βkxk + ε
Just like linear regression, when you fit a multiple regression to data, the terms in the model equation are statistics, not parameters. A fitted multiple regression model using statistical notation, where k is the number of independent variables, looks like:

ŷ = b0 + b1x1 + b2x2 + ... + bkxk
The multiple regression model for our data is:

predicted height = 25.028 + 0.24020(weight) + 0.11493(parent's height)

We get the coefficient values from the Coef column of the Minitab output:

Predictor    Coef      Stdev     t-ratio   p
Constant     25.028    4.326     5.79      0.000
weight       0.24020   0.03140   7.65      0.000
parenth      0.11493   0.09035   1.27      0.227
Once the regression is fitted, we need to know how well the model fits the data. First, we check to see if there is a good overall fit. Then, we test the significance of each independent variable. You will notice that this is the same way we test for significance in a simple linear regression.
The Overall Test

Hypotheses:

H0: β1 = β2 = ... = βk = 0  (all independent variables are unimportant for predicting y)
HA: at least one βj ≠ 0  (at least one independent variable is useful for predicting y)
How do you calculate the F-statistic?

It can easily be found in the Minitab output, along with the p-value:

SOURCE       DF    SS       MS       F       p
Regression   2     205.31   102.65   75.62   0.000
Error        12    16.29    1.36
Total        14    221.60

Or you can calculate it by hand.
But, before you can calculate the F-statistic, you need to be introduced to some other terms.

Regression sum of squares (regression SS): the variation in Y accounted for by the regression model with respect to the mean model

Error sum of squares (error SS): the variation in Y not accounted for by the regression model

Total sum of squares (total SS): the total variation in Y
Now that we understand these terms, we need to know how to calculate them:

Regression SS = Σ(ŷ − ȳ)²
Error SS = Σ(y − ŷ)²
Total SS = Σ(y − ȳ)²

Total SS = Regression SS + Error SS
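As an illustrative sketch (assuming the data from the tables above and a NumPy least-squares fit in place of Minitab), these three sums of squares can be computed and checked against the identity:

```python
import numpy as np

# Data from the tables (y = height; predictors = weight and parent's height)
height  = np.array([60, 63, 65, 66, 67, 68, 68, 69, 70, 70, 71, 72, 72, 73, 75], float)
weight  = np.array([120, 135, 130, 143, 137, 149, 144, 150, 156, 152, 154, 162, 169, 163, 168], float)
parenth = np.array([59, 67, 62, 59, 71, 66, 71, 67, 69, 73, 69, 75, 72, 69, 73], float)

X = np.column_stack([np.ones_like(height), weight, parenth])
b = np.linalg.lstsq(X, height, rcond=None)[0]
y_hat = X @ b

regression_ss = np.sum((y_hat - height.mean()) ** 2)   # variation explained by the model
error_ss      = np.sum((height - y_hat) ** 2)          # variation left unexplained
total_ss      = np.sum((height - height.mean()) ** 2)  # total variation in y

# Total SS = Regression SS + Error SS (this identity holds whenever the model has an intercept)
print(round(regression_ss, 2), round(error_ss, 2), round(total_ss, 2))
```

For our data this should reproduce the SS column of the Minitab output: roughly 205.31 + 16.29 = 221.60.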
There are also regression mean squares, error mean squares, and total mean squares (abbreviated MS). To calculate these terms, you divide each sum of squares by its respective degrees of freedom:

Regression d.f. = k
Error d.f. = n − k − 1
Total d.f. = n − 1

where k is the number of independent variables and n is the total number of observations used to calculate the regression.
So:

Regression MS = Regression SS / k
Error MS = Error SS / (n − k − 1)
Total MS = Total SS / (n − 1)

(Note that, unlike the sums of squares, the mean squares do not add up: Regression MS + Error MS does not equal Total MS, because each is divided by different degrees of freedom.)
Both sum of squares and mean square values can be found in the Minitab output:

SOURCE       DF    SS       MS       F       p
Regression   2     205.31   102.65   75.62   0.000
Error        12    16.29    1.36
Total        14    221.60

Now we can calculate the F-statistic.
Test Statistic and Distribution

Test statistic:

F = Regression MS / Error MS = 102.65 / 1.36 = 75.48

which is very close to the F-statistic from Minitab (75.62); the small difference comes from rounding the error MS to 1.36.
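The same arithmetic can be sketched in a few lines (the SS values are taken from the Minitab ANOVA table; keeping the unrounded error MS recovers Minitab's 75.62 rather than the hand calculation's 75.48):

```python
# ANOVA quantities from the Minitab output: n = 15 observations, k = 2 predictors
n, k = 15, 2
regression_ss = 205.31
total_ss = 221.60
error_ss = total_ss - regression_ss            # 16.29

regression_ms = regression_ss / k              # regression d.f. = k
error_ms = error_ss / (n - k - 1)              # error d.f. = n - k - 1, unrounded

f_stat = regression_ms / error_ms
print(round(f_stat, 2))                        # close to Minitab's F of 75.62
```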
The p-value for the F-statistic is then found in an F-distribution table. As you saw before, it can also be easily calculated by software.
A small p-value rejects the null hypothesis that none of the independent variables are significant. That is to say, at least one of the independent variables is significant.
The conclusion in the context of our data is: we have strong evidence (p is approximately 0) to reject the null hypothesis. That is to say, someone's weight, his parents' average height, or both are significant in predicting his height. Once you know that at least one independent variable is significant, you can go on to test each independent variable separately.
Testing Individual Terms

If an independent variable does not contribute significantly to predicting the value of Y, the coefficient of that variable will be 0. The test of these hypotheses determines whether the estimated coefficient is significantly different from 0. From this, we can tell whether an independent variable is important for predicting the dependent variable.
Test for Individual Terms:

H0: βj = 0  (the independent variable, xj, is not important for predicting y)
HA: βj ≠ 0  (the independent variable, xj, is important for predicting y)

where j represents a specified independent variable
Test Statistic:

t = bj / SE(bj)

d.f. = n − k − 1
Remember, this test is only to be performed if the overall test of the model is significant.
The t-Distribution

Tests of individual terms for significance are the same as a test of significance in simple linear regression.
A small p-value means that the independent variable is significant.

Predictor    Coef      Stdev     t-ratio   p
Constant     25.028    4.326     5.79      0.000
weight       0.24020   0.03140   7.65      0.000
parenth      0.11493   0.09035   1.27      0.227

This test of significance shows that weight is a significant independent variable for predicting height, but average parent height is not.
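As a quick check (a sketch using the Coef and Stdev columns from the Minitab output above, not a Minitab feature), each t-ratio is just the estimated coefficient divided by its standard error:

```python
# Coef and Stdev values from the Minitab output
coef  = {"weight": 0.24020, "parenth": 0.11493}
stdev = {"weight": 0.03140, "parenth": 0.09035}

# t-ratio = estimated coefficient / its standard error
t_ratio = {name: coef[name] / stdev[name] for name in coef}
print({name: round(t, 2) for name, t in t_ratio.items()})  # weight near 7.65, parenth near 1.27
```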
Now that you know how to do tests of significance for multiple regression, there are many other things that you can learn, such as:

- How to create confidence intervals
- How to use categorical variables in multiple regression
- How to test for significance in groups of independent variables