Regression Assumptions and Diagnostic Statistics

Date post: 17-Jan-2018
Upload: opal-wilkerson

Slide 1

Regression Assumptions and Diagnostic Statistics

The purpose of this document is to demonstrate the impact of violations of regression assumptions on the diagnostic statistics and plots, both before and after the regression is computed.

I created a simulated data set of 100 cases using SPSS's random number generation facility; it contains variables with predefined statistical properties. The SPSS syntax file which produces this output can be found on the web page for downloading files.

The following examples demonstrate what happens when a violation of an underlying regression assumption occurs. In an actual problem, the impact of a violation of an underlying assumption may be more or less severe than in the problem simulated here, so the visible impact on the diagnostic tests will be more or less apparent than shown here.

In all of the examples, the violation of the assumption weakens the relationship between the set of independent variables and the dependent variable, and weakens the relationship between the individual independent variable and the dependent variable. Furthermore, the impact on the relationship is stronger when the variable violating the assumption is the dependent variable than when it is an independent variable.

Slide 2

Regression with All Assumptions Met

For the first problem, we will use a normally distributed dependent variable (DV1), two normally distributed independent variables (IV1 and IV2), and a dichotomous independent variable (IV3).
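
For readers without SPSS, a data set with the same structure can be sketched in Python. This is only an illustration: the function name, coefficients, noise level, and seed are assumptions of the sketch, not the values used to build the original file.

```python
import random

def simulate_cases(n=100, seed=1):
    """Return n cases with metric DV1, IV1, IV2 and a dichotomous IV3.

    All coefficients and the noise level are illustrative assumptions.
    """
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        iv1 = rng.gauss(0.0, 1.0)
        # IV2 is built to correlate with IV1 (an assumption of this sketch).
        iv2 = 0.85 * iv1 + 0.5 * rng.gauss(0.0, 1.0)
        iv3 = rng.randint(0, 1)  # dichotomous variable coded 0/1
        dv1 = 5.0 + 0.6 * iv1 + 0.3 * iv2 + 0.7 * iv3 + 0.25 * rng.gauss(0.0, 1.0)
        cases.append({"DV1": dv1, "IV1": iv1, "IV2": iv2, "IV3": iv3})
    return cases

cases = simulate_cases()
print(len(cases), sorted({c["IV3"] for c in cases}))
```

Fixing the seed keeps the simulated cases reproducible, which makes it easier to compare diagnostics before and after a variable is altered.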

Slide 3

Assumption of Normality

The distributions and the tests of normality for the three metric variables are shown below. All three metric variables meet the statistical test for normality.

Slide 4

Assumption of Linearity

The following scatterplots indicate that we satisfy the linearity assumption between each metric independent variable and the dependent variable:

Slide 5

Assumption of Homogeneity of Variance

For the dichotomous independent variable, the box plot and the homogeneity of variance test indicate that the assumption of constant variance is met:
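
The homogeneity of variance test reported by SPSS is Levene's test. As a rough sketch of what the mean-based version of the statistic measures, the function below (the name levene_w is hypothetical) compares the absolute deviations from each group mean across groups. It returns only the W statistic; the significance would come from an F distribution with (k - 1, N - k) degrees of freedom, which is not computed here.

```python
from statistics import mean

def levene_w(*groups):
    """Levene's W statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every value from its own group mean.
    z = [[abs(y - mean(g)) for y in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = mean([v for zi in z for v in zi])
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Two small made-up groups; the second has twice the spread of the first.
print(levene_w([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```

A W near zero means the groups have similar spread; larger values point toward unequal variances.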

Slide 6

Regression Results

When we run standard multiple regression with the three independent variables, we find that there is a very strong relationship between the dependent variable and the set of independent variables. Furthermore, each of the independent variables has a statistically significant relationship with the dependent variable:

Slide 7

Residual Analysis

The plot of residuals is a null plot, i.e. it contains no pattern of nonlinearity and demonstrates constant variance across the predicted values of the dependent variable. The normality plot of the residuals supports a conclusion of normality.

The partial plots show no evidence of a nonlinear relationship:

In sum, all of the diagnostic statistics and plots support the conclusion that this analysis meets all of the assumptions for multiple regression.

Slide 8

Outliers and Influential Cases

No casewise plot was printed in the output, indicating that no case had a standardized residual beyond +/- 3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 1.870 and the smallest standardized residual is -1.856.
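
The +/- 3.0 screen can be sketched as follows, assuming the standardized residual is the raw residual divided by the residual standard error, i.e. the square root of SSE/(n - k - 1). The function names and the residuals in the example are made up for illustration and are not taken from the SPSS output.

```python
from math import sqrt

def standardized_residuals(residuals, n_predictors):
    """Divide each raw residual by the residual standard error."""
    n = len(residuals)
    s = sqrt(sum(e * e for e in residuals) / (n - n_predictors - 1))
    return [e / s for e in residuals]

def outlier_cases(residuals, n_predictors, limit=3.0):
    """Return 1-based case numbers whose standardized residual exceeds +/- limit."""
    z = standardized_residuals(residuals, n_predictors)
    return [case for case, value in enumerate(z, start=1) if abs(value) > limit]

# Hypothetical residuals: thirty small ones plus two large ones.
example = [0.5, -0.5] * 15 + [6.0, -6.0]
print(outlier_cases(example, n_predictors=3))
```

An empty list corresponds to the situation described above, where SPSS prints no casewise plot.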

Slide 9

Outliers and Influential Cases

In the table of extreme values for the probability of Mahalanobis D², we see that one case, 78, is potentially an outlier on the set of independent variables.

Since we have 100 cases and three independent variables in the analysis, the criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. For this problem the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see one case, 74, whose value for Cook's distance is right on the borderline for being considered an influential case.
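
The cutoff arithmetic is easy to check in code. The sketch below uses hypothetical function names, and the Cook's distance values in the example are invented for illustration; they are not the values from the Extreme Values table.

```python
def cooks_cutoff(n_cases, n_predictors):
    """Rule-of-thumb cutoff 4 / (n - k - 1) for Cook's distance."""
    return 4 / (n_cases - n_predictors - 1)

def flag_influential(cooks_d, n_cases, n_predictors):
    """Return the case ids whose Cook's distance exceeds the cutoff."""
    cutoff = cooks_cutoff(n_cases, n_predictors)
    return [case for case, d in cooks_d.items() if d > cutoff]

print(round(cooks_cutoff(100, 3), 3))  # 0.042

# Hypothetical Cook's distances keyed by case id.
print(flag_influential({1: 0.050, 2: 0.010, 3: 0.041}, 100, 3))  # [1]
```

Because the cutoff depends only on n and k, it stays at 0.042 for every analysis in this document.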

Slide 10

Regression with a Discrete Dependent Variable

We can round the values of the continuous dependent variable (DV1) to create a discrete dependent variable (DV2) that has a limited number of categories. In the following section, we will examine the impact that a discrete dependent variable has on the analysis. The statistical measures of the distribution of the dependent variable (mean, standard deviation, etc.) change only slightly.
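
The rounding step can be imitated on simulated data. In this sketch the variable names mirror the document, but the data-generating coefficients and the seed are assumptions; the point is only to compare the correlation of a predictor with the continuous and the rounded dependent variable.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(7)
iv1 = [rng.gauss(0.0, 1.0) for _ in range(100)]
dv1 = [5.0 + 0.9 * x + 0.4 * rng.gauss(0.0, 1.0) for x in iv1]  # continuous
dv2 = [round(y) for y in dv1]  # discrete version with few distinct values

print("distinct values:", len(set(dv1)), "->", len(set(dv2)))
print("r with DV1:", pearson_r(iv1, dv1))
print("r with DV2:", pearson_r(iv1, dv2))
```

Rounding behaves like added measurement noise, so the correlation with the rounded variable is usually a little lower, which is the pattern reported for DV2.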

Slide 11

Assumption of Normality

The distribution of the discrete dependent variable (DV2) looks close to a normal distribution, but fails the normality test. The normality tests for the metric independent variables are not changed.

Slide 12

Assumption of Linearity

The discrete dependent variable retains a linear relationship with the independent variables, but the distinctive banding for the limited number of values for a discrete variable is evident.

Slide 13

Assumption of Homogeneity of Variance

The use of the discrete dependent variable did not introduce any problem with homogeneity of variance for the nonmetric variable.

Slide 14

Differences in Correlations with the Continuous and Discrete Dependent Variable

In the correlation matrix, we can see that for all three independent variables, the relationship with the discrete dependent variable in the DV2 column is slightly smaller than the relationship with the continuous dependent variable in the DV1 column.

Correlations (Pearson correlation, with two-tailed significance in parentheses; N = 100 for every pair)

                                   DV1           DV2           IV1           IV2           IV3
DV1 - Normally Distributed         1.000         .974 (.000)   .900 (.000)   .844 (.000)   .321 (.001)
DV2 - Discrete Normal              .974 (.000)   1.000         .889 (.000)   .837 (.000)   .281 (.005)
IV1 - Normally Distributed         .900 (.000)   .889 (.000)   1.000         .867 (.000)   -.046 (.652)
IV2 - Normally Distributed         .844 (.000)   .837 (.000)   .867 (.000)   1.000         .001 (.995)
IV3 - Nonmetric, Homoscedastic     .321 (.001)   .281 (.005)   -.046 (.652)  .001 (.995)   1.000

Slide 15

Regression Results

Consistent with weaker correlations between the discrete dependent variable and the independent variables, the results of the regression analysis in the tables below show that the R² value decreased from 0.951 to 0.906. Each of the individual independent variables retains its statistically significant relationship to the dependent variable.

Slide 16

Residual Analysis

The residual plot shows the impact of the discrete coding for the dependent variable, but otherwise has the same general shape as the null plot for the continuous dependent variable. The normality plot does not indicate any problem with normality.

Similarly, the partial plots show evidence of the weaker relationships, but otherwise do not demonstrate any departure from linearity.

Slide 17

Outliers and Influential Cases

No casewise plot was printed in the output, indicating that no case had a standardized residual beyond +/- 3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 2.427 and the smallest standardized residual is -2.091.

Slide 18

Outliers and Influential Cases

In the table of extreme values for the probability of Mahalanobis D², we see that one case, 78, is potentially an outlier on the set of independent variables.

Since we have 100 cases and three independent variables in the analysis, the criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. For this problem the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see three cases, 21, 61, and 49, whose values for Cook's distance are right on the borderline for being considered influential cases.

Slide 19

Regression with a Skewed Dependent Variable

The following skewed distribution was created by randomly increasing the value of the original dependent variable (DV1) for about 20 of the cases in the original distribution, creating a new dependent variable, DV3.
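
One way to imitate this skewing step is sketched below. It assumes the selected cases are doubled, which is how the later slides on outliers describe the change; the function names, the seed, and the simulated DV1 values are assumptions of the sketch.

```python
import random

def skew_by_doubling(values, n_skewed=20, seed=3):
    """Double the value of n_skewed randomly chosen cases."""
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(values)), n_skewed))
    skewed = [2 * v if i in picked else v for i, v in enumerate(values)]
    return skewed, picked

def skewness(values):
    """Moment-based sample skewness (third central moment / variance ** 1.5)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(11)
dv1 = [rng.gauss(5.0, 1.0) for _ in range(100)]  # stand-in for DV1
dv3, picked = skew_by_doubling(dv1)
print(len(picked), skewness(dv1), skewness(dv3))
```

Doubling only raises values, so the altered cases stretch the upper tail and push the skewness in the positive direction.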

Slide 20

Assumption of Normality

The histogram and normality plot show the impact of the skewing in the dependent variable. As we would expect when we introduce skewness in the variable, the K-S Lilliefors test would support a conclusion of nonnormality.

Tests of Normality

                           Kolmogorov-Smirnov (a)
                           Statistic   df    Sig.
DV3 - Skewed Dependent     .125        100   .001

a. Lilliefors Significance Correction
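
The statistic in this table is the Kolmogorov-Smirnov distance between the sample and a normal distribution fitted with the sample's own mean and standard deviation. A minimal sketch of that distance is shown here; the function name lilliefors_d is hypothetical, and the significance level, which requires the Lilliefors correction, is not computed.

```python
from statistics import NormalDist, mean, stdev

def lilliefors_d(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

print(lilliefors_d([1, 2, 3, 4, 5]))
```

Larger distances mean a worse fit to the normal curve; whether a given distance is significant depends on the sample size.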

Slide 21

Assumption of Linearity

The scatterplots of the dependent variable with the metric independent variables retain their linear pattern, though the skewed cases spread upward, away from the rectangular band.

Slide 22

Assumption of Homogeneity of Variance

The box plot for the nonmetric independent variable shows the effects of skewing, i.e. the presence of extreme values, but the box plot, as well as the homogeneity of variance test, does not indicate a problem with constant variance:

Test of Homogeneity of Variance

DV3 - Skewed Dependent                   Levene Statistic   df1   df2      Sig.
Based on Mean                            .045               1     98       .833
Based on Median                          .069               1     98       .793
Based on Median and with adjusted df     .069               1     96.234   .793
Based on trimmed mean                    .054               1     98       .817

Slide 23

Differences in Correlations with the Normal and Skewed Dependent Variable

Skewing the dependent variable considerably weakens the relationships with the independent variables as shown in the second column of the correlation matrix:

Correlations (Pearson correlation, with two-tailed significance in parentheses; N = 100 for every pair)

                                   DV1           DV3           IV1           IV2           IV3
DV1 - Normally Distributed         1.000         .690 (.000)   .900 (.000)   .844 (.000)   .321 (.001)
DV3 - Skewed Dependent             .690 (.000)   1.000         .659 (.000)   .633 (.000)   .134 (.183)
IV1 - Normally Distributed         .900 (.000)   .659 (.000)   1.000         .867 (.000)   -.046 (.652)
IV2 - Normally Distributed         .844 (.000)   .633 (.000)   .867 (.000)   1.000         .001 (.995)
IV3 - Nonmetric, Homoscedastic     .321 (.001)   .134 (.183)   -.046 (.652)  .001 (.995)   1.000

Slide 24

Regression Results

Similarly, the overall relationship between the set of independent variables and the dependent variable declines in strength, with R² falling from .951 to .474, but is still statistically significant.

The relationships between two of the individual independent variables and the dependent variable remain significant.  The individual relationship between the second metric independent variable and the dependent variable is no longer statistically significant.

Slide 25

Residual Analysis

The residual plot shows the funnel-shaped pattern associated with heteroscedasticity. At the left of the plot, the variance of the residuals is very limited, growing larger as we move to the right side of the plot. Heteroscedasticity, in this instance, is associated with skewing of the dependent variable.

The normal probability plot of the residuals departs substantially from the green line of expected frequencies, indicating that the residuals are not normally distributed.

The partial plots of the dependent variable with the metric independent variables also show the effects of skewing, but do not indicate any nonlinearity.

Slide 26

Outliers and Influential Cases

The casewise plot shows the presence of outliers on the dependent variable because of the skewing. If we examine case 11 in the data editor, we see that the formula for skewing the dependent variable increased the value of the dependent variable from 5.675 to 11.351, making it about three units larger than any other value for the dependent variable.

Slide 27

Outliers and Influential Cases

In the table of extreme values for the probability of Mahalanobis D², we continue to see that the same case, 78, is potentially an outlier on the set of independent variables.  Thus far we have not changed the values for any independent variables.

Since we have 100 cases and three independent variables in the analysis, the criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. For this problem the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that we have at least five cases that have a Cook's distance larger than the criterion of 0.042. Case 11, with the largest value for the dependent variable, has the largest Cook's distance measure. All of these cases had their value doubled to produce the skewing in the distribution, but they were not the only cases in the distribution that had the value of the dependent variable doubled.

Slide 28

Regression with a Skewed Independent Variable

The next sequence uses the normally distributed dependent variable (DV1) and substitutes a skewed version of the first independent variable (IV1S) for the original normally distributed independent variable (IV1).

Slide 29

Assumption of Normality

The histogram and the K-S Lilliefors test both indicate non-normality for the new independent variable.

Slide 30

Assumption of Linearity

The scatterplot on the left shows the original metric independent variable IV1. The scatterplot on the right shows the effect of skewing some values of the original IV1 variable. The main band of points is pushed somewhat to the left by the addition of larger skewed values for IV1S.

Slide 31

Assumption of Homogeneity of Variance

The homogeneity of variance assumption is not impacted by the change in the metric independent variable.

Slide 32

Differences in Correlations with the Normal and Skewed Independent Variable

The correlation matrix shows that the relationship between IV1S and DV1 (.749) is weaker than the relationship between IV1 and DV1 (.900), which we would expect with a variable whose relationship to the dependent variable is no longer linear.

Slide 33

Regression Results

The strength of the relationship measured by R² dropped from .951 to .910 due to the skewed independent variable, though the overall relationship between the independent variables and the dependent variable is still statistically significant.

Each of the individual independent variables retained its statistically significant relationship to the dependent variable.

Slide 34

Residual Analysis

The scatterplot of residuals is still a null plot. The normality plot of the residuals indicates a normal distribution.

The partial plot for the skewed independent variable shows evidence of nonlinearity. When we changed the values for one variable in the relationship to produce the skewing and retained the values of the other variable, we introduced the nonlinearity.  Whether the nonlinearity is evident or not depends on the severity of the change which we made.  If we saw this nonlinear pattern in a partial plot, we might consider a transformation of this independent variable.

Slide 35

Outliers and Influential Cases

In this analysis, we reverted to the normally distributed dependent variable, so we would anticipate that our results would be similar to the prior results when we used the same form of the dependent variable.

No casewise plot was printed in the output, indicating that no case had a standardized residual beyond +/- 3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 2.362 and the smallest standardized residual is -2.040.

Slide 36

Outliers and Influential Cases

In the table of extreme values for the probability of Mahalanobis D², we find additional cases that are potential outliers for the combination of the set of independent variables.  We can attribute this to skewing one of the variables in the set of independent variables.

Since we have 100 cases and three independent variables in the analysis, the criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. For this problem the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that we have two cases that have a Cook's distance larger than the criterion of 0.042. This is one more case than we had in the analysis with all normal variables, but fewer cases than we had when we skewed the dependent variable.

Slide 37

Regression with a Nonlinear Dependent Variable

To form a nonlinear dependent variable, I took the original dependent variable DV1 and squared it to produce DV5.

Slide 38

Assumption of Normality

This also has the effect of skewing the variable, as shown in the histogram below. The skewness produces a distribution that is not normally distributed, as shown in the normality plot and the K-S Lilliefors test.

Slide 39

Assumption of Linearity

The scatterplots of the metric independent variables with the nonlinear dependent variable show the nonlinear pattern in the dependent variable. At both ends of the fit lines, there are points above the line, but not below the line.
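
The pattern of points lying above the line at both ends can be reproduced with a small noise-free example. This sketch assumes a predictor x that stands in for an independent variable, squares it to get the dependent variable, and fits a straight line by least squares; the numbers are made up and the function name is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [v * v for v in x]  # squared version of a variable that is linear in x
intercept, slope = fit_line(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
print([round(e, 3) for e in residuals])
```

The residuals are positive at the two ends and negative in the middle, the same curved signature that a straight-line fit to a squared variable leaves in a residual plot.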

Slide 40

Assumption of Homogeneity of Variance

While some differences in the heights of the bars are visible in the boxplot, the statistical test does not indicate any difference in variance for the two groups on the nonmetric variable when we test the nonlinear dependent variable.

Slide 41

Differences in Correlations with the Normal and Nonlinear Dependent Variable

The correlations between the metric independent variables and the nonlinear dependent variable, DV5 in column 3, are smaller than the corresponding correlations with the linear form of the dependent variable, DV1 in column 2. The exception is the nonmetric variable, which had a higher correlation with the nonlinear dependent variable.

Slide 42

Regression Results

The coefficient of determination between the independent variables and the nonlinear dependent variable is lower than the value obtained with the linear dependent variable. The ANOVA test confirms that this R² is significantly greater than zero.

The statistical tests for the individual coefficients indicated that all were statistically significant.

Slide 43

Residual Analysis

The uncorrected nonlinearity problem of the dependent variable is evident in the residual plot. There is clearly a nonlinear pattern in the residual plot. The normality plot would support a conclusion that the residuals are normally distributed.

The partial plots do not reflect the nonlinearity of the dependent variable.

Slide 44

Outliers and Influential Cases

When we squared the normally distributed dependent variable to introduce nonlinearity into the variable, the cases with the smallest (case 78) and largest (case 74) values of the dependent variable became outliers in the distribution and had the largest residual values.

Slide 45

Outliers and Influential Cases

In this analysis, we again utilized the original form of the independent variables, so we reverted to the circumstances where only a single case was a potential outlier on the combined set of independent variables.

Since we have 100 cases and three independent variables in the analysis, the criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. For this problem the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that we have at least five cases that have a Cook's distance larger than the criterion of 0.042. These cases had either the largest or smallest values for the original dependent variable, such that squaring their value to produce the new dependent variable had the largest impact on their position in the distribution.

Slide 46

Regression with a Nonlinear Independent Variable

To form a nonlinear independent variable named IV1CUBE, I cubed the value of IV1 and entered it into a regression with the dependent variable DV1.
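
The cubing step can be illustrated outside SPSS. The sketch below is Python with NumPy and uses made-up stand-in data, not the simulated data set from this document; the helper name `cube_and_correlate` and the data-generating parameters are assumptions for illustration only.

```python
import numpy as np


def cube_and_correlate(iv, dv):
    """Cube an independent variable and report its Pearson correlation
    with the dependent variable before and after the transformation."""
    iv = np.asarray(iv, dtype=float)
    dv = np.asarray(dv, dtype=float)
    iv_cubed = iv ** 3
    r_original = np.corrcoef(iv, dv)[0, 1]
    r_cubed = np.corrcoef(iv_cubed, dv)[0, 1]
    return iv_cubed, r_original, r_cubed


# Hypothetical stand-in data: 100 cases with a strong linear relationship
rng = np.random.default_rng(0)
iv1 = rng.normal(size=100)
dv1 = iv1 + rng.normal(scale=0.3, size=100)

iv1cube, r_iv1, r_iv1cube = cube_and_correlate(iv1, dv1)
print(f"r(IV1, DV1) = {r_iv1:.3f}, r(IV1CUBE, DV1) = {r_iv1cube:.3f}")
```

How far the correlation drops depends on how strongly cubing bends the variable over its observed range, so the figures from this sketch will not match those reported in the slides.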

Slide 47

Assumption of Normality

The histogram, the normality plot, and the K-S Lilliefors test all indicate a lack of normality in the nonlinear variable IV1CUBE.

Slide 48

Assumption of Linearity

The scattergram showing the curvilinear relationship between IV1CUBE and DV1 is shown on the left. The spread of the points above the center of the fit line is greater than the spread below the center of the fit line. The linearity of the relation between the normally distributed dependent variable DV1 and the normally distributed independent variable IV2 is unaffected by the change from IV1 to IV1CUBE.

Slide 49

Assumption of Homogeneity of Variance

The change from IV1 to IV1CUBE has no effect on the relationship between DV1 and IV3, the nonmetric homogeneous independent variable.

Slide 50

Differences in Correlations with the Normal and Nonlinear Independent Variable

The correlation of IV1CUBE with DV1 of .878 is not much smaller than the correlation of IV1 and DV1, suggesting that the curvature of IV1CUBE is slight.

Slide 51

Regression Results

The change produced by cubing IV1 was minimal, as we just noted. Consistent with this observation, our regression statistics are not appreciably different from the model in which IV1 had a linear relationship to the dependent variable.

Slide 52

Residual Analysis

The residual plot is very close to a null plot. There might be a slight curve to the plot associated with the three points in the lower left-hand corner with no points to the right.

The nonlinear pattern in the partial plot for the IV1CUBE variable was expected.  The pattern in the partial plot of DV1 and IV2 is similar to the original partial plot obtained with the linear form of IV1.

Slide 53

Outliers and Influential Cases

In this analysis, we reverted to the normally distributed dependent variable, so we would anticipate results similar to the prior results obtained with the same form of the dependent variable.

No casewise plot was printed in the output, indicating that no case had a standardized residual larger than +/- 3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 1.902 and the smallest standardized residual is -2.562.
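
The +/- 3.0 rule for standardized residuals can be sketched in Python with NumPy. This is an illustration, not the SPSS procedure itself: it assumes a standardized residual is the raw residual divided by the standard error of the estimate, and the function names are hypothetical.

```python
import numpy as np


def standardized_residuals(X, y):
    """Ordinary least squares residuals divided by the standard error
    of the estimate (assumed definition of a standardized residual)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    A = np.column_stack([np.ones(len(y)), X])  # add the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]                  # n - k - 1
    return resid / np.sqrt(resid @ resid / dof)


def flag_outliers(X, y, cutoff=3.0):
    """Indices of cases whose standardized residual exceeds +/- cutoff."""
    z = standardized_residuals(X, y)
    return np.flatnonzero(np.abs(z) > cutoff)
```

An empty result from `flag_outliers` corresponds to the situation described above, where no casewise plot is printed.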

Slide 54

Outliers and Influential Cases

In the table of extreme values for the probability of Mahalanobis D², we find additional cases that are potential outliers for the combination of the set of independent variables.  We can attribute this to the nonlinear variable in the set of independent variables.
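
For readers who want to see what lies behind the Mahalanobis D² values, here is a minimal Python sketch with NumPy. It assumes D² is computed from the sample covariance matrix of the independent variables; the function name is hypothetical, and the conversion of D² to a probability is not shown.

```python
import numpy as np


def mahalanobis_d2(X):
    """Squared Mahalanobis distance of each case from the centroid of
    the independent variables, using the sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))  # divisor n - 1
    cov_inv = np.linalg.inv(cov)
    # Row-wise quadratic form: centered_i' * cov_inv * centered_i
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
```

Under multivariate normality, D² is commonly referred to a chi-square distribution with degrees of freedom equal to the number of independent variables, which is how a probability for each case can be obtained.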

Since we have 100 cases and three independent variables in the analysis, the criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. For this problem the criterion is 4/(100 - 3 - 1) = 0.042.

Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that at least five cases have a Cook's distance larger than 0.042. The cases with large Cook's distance values have either a very large value or a very small value of IV1CUBE, relative to the other cases in the data set.
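
The comparison of Cook's distance against the 4/(n - k - 1) criterion can be sketched as follows. This is a Python and NumPy illustration under the usual textbook definition of Cook's distance, not a reproduction of the SPSS output; the function names are hypothetical.

```python
import numpy as np


def cooks_distance(X, y):
    """Cook's distance for each case in an OLS regression with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    A = np.column_stack([np.ones(len(y)), X])
    p = A.shape[1]                      # parameters, including intercept
    hat = A @ np.linalg.pinv(A)         # hat (projection) matrix
    leverage = np.diag(hat)
    resid = y - hat @ y
    mse = resid @ resid / (len(y) - p)
    return resid ** 2 / (p * mse) * leverage / (1.0 - leverage) ** 2


def influential_cases(X, y):
    """Indices of cases whose Cook's distance exceeds 4 / (n - k - 1)."""
    d = cooks_distance(X, y)
    k = 1 if np.ndim(X) == 1 else np.shape(X)[1]
    return np.flatnonzero(d > 4.0 / (len(d) - k - 1))
```

A case combines a large residual with high leverage to produce a large Cook's distance, which is why extreme values of a predictor tend to appear in this list.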

Slide 55

Regression with a Nonmetric Independent Variable with Unequal Subgroup Variance

A new nonmetric independent variable (IV3UNEQ) was created based on the original nonmetric independent variable (IV3). The difference between the variables is that some of the subjects were reassigned from one group to the other to make the variance of the two groups heterogeneous. Some of the subjects whose scores on the dependent variable DV1 lay farther from the mean were assigned to group 1, and some of the subjects whose scores lay closer to the mean were assigned to group 0.

Slide 56

Assumption of Normality

The distributions and the tests of normality for the three metric variables are not affected by the change in the nonmetric independent variable.

Slide 57

Assumption of Linearity

The check of linearity is not affected by the change in the nonmetric independent variable.

Slide 58

Assumption of Homogeneity of Variance

The results of this change are shown in the following boxplot and test of homogeneity of variance, where the variance in group 1 is much larger than the variance in group 0.
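
A test of homogeneity of variance of this kind can be sketched in Python with NumPy. The slides do not name the test, so this assumes the mean-centered Levene statistic; the function name `levene_w` is hypothetical, and only the statistic is computed, not its significance.

```python
import numpy as np


def levene_w(*groups):
    """Mean-centered Levene statistic W for two or more groups.
    Larger values indicate more unequal spread across the groups."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    # Absolute deviation of each score from its own group mean
    z = [np.abs(g - g.mean()) for g in groups]
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    z_grand = np.concatenate(z).mean()
    between = sum(len(zi) * (zi.mean() - z_grand) ** 2 for zi in z)
    within = sum(((zi - zi.mean()) ** 2).sum() for zi in z)
    return (n_total - k) / (k - 1) * between / within
```

To judge significance, W is referred to an F distribution with k - 1 and N - k degrees of freedom, where k is the number of groups and N the total number of cases.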

Slide 59

Differences in Correlations with the Nonmetric Independent Variable with Homogeneous and Heterogeneous Variance

The correlation between the homogeneous version of the IV3 variable and the dependent variable DV1 (.321) is higher than the correlation between IV3UNEQ and the dependent variable DV1 (.073).

Slide 60

Regression Results

The strength of the overall relationship between the dependent variable and the set of independent variables declined from an R² of .951 to an R² of .876, consistent with the decrease in the correlation between IV3UNEQ and DV1. All of the independent variables retained their individual relationship with the dependent variable.

Slide 61

Residual Analysis

The residual plot looks very much like a null plot and the normality plot would support a conclusion of a normal distribution.

The residual plots for both metric variables do not show any pattern of nonlinearity.

Slide 62

Outliers and Influential Cases

In this analysis, we reverted to the normally distributed dependent variable, so we would anticipate results similar to the prior results obtained with the same form of the dependent variable.

No casewise plot was printed in the output, indicating that no case had a standardized residual larger than +/- 3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 2.985 and the smallest standardized residual is -2.702.

Slide 63

Outliers and Influential Cases

In the table of extreme values for the probability of Mahalanobis D², we find additional cases that are potential outliers for the combination of the set of independent variables. 

Since we have 100 cases and three independent variables in the analysis, the criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. For this problem the criterion is 4/(100 - 3 - 1) = 0.042.

Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that at least five cases have a Cook's distance larger than 0.042. The method used to change the variance of the two groups on the IV3 variable, reassigning cases in the tails of the distribution of the dependent variable to a different group, contributed to the presence of influential cases.

Slide 64

Summary Table

The following table summarizes the changes that we have seen in our diagnostic plots and statistics with each change we have made to the dependent or independent variables:

