04/21/23 Slide 1
Extending the relationships found in linear regression to a population is procedurally similar to what we have done for t-tests and chi-square tests.
In regression, the null hypothesis is that there is no relationship between the dependent and independent variables. When there is no relationship, the predicted values for the dependent variable are the same for all values of the independent variable. For this to happen, the slope in the regression equation would have to be zero, i.e. estimated dependent variable = intercept + 0 x independent variable. The value of the independent variable would be multiplied by zero and would not change the prediction. The null hypothesis of no relationship therefore translates to slope = 0, or b = 0. Without a relationship, our best estimate of the value of the dependent variable is the mean of the dependent variable (best = smallest total error).
The alternative hypothesis is that there is a relationship, i.e. knowing the value of the independent variable helps us do a more accurate job of predicting values of the dependent variable (more accurate = less total error).
If we reject the null hypothesis, we interpret the strength and direction of the relationship for the population represented by the sample. If we fail to reject the null hypothesis, we find that the data does not support the research hypothesis.
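The slope test described above can be sketched in Python with scipy; this is an illustrative example with made-up data, not the SPSS procedure used in the tutorial.

```python
# Sketch of a regression slope t-test with scipy. The x and y values are
# placeholders for the independent and dependent variables in a real data set.
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

result = stats.linregress(x, y)
# result.slope is b; result.pvalue tests H0: slope = 0
if result.pvalue <= 0.05:
    print(f"Reject H0: slope = {result.slope:.3f}, p = {result.pvalue:.4f}")
else:
    print("Fail to reject H0: no evidence of a relationship")
```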
04/21/23 Slide 2
To test the inference in linear regression, we are required to satisfy the conditions stated for linear regression (linearity, equal variance of the residuals, and an absence of outliers).
In addition, to use the normal distribution to accurately compute probabilities for the statistical test, the distribution of the residuals must be normal. Support for the normality of the residuals mirrors the criteria used for the normality of the dependent variable in t-tests – the variables are normally distributed, or if they are not, the sample size is large enough to apply the Central Limit theorem.
Since it is difficult to accurately evaluate the scatterplots to support equality of variance and normality of residuals, we introduce the use of diagnostic statistical tests which provide the same numeric criteria for making decisions we use in hypothesis tests.
We will use the Breusch-Pagan test for evaluating equality of variance for the residuals and the Shapiro-Wilk test for normality.
Diagnostic tests have a null hypothesis that the data meets the condition we are testing for, e.g. equality of variance or normality. Rejection of the null hypothesis implies that the condition is not satisfied.
04/21/23 Slide 3
Our objective in these tests is to fail to reject the null hypothesis, i.e. conclude that the variance in the residuals is uniform or that the residuals are normally distributed. The goal is thus the opposite of what we hope to find in ordinary hypothesis tests. Our purpose is to assess or diagnose our data rather than to make inferences about the population.
SPSS computes the Shapiro-Wilk test, but does not compute the Breusch-Pagan test.
The script for Simple Linear Regression has been modified to include the Breusch-Pagan test in a table of statistics for homoscedasticity. The modified script file is named SimpleLinearRegressionInferenceTest.SBS and is available on the course web site.
Due to the difficulties in running scripts, I have also provided a syntax file that computes the Breusch-Pagan statistic and probability. Syntax files do not usually have the problems running on different versions of SPSS that we experience with script files, but they are more cumbersome to use. A demonstration of the syntax file is included in this tutorial. The syntax file is named BreuschPaganSyntax.sps and is available on the course web site.
There is an SPSS macro on the web for computing Breusch-Pagan, but I find that it does not produce correct answers (or at least not the same answers as SAS and R).
While I would usually set a more conservative alpha of 0.01 for diagnostic tests to make sure we only respond to serious violations, we will use 0.05 for this week’s problems.
04/21/23 Slide 4
The introductory statement in the question indicates:
• The data set to use (world2007.sav)
• The task to accomplish (a regression slope t-test)
• The variables to use in the analysis: the independent variable slum population as percentage of urban population [slumpct] and the dependent variable infant mortality rate [infmort]
• The alpha level of significance for the hypothesis test: 0.05
• The criteria for evaluating strength: Cohen's criteria
04/21/23 Slide 5
These problems also contain a second paragraph of instructions that provides the formulas to use if the analysis requires us to re-express or transform the variables to satisfy the conditions for linear regression.
04/21/23 Slide 6
The first statement asks about the level of measurement. The t-test of a regression slope requires that both the dependent variable and the independent variable be quantitative.
04/21/23 Slide 7
Since both the independent variable slum population as percentage of urban population [slumpct] and the dependent variable infant mortality rate [infmort] are quantitative, we mark the check box for a correct answer.
04/21/23 Slide 8
The next statement asks about the size of the sample. To answer this question, we run the linear regression in SPSS.
04/21/23 Slide 9
To compute a simple linear regression, select Regression> Linear from the Analyze menu.
04/21/23 Slide 10
First, move the dependent variable, infmort, to the Dependent text box.
Second, move the independent variable, slumpct, to the Independent(s) list box.
Third, click on the Statistics button to request basic descriptive statistics.
04/21/23 Slide 11
First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis.
Second, click on the Casewise diagnostics check box to produce the table with information about outliers and influential cases.
Third, click on the Continue button to close the dialog box.
04/21/23 Slide 12
Next, click on the Plots button to request the residual plot.
04/21/23 Slide 13
First, move *ZRESID (for standardized residuals) to the Y axis text box.
Second, move *ZPRED (for standardized predictions) to the X axis text box.
Third, mark the check box for a histogram and a normal probability plot of the residuals.
Fourth, click on the Continue button to close the dialog box.
04/21/23 Slide 14
Next, click on the Save button to include Cook's distance in the output.
04/21/23 Slide 15
Mark the check box for Cook's Distances to include this value in the data view and the output.
Mark the check box for Standardized Residuals, which we will need in the test for the condition of normality of the residuals.
Click on the Continue button to close the dialog box.
04/21/23 Slide 16
Click on the OK button to request the output.
04/21/23 Slide 17
The number of cases with valid data to analyze the relationship between "slum population as percentage of urban population" and "infant mortality rate" was 99, out of the total of 192 cases in the data set.
04/21/23 Slide 18
The number of cases with valid data to analyze the relationship between "slum population as percentage of urban population" and "infant mortality rate" was 99, out of the total of 192 cases in the data set.
Mark the check box for a correct statement.
04/21/23 Slide 19
The next statement asks us to determine whether or not the data for the variables satisfies the conditions required for linear regression.
Making inferences about the population based on linear regression requires four conditions or assumptions: a linear relationship between the variables, equal variance of the residuals across the predicted values, no outliers or influential cases distorting the relationship, and a normal distribution for the residuals.
04/21/23 Slide 20
To evaluate the linearity condition, we create a scatterplot.
To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.
04/21/23 Slide 21
In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create.
Click on the Define button to go to the next step.
04/21/23 Slide 22
First, move the dependent variable infmort to the Y axis text box.
Second, move the independent variable slumpct to the X axis text box.
Third, click on the OK button to produce the plot.
04/21/23 Slide 23
The scatterplot appears in the SPSS output window.
To facilitate our determination about the linearity of the plot, we will add a linear fit line, a loess fit line, and a confidence interval to the plot.
See slides 8 through 18 in the powerpoint titled: SimpleLinearRegression-Part2.ppt for directions on adding the fit lines and confidence interval to the plot.
04/21/23 Slide 24
The criterion we use for evaluating linearity is a comparison of the loess fit line to the linear fit line. If the loess fit line falls within a 99% confidence interval around the linear fit line, we characterize the relationship as linear. Minor fluctuations over the lines of the confidence interval are ignored.
The loess fit line in the scatterplot of the relationship between "slum population as percentage of urban population" and "infant mortality rate" does not lie within the confidence interval around the linear fit line. The pattern of points in the scatterplot shows an obvious curve, indicating non-linearity.
We will re-express one or both variables if they are badly skewed to see if the relationship using transformed variables satisfies the assumption of linearity.
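The loess-versus-linear comparison above is done visually in SPSS, but the same idea can be sketched numerically with statsmodels; the data below are synthetic, with an obvious curve like the raw infmort/slumpct relationship.

```python
# Numeric sketch of comparing a loess fit to a linear fit. Assumes
# statsmodels is installed; the data are illustrative, not world2007.sav.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = x**2 + rng.normal(0, 1, 50)        # deliberately curved relationship

smoothed = lowess(y, x, frac=0.5)       # columns: sorted x, loess fit
b1, b0 = np.polyfit(x, y, 1)            # ordinary linear fit
linear_fit = b0 + b1 * smoothed[:, 0]

# Large systematic gaps between the loess fit and the linear fit
# suggest the linearity condition is violated.
max_gap = np.max(np.abs(smoothed[:, 1] - linear_fit))
print(f"Largest loess-linear gap: {max_gap:.2f}")
```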
04/21/23 Slide 25
Since we did not satisfy the linearity condition, the statement is not marked.
We do not need to test the other conditions, since we know we will not meet all of them.
We will re-express one or both variables if they are badly skewed to see if the relationship using transformed variables satisfies the assumption of linearity.
04/21/23 Slide 26
When the raw data does not satisfy the conditions of linearity and equal variance, we examine the skewness of the variables to identify problematic skewing for one or both variables that might be corrected with re-expression.
This statement suggests that the correct transformation should be a log of infant mortality rate.
We should re-express variables that have skewness equal to or less than -1.0 or equal to or greater than +1.0.
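The skewness screening rule above can be sketched with scipy; the sample values below are illustrative, not the actual world2007.sav data.

```python
# Sketch of the re-expression rule: transform a variable only when
# its skewness is <= -1.0 or >= +1.0. The data are made up.
from scipy.stats import skew

infmort_like = [3, 4, 5, 6, 8, 10, 15, 25, 60, 120]   # strongly right-skewed
slumpct_like = [10, 20, 30, 40, 50, 60, 70, 80]        # roughly symmetric

for name, values in [("infmort", infmort_like), ("slumpct", slumpct_like)]:
    s = skew(values)
    action = "re-express" if abs(s) >= 1.0 else "leave as is"
    print(f"{name}: skewness = {s:.2f} -> {action}")
```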
04/21/23 Slide 27
We will use the Descriptives procedure to obtain skewness for both variables.
Select Descriptive Statistics > Descriptives from the Analyze menu.
04/21/23 Slide 28
First, move the variables infmort and slumpct to the Variable(s) list box.
Second, click on the Options button to specify our choice for statistics.
04/21/23 Slide 29
Next, mark the check boxes for Kurtosis and Skewness in addition to the defaults marked by SPSS.
Finally, click on the Continue button to close the dialog box.
04/21/23 Slide 30
Click on the OK button to produce the output.
04/21/23 Slide 31
The skewness for "infant mortality rate" [infmort] was 1.470. The skewness for "slum population as percentage of urban population" [slumpct] was -0.178.
Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale.
Since the skew for the independent variable "slum population as percentage of urban population" [slumpct] (-0.178) was between -1.0 and +1.0, we do not attempt to correct violation of assumptions by re-expressing it.
04/21/23 Slide 32
Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale.
We mark the statement as correct.
04/21/23 Slide 33
The next statement asks us to determine whether or not the data using the re-expressed variable satisfies the conditions required for linear regression.
We check to see if the re-expressed variables satisfy the four conditions or assumptions required to make inferences about the population based on linear regression: a linear relationship between the variables, equal variance of the residuals across the predicted values, no outliers or influential cases distorting the relationship, and a normal distribution for the residuals.
04/21/23 Slide 34
We first create the transformed variable, the logarithm of infmort.
Select the Compute Variable command from the Transform menu.
04/21/23 Slide 35
First, type the name for the re-expressed variable in the Target Variable text box.
The directions for the problem give us the formula for the transformation:
The formulas to transform "infant mortality rate" are "LG10(infmort)" and "(infmort)**2".
Second, type the formula in the Numeric Expression text box.
Third, click on the OK button to compute the transformation.
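The SPSS compute command LG10(infmort) has a direct equivalent in Python; the values below are made up for illustration, and the column name infmort is taken from the problem.

```python
# Sketch of the log re-expression LG10(infmort) using numpy.
import numpy as np

infmort = np.array([4.0, 12.5, 55.0, 160.0])   # illustrative rates
LG_infmort = np.log10(infmort)                  # base-10 log, as in SPSS LG10
print(LG_infmort)
```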
04/21/23 Slide 36
Next, we create the scatterplot for the relationship with the re-expressed variable.
To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.
04/21/23 Slide 37
In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create.
Click on the Define button to go to the next step.
04/21/23 Slide 38
First, move the dependent variable LG_infmort to the Y axis text box.
Second, move the independent variable slumpct to the X axis text box.
Third, click on the OK button to produce the plot.
04/21/23 Slide 39
The scatterplot looks linear, but to make sure we will add fit lines and a confidence interval.
The criterion we use for evaluating linearity is a visual comparison of the loess fit line to the linear fit line. If the loess fit line falls within the 99% confidence interval around the linear fit line, we characterize the relationship as linear. Minor fluctuations within the confidence interval or over the boundary of the confidence interval are ignored.
04/21/23 Slide 40
The loess fit line in the scatterplot of the relationship between "slum population as percentage of urban population" and the log transformation of "infant mortality rate" lies within the confidence interval around the linear fit line. The relationship is sufficiently linear to satisfy the assumption of linearity.
04/21/23 Slide 41
We next do the regression analysis using the transformed variable, creating the residual plot and the normality plot in the process.
To compute a simple linear regression, select Regression > Linear from the Analyze menu.
04/21/23 Slide 42
First, move the dependent variable, LG_infmort, to the Dependent text box.
Second, move the independent variable, slumpct, to the Independent(s) list box.
Third, click on the Statistics button to request basic descriptive statistics.
04/21/23 Slide 43
First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis.
Second, click on the Casewise diagnostics check box to produce the table with information about outliers and influential cases.
Third, click on the Continue button to close the dialog box.
04/21/23 Slide 44
Next, click on the Plots button to request the residual plot and the normality plot.
04/21/23 Slide 45
First, move *ZRESID (for standardized residuals) to the Y axis text box.
Second, move *ZPRED (for standardized predictions) to the X axis text box.
Third, mark the check box for a histogram and a normal probability plot of the residuals.
Fourth, click on the Continue button to close the dialog box.
04/21/23 Slide 46
Next, click on the Save button to include Cook’s distance in the output.
04/21/23 Slide 47
Mark the check box for Cook's Distances to include this value in the data view and the output.
Mark the check box for Standardized Residuals, which we will need to test for the condition of normality of the residuals.
Click on the Continue button to close the dialog box.
04/21/23 Slide 48
Click on the OK button to request the output.
04/21/23 Slide 49
The criterion we use for evaluating equal variance is a visual inspection of the residual plot to determine whether the horizontal pattern of the points is more rectangular or more funnel shaped, i.e. narrowly spread at one end of the plot and widely spread at the other end. If the plot of the residuals is more rectangular, the assumption of equal variance is satisfied. If the plot of the residuals is more funnel-shaped, the assumption of equal variance is not satisfied.
04/21/23 Slide 50
Because it is often difficult to distinguish when the pattern of the points is rectangular or funnel-shaped, we will supplement the evaluation of equal variance with a diagnostic statistical test: the Breusch-Pagan test. The Breusch-Pagan statistic tests the null hypothesis that the variance of the residuals is the same for all values of the independent variable. When the probability of the Breusch-Pagan statistic is less than or equal to alpha, we reject the null hypothesis, supporting a finding that the variance of the residuals differs across values of the independent variable, and we do not satisfy the equal variance assumption.
04/21/23 Slide 51
Download the syntax file, BreuschPaganSyntax.SPS, from the course web site.
To use the syntax file, select Open > Syntax from the File menu.
04/21/23 Slide 52
Highlight the syntax file, BreuschPaganSyntax.SPS.
Click on the Open button to open the syntax file.
04/21/23 Slide 53
The file opens in the SPSS Syntax Editor.
The syntax file uses the Data Editor to store its results, creating all of these additional variables. The DELETE commands remove the extra variables. If the syntax is run before these variables exist, SPSS will issue warning messages, which have no real consequence.
If the file is run more than once without the DELETE commands, SPSS will generate a number of warning messages that it will not replace variables that were previously created, and we may not be looking at the correct results for our problem.
We need to replace the names for the dependent and independent variables.
Highlight the text for dependentVariableName.
04/21/23 Slide 54
Type the name of the dependent variable, LG_infmort.
Highlight the text for independentVariableName.
04/21/23 Slide 55
First, replace the highlighted text with the name of the independent variable.
Entering the names of the variables is all that we need to change.
Second, select All from the Run menu to execute the commands in the syntax file.
Note: be careful that the periods at the end of the command lines are not deleted.
04/21/23 Slide 56
Since we had not run the syntax file before, SPSS produces a warning message for each of the variable names on the DELETE commands. It thinks that we are asking it to delete a variable that does not exist and it wants to let us know.
These warning messages have no consequence.
04/21/23 Slide 57
The syntax file added all of these variables (and more to the left) to the data editor.
The syntax file omits cases with missing data from the analysis.
04/21/23 Slide 58
The variable bp contains the Breusch-Pagan statistic and the column bpSig contains the p-value for the statistic.
The interpretation of equal variance based on visual inspection of the residual plot is supported by the Breusch-Pagan statistic of 3.300 with a probability of p = .069, greater than the alpha of p = .050. The null hypothesis is not rejected, and the assumption of equal variance is supported.
Having satisfied the condition for equal variance, we next check for influential cases.
04/21/23 Slide 59
Outliers and influential cases can alter the regression model that would otherwise represent the majority of cases in the analysis. SPSS will save Cook's distances, a measure of influence, to the data editor so we can identify cases that have a large Cook's distance. We will operationally define a large Cook's distance as a value of 0.5 or more.
When we ran the regression using LG_infmort as the dependent variable, we requested that Cook’s distances be saved to the Data Editor and that our output include Casewise diagnostics.
In the table titled “Residuals Statistics”, we see that the maximum Cook’s distance was .152, less than the criterion of 0.5.
In this problem, no cases had a Cook's distance of 0.5 or greater, so none qualified as influential cases.
Since we have no outliers or influential cases, we will test the final condition, normality of the residuals.
04/21/23 Slide 60
The linear regression model expects the residuals to have a normal distribution. The distribution of the residuals is evaluated with the normality plot which compares the points for the actual distribution of the cases to a diagonal line that represents the expected pattern for a normally distributed variable. If the points deviate substantially and consistently from the diagonal line, the residuals are not normally distributed. Minor fluctuations around the line or at either end of the line can be ignored.
In this problem, the plot of standardized regression residuals follows the diagonal, indicating that the residuals are normally distributed.
04/21/23 Slide 61
Because it is often difficult to distinguish whether or not points deviate substantially and consistently from the diagonal line, we will supplement the evaluation of normality of residuals with a diagnostic statistical test: the Shapiro-Wilk test. The Shapiro-Wilk statistic tests the null hypothesis that the distribution of the residuals is normal. When the probability of the Shapiro-Wilk statistic is less than or equal to alpha, we reject the null hypothesis, supporting a finding that the residuals are not normally distributed, and we do not satisfy the assumption of normality.
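The Shapiro-Wilk test is also available in scipy; this sketch uses synthetic standardized residuals rather than the actual ZRE_2 values from the problem.

```python
# Sketch of the Shapiro-Wilk normality test with scipy.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(3)
residuals = rng.standard_normal(99)   # stand-in for saved standardized residuals

stat, p = shapiro(residuals)
# p > alpha: fail to reject H0 that the residuals are normal
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")
```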
04/21/23 Slide 62
The normality tests are part of the Explore procedure.
Select Descriptive Statistics > Explore from the Analyze menu.
04/21/23 Slide 63
The normal condition requires that the residuals be normally distributed. We saved standardized residuals when we ran the regression.
The correct choice is the standardized residuals from the second analysis (ZRE_2), in which we used the transformed variable, LG_infmort. If we had satisfied the regression conditions without re-expressing the data, we would not have run the second regression and would have selected ZRE_1 to test for normality.
Move the variable ZRE_2 to the Dependent List.
The normality statistical tests are included with the plots, so we click on the Plots button.
04/21/23 Slide 64
Mark the check box for Normality plots with tests.
Click on the Continue button to close the dialog box.
04/21/23 Slide 65
Click on the OK button to produce the output.
04/21/23 Slide 66
The interpretation of normal residuals is supported by the Shapiro-Wilk statistic of 0.989 with a probability of p = .612, greater than the alpha of p = .050.
The null hypothesis is not rejected, and the assumption of normal residuals is supported.
04/21/23 Slide 67
We have satisfied all four of the conditions for making inferences based on linear regression.
Mark the check box for a correct answer.
04/21/23 Slide 68
When the p-value for the statistical test is less than or equal to alpha, we reject the null hypothesis and interpret the results of the test. If the p-value is greater than alpha, we fail to reject the null hypothesis and do not interpret the result.
04/21/23 Slide 69
The p-value for this test (p < .001) is less than or equal to the alpha level of significance (p = .050) supporting the conclusion to reject the null hypothesis.
04/21/23 Slide 70
The p-value for this test (p < .001) is less than or equal to the alpha level of significance (p = .050) supporting the conclusion to reject the null hypothesis.
Mark the question as correct.
Rejection of the null hypothesis supports the research hypothesis and we interpret the results.
04/21/23 Slide 71
Since we know that we re-expressed the data to satisfy the conditions for linear regression, we skip the question that interprets the raw variables.
04/21/23 Slide 72
The final question focuses on the strength and direction of the relationship.
04/21/23 Slide 73
The strength of the relationship is based on the multiple R statistic in the Model Summary table.
Applying Cohen's criteria for effect size (less than ±0.10 = trivial; ±0.10 up to ±0.30 = weak or small; ±0.30 up to ±0.50 = moderate; ±0.50 or greater = strong or large), the relationship was correctly characterized as a strong relationship (R = .795).
Note: in SPSS output, the R statistic is always positive, so it does not show the direction of the relationship. The direction of the relationship is based on the b coefficient.
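The interpretation rule above, combining Cohen's strength cutoffs with the sign of the b coefficient, can be sketched as a small Python function; the cutoffs are taken from this tutorial.

```python
# Sketch of interpreting strength (Cohen's criteria) and direction
# (sign of the b coefficient) for a regression relationship.
def interpret(r, b):
    size = abs(r)
    if size < 0.10:
        strength = "trivial"
    elif size < 0.30:
        strength = "weak"
    elif size < 0.50:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if b > 0 else "negative"
    return strength, direction

print(interpret(0.795, 0.01))   # the R and b reported for this problem
```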
04/21/23 Slide 74
Since the sign of the b coefficient was positive (b = .01), the relationship is positive and the values for the variables move in the same direction. Higher scores on the variable "slum population as percentage of urban population" were associated with higher scores on the log transformation of "infant mortality rate".
04/21/23 Slide 75
The strength and direction of the relationship were both correctly stated.
The question is marked as correct.
04/21/23 Slide 76
Logic outline for homework problems
• Both variables are quantitative?
– Yes: mark the statement check box.
– No: do not mark the check box; mark only “None of the above.” Stop.
• Number of valid cases stated correctly?
– Yes: mark the statement check box.
– No: do not mark the check box.
04/21/23 Slide 77
Check the four regression conditions:
• Relationship between variables is linear? (linear pattern in scatterplot)
• Variance of residuals is homogeneous? (residual plot and Breusch-Pagan test)
• No outliers impacting regression solution? (Cook’s distance < 0.5)
• Residuals are normally distributed? (normality plot and Shapiro-Wilk test, or Central Limit Theorem)
If Yes to all four, mark the check box for regression conditions. If No to any, do not mark the check box.
04/21/23 Slide 78
• Skew of variables ≤ -1.0 or ≥ +1.0?
– Yes: re-express the badly skewed variables.
– No: do not mark the re-expression check box. Stop. With no skewed variables, we do not have a strategy for meeting conditions.
Since we have satisfied the regression conditions, the question on re-expressing data is skipped.
04/21/23 Slide 79
Re-check the four regression conditions with the re-expressed variable:
• Relationship between variables is linear? (linear pattern in scatterplot)
• Variance of residuals is homogeneous? (residual plot and Breusch-Pagan test)
• No outliers impacting regression solution? (Cook’s distance < 0.5)
• Residuals are normally distributed? (normality plot and Shapiro-Wilk test, or Central Limit Theorem)
If Yes to all four, mark the check box for regression conditions. If No to any, do not mark the check box and stop: we can’t meet the conditions.
Since we have satisfied the regression conditions, we do not re-express and do not check these conditions.
04/21/23 Slide 80
• Reject H0 is the correct decision (p ≤ alpha)?
– Yes: mark the statement check box.
– No: do not mark the check box. Stop. We interpret results only if we reject the null hypothesis.
• Interpretation is stated correctly?
– Yes: mark the statement check box.
– No: do not mark the check box.
The interpretation is stated for both raw data and re-expressed data.