
Slide 1

Multiple Regression – Assumptions and Outliers

Multiple Regression and Assumptions

Multiple Regression and Outliers

Strategy for Solving Problems

Practice Problems


Slide 2

Multiple Regression and Assumptions

Multiple regression is most effective at identifying relationships between a dependent variable and a combination of independent variables when its underlying assumptions are satisfied: each of the metric variables is normally distributed, the relationships between metric variables are linear, and the relationship between metric and dichotomous variables is homoscedastic.

Failing to satisfy the assumptions does not mean that our answer is wrong. It means that our solution may under-report the strength of the relationships.


Slide 3

Multiple Regression and Outliers

Outliers can distort the regression results. When an outlier is included in the analysis, it pulls the regression line towards itself. This can result in a solution that is more accurate for the outlier, but less accurate for all of the other cases in the data set.

We will check for univariate outliers on the dependent variable and multivariate outliers on the independent variables.


Slide 4

Relationship between assumptions and outliers

The problems of satisfying assumptions and detecting outliers are intertwined. For example, if a case has a value on the dependent variable that is an outlier, it will affect the skew, and hence, the normality of the distribution.

Removing an outlier may improve the distribution of a variable.

Transforming a variable may reduce the likelihood that the value for a case will be characterized as an outlier.


Slide 5

Order of analysis is important

The order in which we check assumptions and detect outliers will affect our results because we may get a different subset of cases in the final analysis.

In order to maximize the number of cases available to the analysis, we will evaluate assumptions first. We will substitute any transformations of variables that enable us to satisfy the assumptions.

We will use any transformed variables that are required in our analysis to detect outliers.


Slide 6

Strategy for solving problems

Our strategy for solving problems about violations of assumptions and outliers will include the following steps:

1. Run the type of regression specified in the problem statement on the variables, using the full data set.

2. Test the dependent variable for normality. If it satisfies the criteria for normality only when transformed, substitute the transformed variable in the remaining tests that call for the use of the dependent variable.

3. Test for normality, linearity, and homoscedasticity using the scripts. Decide which transformations should be used.

4. Substitute the transformations and run the regression entering all independent variables, saving the studentized residuals and Mahalanobis distance scores. Compute probabilities for D².

5. Remove the outliers (studentized residual greater than 3 or Mahalanobis D² with p <= 0.001), and run the regression with the method and variables specified in the problem.

6. Compare R² for the analysis using transformed variables and omitting outliers (step 5) to the R² obtained for the model using all data and the original variables (step 1).


Slide 7

Transforming dependent variables

We will use the following logic to transform variables:

If the dependent variable is not normally distributed: try the log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria.

If no transformation satisfies the normality criteria, use the untransformed variable and add a caution for the violation of the assumption.

If a transformation satisfies normality, use the transformed variable in the tests of the independent variables.
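For reference, the transformations listed on these slides can be created directly with COMPUTE commands. This is a minimal sketch in SPSS syntax for a positively skewed variable such as EARNRS; logearn is the name used later in this deck, while sqrearn, invearn, and sq_earn are hypothetical names, and the constant 1 is added because EARNRS contains zero values. Negatively skewed variables are reflected first, e.g. LG10(24 - rincom98) for RINCOM98.

* Logarithm, square root, and inverse transformations (1 is added because EARNRS has zeros).
COMPUTE logearn = LG10(1 + earnrs).
COMPUTE sqrearn = SQRT(1 + earnrs).
COMPUTE invearn = -1 / (1 + earnrs).
* Square transformation, tried only as a remedy for nonlinearity.
COMPUTE sq_earn = earnrs ** 2.
EXECUTE.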


Slide 8

Transforming independent variables - 1

If independent variable is normally distributed and linearly related to dependent variable, use as is.

If the independent variable is normally distributed but not linearly related to the dependent variable: try the log, square root, square, and inverse transformations. Use the first transformed variable that satisfies the linearity criteria and does not violate the normality criteria.

If no transformation satisfies the linearity criteria without violating the normality criteria, use the untransformed variable and add a caution for the violation of the assumption.


Slide 9

Transforming independent variables - 2

If the independent variable is linearly related to the dependent variable but not normally distributed: try the log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria, retains a significant correlation, and does not reduce the strength of the relationship.

If no transformation satisfies the normality criteria with a significant correlation, use the untransformed variable and add a caution for the violation of the assumption.


Slide 10

Transforming independent variables - 3

If the independent variable is not linearly related to the dependent variable and not normally distributed: try the log, square root, square, and inverse transformations. Use the first transformed variable that satisfies the normality criteria and has a significant correlation.

If no transformation satisfies the normality criteria with a significant correlation, use the untransformed variable and add a caution for the violation of the assumption.


Slide 11

Impact of transformations and omitting outliers

We evaluate the regression assumptions and detect outliers with a view toward strengthening the relationship.

This may not happen. The regression may be the same, it may be weaker, or it may be stronger. We cannot be certain of the impact until we run the regression again.

In the end, we may opt not to exclude outliers and not to employ transformations; the analysis informs us of the consequences of doing either.


Slide 12

Notes

Whenever you start a new problem, make sure you have removed variables created for previous analysis and have included all cases back into the data set.
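If the transformed variables and outlier statistics were added by the scripts or saved by regression, this reset can also be done in syntax. A minimal sketch, assuming the hypothetical variable names used in this deck (logearn, sre_1, mah_1, p_mah_1):

* Turn off any case filter and restore all cases.
USE ALL.
FILTER OFF.
* Remove the variables created during the previous analysis.
DELETE VARIABLES logearn sre_1 mah_1 p_mah_1.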

I have added the square transformation to the checkboxes for transformations in the normality script. Since this is an option for linearity, we need to be able to evaluate its impact on normality.

If you change the options for output in pivot tables from labels to names, you will get an error message when you use the linearity script. To solve the problem, change the option for output in pivot tables back to labels.


Slide 13

Problem 1

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.

The research question requires us to identify the best subset of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98].

After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic


Slide 14

Dissecting problem 1 - 1


The problem may give us different levels of significance for the analysis.

In this problem, we are told to use 0.01 as alpha for the regression analysis as well as for testing assumptions.


Slide 15

Dissecting problem 1 - 2


The method for selecting variables is derived from the research question.

In this problem we are asked to identify the best subset of predictors, so we do a stepwise multiple regression.


Slide 16

Dissecting problem 1 - 3


The purpose of testing for assumptions and outliers is to identify a stronger model. The main question to be answered in this problem is whether or not the use of transformed variables to satisfy assumptions and the removal of outliers improve the overall relationship between the independent variables and the dependent variable, as measured by R².

Specifically, the question asks whether or not the R² for a regression analysis after substituting transformed variables and eliminating outliers is 10.8% higher than a regression analysis using the original format for all variables and including all cases.


Slide 17

R² before transformations or removing outliers

To start out, we run a stepwise multiple regression analysis with income98 as the dependent variable and sex, earnrs, and rincom98 as the independent variables.

We select stepwise as the method to select the best subset of predictors.
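The same analysis can be pasted as syntax from the dialog. A sketch of the command, assuming the default stepwise entry and removal criteria:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT income98
  /METHOD=STEPWISE sex earnrs rincom98.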


Slide 18

R² before transformations or removing outliers

Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%. This is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.


Slide 19

R² before transformations or removing outliers

For this particular question, we are not interested in the statistical significance of the overall relationship prior to transformations and removing outliers. In fact, it is possible that the relationship is not statistically significant due to variables that are not normal, relationships that are not linear, and the inclusion of outliers.


Slide 20

Normality of the dependent variable: total family income

In evaluating assumptions, the first step is to examine the normality of the dependent variable. If it is not normally distributed, or cannot be normalized with a transformation, it can affect the relationships with all other variables.

To test the normality of the dependent variable, run the script: NormalityAssumptionAndTransformations.SBS

First, move the dependent variable INCOME98 to the list box of variables to test.

Second, click on the OK button to produce the output.
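The descriptives that the normality script reports can also be produced with the EXAMINE procedure (the Explore dialog). A sketch:

EXAMINE VARIABLES=income98
  /PLOT NPPLOT
  /STATISTICS DESCRIPTIVES.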


Slide 21

Normality of the dependent variable: total family income

Descriptives: TOTAL FAMILY INCOME
  Mean                              15.67  (Std. Error .349)
  95% Confidence Interval for Mean  14.98 to 16.36
  5% Trimmed Mean                   15.95
  Median                            17.00
  Variance                          27.951
  Std. Deviation                    5.287
  Minimum                           1
  Maximum                           23
  Range                             22
  Interquartile Range               8.00
  Skewness                          -.628  (Std. Error .161)
  Kurtosis                          -.248  (Std. Error .320)

The dependent variable "total family income" [income98] satisfies the criteria for a normal distribution. The skewness (-0.628) and kurtosis (-0.248) were both between -1.0 and +1.0. No transformation is necessary.


Slide 22

Linearity and independent variable: how many in family earned money

To evaluate the linearity of the relationship between number of earners and total family income, run the script for the assumption of linearity:

LinearityAssumptionAndTransformations.SBS

First, move the dependent variable INCOME98 to the text box for the dependent variable.

Second, move the independent variable, EARNRS, to the list box for independent variables.

Third, click on the OK button to produce the output.
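The core of the linearity script's output is a correlation matrix of the dependent variable with the independent variable and its transformations. A sketch of the equivalent syntax, assuming the transformed variables (e.g. logearn) have already been computed:

CORRELATIONS
  /VARIABLES=income98 earnrs logearn
  /PRINT=TWOTAIL NOSIG.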


Slide 23

Linearity and independent variable: how many in family earned money

Correlations with TOTAL FAMILY INCOME (Pearson r, 2-tailed significance, N = 228):
  HOW MANY IN FAMILY EARNED MONEY           .505**  (.000)
  Logarithm of EARNRS [LG10(1+EARNRS)]      .536**  (.000)
  Square of EARNRS [(EARNRS)**2]            .376**  (.000)
  Square Root of EARNRS [SQRT(1+EARNRS)]    .527**  (.000)
  Inverse of EARNRS [-1/(1+EARNRS)]         .526**  (.000)
  **. Correlation is significant at the 0.01 level (2-tailed).

The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the independent variable "how many in family earned money" [earnrs] and the dependent variable "total family income" [income98] was the statistical significance of the correlation coefficient (r = 0.505). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.


Slide 24

Normality of independent variable: how many in family earned money

After evaluating the dependent variable, we examine the normality of each metric variable and linearity of its relationship with the dependent variable.

To test the normality of number of earners in family, run the script: NormalityAssumptionAndTransformations.SBS

First, move the independent variable EARNRS to the list box of variables to test.

Second, click on the OK button to produce the output.


Slide 25

Normality of independent variable: how many in family earned money

Descriptives: HOW MANY IN FAMILY EARNED MONEY
  Mean                              1.43  (Std. Error .061)
  95% Confidence Interval for Mean  1.31 to 1.56
  5% Trimmed Mean                   1.37
  Median                            1.00
  Variance                          1.015
  Std. Deviation                    1.008
  Minimum                           0
  Maximum                           5
  Range                             5
  Interquartile Range               1.00
  Skewness                          .742  (Std. Error .149)
  Kurtosis                          1.324  (Std. Error .296)

The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality.

In evaluating normality, the skewness (0.742) was between -1.0 and +1.0, but the kurtosis (1.324) was outside the range from -1.0 to +1.0.


Slide 26

Normality of independent variable: how many in family earned money

The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98]. In evaluating normality, the skewness (-0.483) and kurtosis (-0.309) were both within the range of acceptable values from -1.0 to +1.0. The correlation coefficient for the transformed variable is 0.536.

The square root transformation also has values of skewness and kurtosis in the acceptable range.

However, by our order of preference for which transformation to use, the logarithm is preferred.


Slide 27

Transformation for how many in family earned money

The independent variable, how many in family earned money, had a linear relationship to the dependent variable, total family income.

The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98].

We will substitute the logarithmic transformation of how many in family earned money in the regression analysis.


Slide 28

Normality of independent variable: respondent’s income

After evaluating the dependent variable, we examine the normality of each metric variable and linearity of its relationship with the dependent variable.

To test the normality of respondent’s income, run the script: NormalityAssumptionAndTransformations.SBS

First, move the independent variable RINCOM98 to the list box of variables to test.

Second, click on the OK button to produce the output.


Slide 29

Normality of independent variable: respondent’s income

Descriptives: RESPONDENTS INCOME
  Mean                              13.35  (Std. Error .419)
  95% Confidence Interval for Mean  12.52 to 14.18
  5% Trimmed Mean                   13.54
  Median                            15.00
  Variance                          29.535
  Std. Deviation                    5.435
  Minimum                           1
  Maximum                           23
  Range                             22
  Interquartile Range               8.00
  Skewness                          -.686  (Std. Error .187)
  Kurtosis                          -.253  (Std. Error .373)

The independent variable "income" [rincom98] satisfies the criteria for both the assumption of normality and the assumption of linearity with the dependent variable "total family income" [income98].

In evaluating normality, the skewness (-0.686) and kurtosis (-0.253) were both within the range of acceptable values from -1.0 to +1.0.


Slide 30

Linearity and independent variable: respondent’s income

To evaluate the linearity of the relationship between respondent’s income and total family income, run the script for the assumption of linearity:

LinearityAssumptionAndTransformations.SBS

First, move the dependent variable INCOME98 to the text box for the dependent variable.

Second, move the independent variable, RINCOM98, to the list box for independent variables.

Third, click on the OK button to produce the output.


Slide 31

Linearity and independent variable: respondent’s income

Correlations with TOTAL FAMILY INCOME (Pearson r, 2-tailed significance, N = 163):
  RESPONDENTS INCOME                           .577**  (.000)
  Logarithm of RINCOM98 [LG10(24-RINCOM98)]    -.595**  (.000)
  Square of RINCOM98 [(RINCOM98)**2]           .613**  (.000)
  Square Root of RINCOM98 [SQRT(24-RINCOM98)]  -.601**  (.000)
  Inverse of RINCOM98 [-1/(24-RINCOM98)]       -.434**  (.000)
  **. Correlation is significant at the 0.01 level (2-tailed).

The evidence of linearity in the relationship between the independent variable "income" [rincom98] and the dependent variable "total family income" [income98] was the statistical significance of the correlation coefficient (r = 0.577). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.


Slide 32

Homoscedasticity: sex

To evaluate the homoscedasticity of the relationship between sex and total family income, run the script for the assumption of homogeneity of variance:

HomoscedasticityAssumptionAndTransformations.SBS

First, move the dependent variable INCOME98 to the text box for the dependent variable.

Second, move the independent variable, SEX, to the list box for independent variables.

Third, click on the OK button to produce the output.


Slide 33

Homoscedasticity: sex

Based on the Levene Test, the variance in "total family income" [income98] is homogeneous for the categories of "sex" [sex].

The probability associated with the Levene statistic (0.031) is greater than the level of significance (0.01), so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.
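The Levene test that the homoscedasticity script reports can also be requested directly. A sketch using the ONEWAY procedure:

ONEWAY income98 BY sex
  /STATISTICS HOMOGENEITY.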


Slide 34

Adding a transformed variable

Even though we do not need a transformation for any of the variables in this analysis, we will demonstrate how to use a script, such as the normality script, to add a transformed variable to the data set, e.g. a logarithmic transformation for highest year of school.

First, move the variable that we want to transform to the list box of variables to test.

Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes.

Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable.

Fourth, click on the OK button to produce the output.


Slide 35

The transformed variable in the data editor

If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set.

Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.


Slide 36

The regression to identify outliers

We use the regression procedure to identify both univariate and multivariate outliers.

We start with the same dialog we used for the last analysis, in which income98 was the dependent variable and sex, earnrs, and rincom98 were the independent variables.

First, we substitute the logarithmic transformation of earnrs, logearn, into the list of independent variables.

Second, we change the method of entry from Stepwise to Enter so that all variables will be included in the detection of outliers.

Third, we want to save the calculated values of the outlier statistics to the data set. Click on the Save… button to specify what we want to save.


Slide 37

Saving the measures of outliers

First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set.

Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables.

Third, click on the OK button to complete the specifications.
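Pasted as syntax, the outlier-detection run might look like the following sketch. SPSS assigns the default names sre_1 and mah_1 to the saved statistics; logearn is the name used in this deck for the log transformation:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT income98
  /METHOD=ENTER sex logearn rincom98
  /SAVE SRESID MAHAL.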


Slide 38

The variables for identifying outliers

The values for identifying univariate outliers on the dependent variable are in a column which SPSS has named sre_1.

The values for identifying multivariate outliers on the independent variables are in a column which SPSS has named mah_1.


Slide 39

Computing the probability for Mahalanobis D²

To compute the probability of D², we will use an SPSS function in a Compute command.

First, select the Compute… command from the Transform menu.


Slide 40

Formula for probability for Mahalanobis D²

First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score.

Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.

Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.

Third, click on the OK button to signal completion of the compute variable dialog.
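As syntax, this is a single Compute command; 3 is the number of independent variables used to compute the distances:

COMPUTE p_mah_1 = 1 - CDF.CHISQ(mah_1, 3).
EXECUTE.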


Slide 41

Multivariate outliers

Using the probabilities computed in p_mah_1 to identify outliers, scroll down through the list of cases to see if we can find cases with a probability less than 0.001.

There are no outliers for the set of independent variables.


Slide 42

Univariate outliers

Similarly, we can scroll down the values of sre_1, the studentized residuals, to look for cases with values larger than ±3.0.

Based on these criteria, there are 4 cases with scores on the dependent variable sufficiently unusual to be considered outliers (case 20000357: studentized residual = 3.08; case 20000416: studentized residual = 3.57; case 20001379: studentized residual = 3.27; case 20002702: studentized residual = -3.23).


Slide 43

Omitting the outliers

To omit the outliers from the analysis, we select the cases that are not outliers.

First, select the Select Cases… command from the Data menu.


Slide 44

Specifying the condition to omit outliers

First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.

Second, click on the If… button to specify the criteria for inclusion in the analysis.


Slide 45

The formula for omitting outliers

To eliminate the outliers, we request the cases that are not outliers.

The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3 and the probability for Mahalanobis D² is higher than the level of significance, 0.001.

After typing in the formula, click on the Continue button to close the dialog box.
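Pasted as syntax, the selection might look like this sketch; filter_$ is the default name the Select Cases dialog assigns to the filter variable:

USE ALL.
COMPUTE filter_$ = (ABS(sre_1) < 3 AND p_mah_1 > 0.001).
FILTER BY filter_$.
EXECUTE.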


Slide 46

Completing the request for the selection

To complete the request, we click on the OK button.


Slide 47

The omitted multivariate outlier

SPSS identifies the excluded cases by drawing a slash mark through the case number. Most of the slashes are for cases with missing data, but we also see that the case with the low probability for Mahalanobis distance is included in those that will be omitted.


Slide 48

Running the regression without outliers

We run the regression again, excluding the outliers. Select the Regression | Linear command from the Analyze menu.


Slide 49

Opening the save options dialog

We specify the dependent and independent variables, substituting any transformed variables required by assumptions.

When we used regression to detect outliers, we entered all variables. Now we are testing the relationship specified in the problem, so we change the method to Stepwise.

On our last run, we instructed SPSS to save studentized residuals and Mahalanobis distance. To prevent these values from being calculated again, click on the Save… button.


Slide 50

Clearing the request to save outlier data

First, clear the checkbox for Studentized residuals.

Second, clear the checkbox for Mahalanobis distance.

Third, click on the OK button to complete the specifications.


Slide 51

Opening the statistics options dialog

Once we have removed outliers, we need to check the sample size requirement for regression.

Since we will need the descriptive statistics for this, click on the Statistics… button.


Slide 52

Requesting descriptive statistics

First, mark the checkbox for Descriptives.

Second, click on the Continue button to complete the specifications.


Slide 53

Requesting the output

Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.


Slide 54

Sample size requirement

Descriptive Statistics (Mean, Std. Deviation, N):
  TOTAL FAMILY INCOME                    17.09     4.073     159
  RESPONDENTS SEX                         1.55      .499     159
  RESPONDENTS INCOME                     13.76     5.133     159
  Logarithm of EARNRS [LG10(1+EARNRS)]   .424896   .1156559  159

The minimum ratio of valid cases to independent variables for stepwise multiple regression is 5 to 1. After removing 4 outliers, there are 159 valid cases and 3 independent variables.

The ratio of cases to independent variables for this analysis is 53.0 to 1, which satisfies the minimum requirement. In addition, the ratio of 53.0 to 1 satisfies the preferred ratio of 50 to 1.


Slide 55

Significance of regression relationship

ANOVA (Dependent Variable: TOTAL FAMILY INCOME)
                       Sum of Squares   df   Mean Square        F     Sig.
  Model 1  Regression       1122.398     1     1122.398   117.541   .000 (a)
           Residual         1499.187   157        9.549
           Total            2621.585   158
  Model 2  Regression       1572.722     2      786.361   116.957   .000 (b)
           Residual         1048.863   156        6.723
           Total            2621.585   158
  Model 3  Regression       1623.976     3      541.325    84.107   .000 (c)
           Residual          997.609   155        6.436
           Total            2621.585   158
  a. Predictors: (Constant), RESPONDENTS INCOME
  b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)]
  c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)], RESPONDENTS SEX

The probability of the F statistic (84.107) for the regression relationship which includes these variables is <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that there is no relationship between the best subset of independent variables and the dependent variable (R² = 0).

We support the research hypothesis that there is a statistically significant relationship between the best subset of independent variables and the dependent variable.


Slide 56

Increase in proportion of variance

Model Summary
             R     R Square   Adjusted R Square   Std. Error of the Estimate
  Model 1  .654a     .428           .424                 3.090
  Model 2  .775b     .600           .595                 2.593
  Model 3  .787c     .619           .612                 2.537
  a. Predictors: (Constant), RESPONDENTS INCOME
  b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)]
  c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)], RESPONDENTS SEX

Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%.

After transformed variables were substituted to satisfy assumptions and outliers were removed from the sample, the proportion of variance explained by the regression analysis was 61.9%, a difference of 10.8%.

The answer to the question is true with caution.

A caution is added because of the inclusion of ordinal level variables.


Slide 57

Problem 2

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.

The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].

After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic


Slide 58

Dissecting problem 2 - 1


The problem may give us different levels of significance for the analysis.

In this problem, we are told to use 0.05 as alpha for the regression analysis and the more conservative 0.01 as the alpha in testing assumptions.


Slide 59

Dissecting problem 2 - 2


The method for selecting variables is derived from the research question.

If we are asked to examine a relationship without any statement about control variables or the best subset of variables, we do a standard multiple regression.


Slide 60


Dissecting problem 2 - 3

The purpose of testing for assumptions and outliers is to identify a stronger model. The main question to be answered in this problem is whether or not the use of transformed variables to satisfy assumptions and the removal of outliers improve the overall relationship between the independent variables and the dependent variable, as measured by R².

Specifically, the question asks whether or not the R² for a regression analysis after substituting transformed variables and eliminating outliers is 3.6% higher than a regression analysis using the original format for all variables and including all cases.


Slide 61

R² before transformations or removing outliers

To start out, we run a standard multiple regression analysis with prestg80 as the dependent variable and age, educ, and sex as the independent variables.


Slide 62

R² before transformations or removing outliers

Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 27.1%. This is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.

For this particular question, we are not interested in the statistical significance of the overall relationship prior to transformations and removing outliers. In fact, it is possible that the relationship is not statistically significant due to variables that are not normal, relationships that are not linear, and the inclusion of outliers.


Slide 63

Normality of the dependent variable

In evaluating assumptions, the first step is to examine the normality of the dependent variable. If it is not normally distributed, or cannot be normalized with a transformation, it can affect the relationships with all other variables.

To test the normality of the dependent variable, run the script: NormalityAssumptionAndTransformations.SBS

First, move the dependent variable PRESTG80 to the list box of variables to test.

Second, click on the OK button to produce the output.


Slide 64

Normality of the dependent variable

The dependent variable "occupational prestige score" [prestg80] satisfies the criteria for a normal distribution. The skewness (0.401) and kurtosis (-0.630) were both between -1.0 and +1.0. No transformation is necessary.


Slide 65

Normality of independent variable: Age

After evaluating the dependent variable, we examine the normality of each metric variable and linearity of its relationship with the dependent variable.

To test the normality of age, run the script: NormalityAssumptionAndTransformations.SBS

First, move the independent variable AGE to the list box of variables to test.

Second, click on the OK button to produce the output.


Slide 66

Normality of independent variable: Age

Descriptives: AGE OF RESPONDENT
  Mean                              45.99  (Std. Error 1.023)
  95% Confidence Interval for Mean  43.98 to 48.00
  5% Trimmed Mean                   45.31
  Median                            43.50
  Variance                          282.465
  Std. Deviation                    16.807
  Minimum                           19
  Maximum                           89
  Range                             70
  Interquartile Range               24.00
  Skewness                          .595  (Std. Error .148)
  Kurtosis                          -.351  (Std. Error .295)

The independent variable "age" [age] satisfies the criteria for the assumption of normality, but does not satisfy the assumption of linearity with the dependent variable "occupational prestige score" [prestg80].

In evaluating normality, the skewness (0.595) and kurtosis (-0.351) were both within the range of acceptable values from -1.0 to +1.0.


Slide 67

Linearity and independent variable: Age

To evaluate the linearity of the relationship between age and occupational prestige, run the script for the assumption of linearity:

LinearityAssumptionAndTransformations.SBS

First, move the dependent variable PRESTG80 to the text box for the dependent variable.

Second, move the independent variable, AGE, to the list box for independent variables.

Third, click on the OK button to produce the output.


Slide 68

Linearity and independent variable: Age

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980) (Pearson r, 2-tailed significance, N = 255):
  AGE OF RESPONDENT               .024   (.706)
  Logarithm of AGE [LG10(AGE)]    .059   (.348)
  Square of AGE [(AGE)**2]        -.004  (.956)
  Square Root of AGE [SQRT(AGE)]  .041   (.518)
  Inverse of AGE [-1/(AGE)]       .096   (.128)
  None of the correlations with the dependent variable is significant at the 0.01 level (2-tailed).

The evidence of nonlinearity in the relationship between the independent variable "age" [age] and the dependent variable "occupational prestige score" [prestg80] was the lack of statistical significance of the correlation coefficient (r = 0.024). The probability for the correlation coefficient was 0.706, greater than the level of significance of 0.01. We cannot reject the null hypothesis that r = 0, and cannot conclude that there is a linear relationship between the variables.

Since none of the transformations to improve linearity were successful, it is an indication that the problem may be a weak relationship, rather than a curvilinear relationship correctable by using a transformation. A weak relationship is not a violation of the assumption of linearity, and does not require a caution.


Slide 69

Transformation for Age

The independent variable age satisfied the criteria for normality.

The independent variable age did not have a linear relationship to the dependent variable occupational prestige. However, none of the transformations linearized the relationship.

No transformation will be used - it would not help linearity and is not needed for normality.


Slide 70

Linearity and independent variable: Highest year of school completed

To evaluate the linearity of the relationship between highest year of school and occupational prestige, run the script for the assumption of linearity:

LinearityAssumptionAndTransformations.SBS

First, move the dependent variable PRESTG80 to the text box for the dependent variable.

Second, move the independent variable, EDUC, to the list box for independent variables.

Third, click on the OK button to produce the output.


Slide 71

Linearity and independent variable: Highest year of school completed

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980) (Pearson r, 2-tailed significance, N = 254):
  HIGHEST YEAR OF SCHOOL COMPLETED     .495**   (.000)
  Logarithm of EDUC [LG10(21-EDUC)]    -.512**  (.000)
  Square of EDUC [(EDUC)**2]           .528**   (.000)
  Square Root of EDUC [SQRT(21-EDUC)]  -.518**  (.000)
  Inverse of EDUC [-1/(21-EDUC)]       -.423**  (.000)
  **. Correlation is significant at the 0.01 level (2-tailed).

The independent variable "highest year of school completed" [educ] satisfies the criteria for the assumption of linearity with the dependent variable "occupational prestige score" [prestg80], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the independent variable "highest year of school completed" [educ] and the dependent variable "occupational prestige score" [prestg80] was the statistical significance of the correlation coefficient (r = 0.495). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.

Slide 72

Normality of independent variable: Highest year of school completed

To test the normality of EDUC, highest year of school completed, run the script:

NormalityAssumptionAndTransformations.SBS

First, move the variable EDUC (an independent variable in this analysis) to the list box of variables to test.

Second, click on the OK button to produce the output.
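A minimal sketch of syntax that produces the same normality statistics the script reports for EDUC (the descriptives, including skewness and kurtosis, plus a normal probability plot):

* Sketch: normality check for highest year of school completed.
EXAMINE VARIABLES=educ
  /PLOT NPPLOT
  /STATISTICS DESCRIPTIVES.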

Slide 73

Descriptives

HIGHEST YEAR OF SCHOOL COMPLETED          Statistic   Std. Error
Mean                                       13.12        .179
95% Confidence Interval    Lower Bound     12.77
for Mean                   Upper Bound     13.47
5% Trimmed Mean                            13.14
Median                                     13.00
Variance                                    8.583
Std. Deviation                              2.930
Minimum                                     2
Maximum                                    20
Range                                      18
Interquartile Range                         3.00
Skewness                                    -.137       .149
Kurtosis                                   1.246        .296

Normality of independent variable: Highest year of school completed

In evaluating normality, the skewness (-0.137) was between -1.0 and +1.0, but the kurtosis (1.246) was outside the range from -1.0 to +1.0. None of the transformations for normalizing the distribution of "highest year of school completed" [educ] were effective.

Slide 74

Transformation for highest year of school

The independent variable, highest year of school, had a linear relationship to the dependent variable, occupational prestige.

The independent variable, highest year of school, did not satisfy the criteria for normality. None of the transformations for normalizing the distribution of "highest year of school completed" [educ] were effective.

No transformation will be used - it would not help normality and is not needed for linearity. A caution should be added to any findings.

Slide 75

Homoscedasticity: sex

To evaluate the homoscedasticity of the relationship between sex and occupational prestige, run the script for the assumption of homogeneity of variance:

HomoscedasticityAssumptionAnd Transformations.SBS

First, move the dependent variable PRESTG80 to the text box for the dependent variable.

Second, move the independent variable, SEX, to the list box for independent variables.

Third, click on the OK button to produce the output.
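The homoscedasticity script is built around the Levene test of homogeneity of variance; a sketch of one way to obtain the same test in syntax, using a one-way ANOVA with the homogeneity statistic:

* Sketch: Levene test of the homogeneity of the variance in
* occupational prestige across the categories of sex.
ONEWAY prestg80 BY sex
  /STATISTICS HOMOGENEITY.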

Slide 76

Homoscedasticity: sex

Based on the Levene test, the variance in "occupational prestige score" [prestg80] is homogeneous across the categories of "sex" [sex]. The probability associated with the Levene statistic (p = 0.808) is greater than the level of significance, so we fail to reject the null hypothesis of equal variances and conclude that the homoscedasticity assumption is satisfied.

Even if the assumption were violated, we would not transform the dependent variable, since the transformation could alter the relationships of the other independent variables with the dependent variable.

Slide 77

Adding a transformed variable

Even though we do not need a transformation for any of the variables in this analysis, we will demonstrate how to use a script, such as the normality script, to add a transformed variable to the data set, e.g. a logarithmic transformation for highest year of school.

First, move the variable that we want to transform to the list box of variables to test.

Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes.

Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable.

Fourth, click on the OK button to produce the output.
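The script's effect is the same as computing the transformed variable directly. A sketch, where the name lgeduc is illustrative (the script assigns its own name) and 21 is the reflection constant for EDUC described earlier:

* Sketch: add the logarithmic transformation of EDUC to the data set.
COMPUTE lgeduc = LG10(21 - educ).
VARIABLE LABELS lgeduc 'Logarithm of EDUC [LG10(21-EDUC)]'.
EXECUTE.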

Slide 78

The transformed variable in the data editor

If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set.

Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.

Slide 79

The regression to identify outliers

We can use the regression procedure to identify both univariate and multivariate outliers.

We start with the same dialog we used for the last analysis, in which prestg80 was the dependent variable and age, educ, and sex were the independent variables.

If we need to use any transformed variables, we would substitute them now.

We will save the calculated values of the outlier statistics to the data set.

Click on the Save… button to specify what we want to save.

Slide 80

Saving the measures of outliers

First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set.

Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables.

Third, click on the OK button to complete the specifications.
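If pasted instead of run, this dialog produces syntax along the lines of the following sketch:

* Sketch: regression run that saves the outlier diagnostics.
* SRESID saves studentized residuals as sre_1; MAHAL saves
* Mahalanobis distances as mah_1.
REGRESSION
  /DEPENDENT prestg80
  /METHOD=ENTER age educ sex
  /SAVE SRESID MAHAL.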

Slide 81

The variables for identifying outliers

The values for identifying univariate outliers on the dependent variable are in a column that SPSS has named sre_1.

The values for identifying multivariate outliers on the independent variables are in a column that SPSS has named mah_1.

Slide 82

Computing the probability for Mahalanobis D²

To compute the probability of D², we will use an SPSS function in a Compute command.

First, select the Compute… command from the Transform menu.

Slide 83

Formula for probability for Mahalanobis D²

First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score.

Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.

Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.

Third, click on the OK button to signal completion of the compute variable dialog.
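The completed dialog is equivalent to syntax along these lines:

* Compute the upper-tail probability of the chi-square distribution
* with 3 degrees of freedom (one per independent variable) at each
* Mahalanobis D-squared value.
COMPUTE p_mah_1 = 1 - CDF.CHISQ(mah_1, 3).
EXECUTE.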

Slide 84

The multivariate outlier

Using the probabilities computed in p_mah_1 to identify outliers, scroll down through the list of cases to see the one case with a probability less than 0.001.

There is 1 case that has a combination of scores on the independent variables that is sufficiently unusual to be considered an outlier (case 20001984: Mahalanobis D²=16.97, p=0.0007).

Slide 85

The univariate outlier

Similarly, we can scroll down the values of sre_1, the studentized residuals, to see the one outlier with a value larger than 3.0 (regardless of sign).

There is 1 case that has a score on the dependent variable that is sufficiently unusual to be considered an outlier (case 20000391: studentized residual=4.14).
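Instead of scrolling through the data editor, the flagged cases can be listed with syntax such as this sketch; caseid is a hypothetical stand-in for whatever case-identifier variable the data set uses:

* Sketch: list cases flagged as univariate or multivariate outliers.
* caseid stands in for the data set's case identifier variable.
TEMPORARY.
SELECT IF (ABS(sre_1) >= 3 OR p_mah_1 < 0.001).
LIST VARIABLES=caseid sre_1 mah_1 p_mah_1.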

Slide 86

Omitting the outliers

To omit the outliers from the analysis, we select only the cases that are not outliers.

First, select the Select Cases… command from the Data menu.

Slide 87

Specifying the condition to omit outliers

First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.

Second, click on the If… button to specify the criteria for inclusion in the analysis.

Slide 88

The formula for omitting outliers

To eliminate the outliers, we request the cases that are not outliers.

The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3 and the probability for Mahalanobis D² is higher than the level of significance, 0.001.

After typing in the formula, click on the Continue button to close the dialog box.
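Pasting the Select Cases specification yields filter syntax along these lines (a sketch; SPSS also adds variable and value labels for the filter variable):

* Sketch: keep cases that are neither univariate nor multivariate
* outliers. FILTER excludes, but does not delete, the other cases;
* cases with missing diagnostics are excluded as well.
USE ALL.
COMPUTE filter_$ = (ABS(sre_1) < 3 AND p_mah_1 >= 0.001).
FILTER BY filter_$.
EXECUTE.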

Slide 89

Completing the request for the selection

To complete the request, we click on the OK button.

Slide 90

The omitted multivariate outlier

SPSS identifies the excluded cases by drawing a slash mark through the case number. Most of the slashes are for cases with missing data, but we also see that the case with the low probability for Mahalanobis distance is included in those that will be omitted.

Slide 91

Running the regression without outliers

We run the regression again, excluding the outliers. Select the Regression | Linear command from the Analyze menu.

Slide 92

Opening the save options dialog

We specify the dependent and independent variables. If we wanted to use any transformed variables, we would substitute them now.

On our last run, we instructed SPSS to save studentized residuals and Mahalanobis distance. To prevent these values from being calculated again, click on the Save… button.

Slide 93

Clearing the request to save outlier data

First, clear the checkbox for Studentized residuals.

Second, clear the checkbox for Mahalanobis distance.

Third, click on the OK button to complete the specifications.

Slide 94

Opening the statistics options dialog

Once we have removed outliers, we need to check the sample size requirement for regression.

Since we will need the descriptive statistics for this, click on the Statistics… button.

Slide 95

Requesting descriptive statistics

First, mark the checkbox for Descriptives.

Second, click on the Continue button to complete the specifications.

Slide 96

Requesting the output

Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.

Slide 97

Sample size requirement

The minimum ratio of valid cases to independent variables for multiple regression is 5 to 1. After removing 2 outliers, there are 252 valid cases and 3 independent variables.

The ratio of cases to independent variables for this analysis is 84.0 to 1, which satisfies the minimum requirement. In addition, the ratio of 84.0 to 1 satisfies the preferred ratio of 15 to 1.

Slide 98

Significance of regression relationship

The probability of the F statistic (36.639) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0).

We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.

Slide 99

Increase in proportion of variance

Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 27.1%. No transformed variables were substituted to satisfy assumptions, but outliers were removed from the sample.

The proportion of variance explained by the regression analysis after removing outliers was 30.7%, a difference of 3.6%.

The answer to the question is true with caution.

A caution is added because of a violation of regression assumptions.

Slide 100

Impact of assumptions and outliers - 1

The following is a guide to the decision process for answering problems about the impact of assumptions and outliers on analysis:

Is the dependent variable metric, and are the independent variables metric or dichotomous?
  No: Inappropriate application of a statistic.
  Yes: Is the ratio of cases to independent variables at least 5 to 1?
    No: Inappropriate application of a statistic.
    Yes: Run the baseline regression, using the method for including variables identified in the research question, and record R² for future reference.

Slide 101

Impact of assumptions and outliers - 2

Is the dependent variable normally distributed?
  No: Try (1) a logarithmic transformation, (2) a square root transformation, or (3) an inverse transformation. If unsuccessful, add a caution.
  Yes: Continue.

Are the metric independent variables normally distributed and linearly related to the dependent variable?
  No: Try (1) a logarithmic transformation, (2) a square root transformation, (3) a square transformation, or (4) an inverse transformation. If unsuccessful, add a caution.
  Yes: Continue.

Is the dependent variable homoscedastic for the categories of the dichotomous independent variables?
  No: Add a caution.
  Yes: Continue.

Slide 102

Impact of assumptions and outliers - 3

Substituting any transformed variables, run the regression using direct entry to include all variables, requesting the statistics for detecting outliers.

Are there univariate outliers (on the dependent variable) or multivariate outliers (on the independent variables)?
  No: Continue.
  Yes: Remove the outliers from the data. Is the ratio of cases to independent variables still at least 5 to 1?
    No: Inappropriate application of a statistic.
    Yes: Run the regression again, using the transformed variables and eliminating the outliers.

Slide 103

Impact of assumptions and outliers - 4

Is the probability of the ANOVA test of the regression less than or equal to the level of significance?
  No: False.
  Yes: Continue.

Is the stated increase in R² correct?
  No: False.
  Yes: Continue.

Does the sample satisfy the preferred ratio of cases to independent variables, 15 to 1 (50 to 1 for stepwise entry)?
  No: True with caution.
  Yes: Continue.

Slide 104

Impact of assumptions and outliers - 5

Were other cautions added for ordinal variables or violations of assumptions?
  Yes: True with caution.
  No: True.
