7/16/2019 MultipleRegression_AssumptionsAndOUtliers
http://slidepdf.com/reader/full/multipleregressionassumptionsandoutliers 1/104
SW388R7
Data Analysis &
Computers II
Slide 1
Multiple Regression – Assumptions and Outliers
Multiple Regression and Assumptions
Multiple Regression and Outliers
Strategy for Solving Problems
Practice Problems
Multiple Regression and Assumptions
Multiple regression is most effective at identifying the relationship between a dependent variable and a combination of independent variables when its underlying assumptions are satisfied: each metric variable is normally distributed, the relationships between metric variables are linear, and the relationship between metric and dichotomous variables is homoscedastic.
Failing to satisfy the assumptions does not mean that our answer is wrong; it means that our solution may under-report the strength of the relationships.
Multiple Regression and Outliers
Outliers can distort the regression results. When an
outlier is included in the analysis, it pulls the
regression line towards itself. This can result in a
solution that is more accurate for the outlier, but
less accurate for all of the other cases in the data
set.
We will check for univariate outliers on the
dependent variable and multivariate outliers on the
independent variables.
Relationship between assumptions and outliers
The problems of satisfying assumptions and detecting
outliers are intertwined. For example, if a case has
a value on the dependent variable that is an outlier,
it will affect the skew, and hence, the normality of
the distribution.
Removing an outlier may improve the distribution of
a variable.
Transforming a variable may reduce the likelihood
that the value for a case will be characterized as an
outlier.
Order of analysis is important
The order in which we check assumptions and detect
outliers will affect our results because we may get a
different subset of cases in the final analysis.
In order to maximize the number of cases available
to the analysis, we will evaluate assumptions first.
We will substitute any transformations of variables that enable us to satisfy the assumptions.
We will use any transformed variables that are required in our analysis to detect outliers.
Strategy for solving problems
Our strategy for solving problems about violations of assumptions and outliers will include the following steps:
1. Run the type of regression specified in the problem statement on the variables, using the full data set.
2. Test the dependent variable for normality. If it satisfies the criteria for normality only after transformation, substitute the transformed variable in the remaining tests that call for the use of the dependent variable.
3. Test for normality, linearity, and homoscedasticity using the scripts. Decide which transformations should be used.
4. Substitute transformations and run the regression entering all independent variables, saving studentized residuals and Mahalanobis distance scores. Compute probabilities for D².
5. Remove the outliers (studentized residual greater than 3 or Mahalanobis D² with p <= 0.001), and run the regression with the method and variables specified in the problem.
6. Compare R² for the analysis using transformed variables and omitting outliers (step 5) to the R² obtained for the model using all data and original variables (step 1).
Transforming dependent variables
We will use the following logic to transform variables:
If the dependent variable is not normally distributed:
Try the log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria.
If no transformation satisfies the normality criteria, use the untransformed variable and add a caution for the violation of the assumption.
If a transformation satisfies normality, use the transformed variable in the tests of the independent variables.
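The transformation logic above can be sketched in code. This is an illustrative Python sketch (not the SPSS script the slides use) that applies the slides' rule of thumb for normality, skewness and kurtosis both between -1.0 and +1.0, and tries the transformations in the stated order of preference:

```python
import numpy as np
from scipy import stats

def is_normal_enough(x):
    """Rule of thumb from the slides: skewness and kurtosis in [-1, +1]."""
    return (abs(stats.skew(x, bias=False)) <= 1.0 and
            abs(stats.kurtosis(x, bias=False)) <= 1.0)

def transform_for_normality(x):
    """Return (name, values) for the first candidate that passes the rule."""
    x = np.asarray(x, dtype=float)
    candidates = [
        ("untransformed", x),
        ("log",           np.log10(x + 1)),   # +1 guards against log(0)
        ("square root",   np.sqrt(x + 1)),
        ("inverse",       -1.0 / (x + 1)),
    ]
    for name, t in candidates:
        if is_normal_enough(t):
            return name, t
    # No transformation worked: use the original variable with a caution.
    return "untransformed (caution)", x
```

Note that SPSS reports skewness and kurtosis with bias-corrected estimators, which is why `bias=False` is passed to the scipy functions here.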
Transforming independent variables - 1
If the independent variable is normally distributed and linearly related to the dependent variable, use it as is.
If the independent variable is normally distributed but not linearly related to the dependent variable:
Try the log, square root, square, and inverse transformations. Use the first transformed variable that satisfies the linearity criteria and does not violate the normality criteria.
If no transformation satisfies the linearity criteria without violating the normality criteria, use the untransformed variable and add a caution for the violation of the assumption.
Transforming independent variables - 2
If the independent variable is linearly related to the dependent variable but not normally distributed:
Try the log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria and retains a significant correlation with the dependent variable.
If no transformation satisfies the normality criteria with a significant correlation, use the untransformed variable and add a caution for the violation of the assumption.
Transforming independent variables - 3
If the independent variable is not linearly related to the dependent variable and is not normally distributed:
Try the log, square root, square, and inverse transformations. Use the first transformed variable that satisfies the normality criteria and has a significant correlation.
If no transformation satisfies the normality criteria with a significant correlation, use the untransformed variable and add a caution for the violation of the assumption.
Impact of transformations and omitting outliers
We evaluate the regression assumptions and detect
outliers with a view toward strengthening the
relationship.
This may not happen. The regression may be the same, it may be weaker, or it may be stronger. We cannot be certain of the impact until we run the regression again.
In the end, we may opt not to exclude outliers and
not to employ transformations; the analysis informs
us of the consequences of doing either.
Notes
Whenever you start a new problem, make sure you have removed variables created for a previous analysis and have included all cases back into the data set.
I have added the square transformation to the checkboxes for transformations in the normality script. Since this is an option for linearity, we need to be able to evaluate its impact on normality.
If you change the options for output in pivot tables from labels to names, you will get an error message when you use the linearity script. To solve the problem, change the option for output in pivot tables back to labels.
Problem 1
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 1 - 1
The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.01 as alpha for the regression analysis as well as for testing assumptions.
Dissecting problem 1 - 2
The method for selecting variables is derived from the research question. In this problem we are asked to identify the best subset of predictors, so we do a stepwise multiple regression.
Dissecting problem 1 - 3
The purpose of testing for assumptions and outliers is to identify a stronger model. The main question to be answered in this problem is whether the use of transformed variables to satisfy assumptions and the removal of outliers improves the overall relationship between the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether the R² for a regression analysis after substituting transformed variables and eliminating outliers is 10.8% higher than for a regression analysis using the original format for all variables and including all cases.
R² before transformations or removing outliers
To start out, we run a stepwise multiple regression analysis with income98 as the dependent variable and sex, earnrs, and rincom98 as the independent variables.
We select stepwise as the method to select the best subset of predictors.
R² before transformations or removing outliers
Prior to any transformations of variables to satisfy the assumptions of multiple regression, or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%. This is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.
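The benchmark idea can be sketched outside SPSS as well: fit the full-data model, record R², and later compare against the model fitted after transformations and outlier removal. A minimal Python sketch with plain least squares on synthetic data (not the GSS2000.sav variables):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
# Design matrix: intercept plus three stand-in independent variables.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 2, n)

# Ordinary least squares fit and R-squared of the baseline model.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2_baseline = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(f"baseline R^2 = {r2_baseline:.3f}")
```

After substituting transformed variables and dropping outlier cases, the same computation on the modified data yields the R² that is compared to this baseline.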
R² before transformations or removing outliers
For this particular question, we are not interested in the statistical significance of the overall relationship prior to transformations and removing outliers. In fact, it is possible that the relationship is not statistically significant due to variables that are not normal, relationships that are not linear, and the inclusion of outliers.
Normality of the dependent variable: total family income
In evaluating assumptions, the first step is to examine the normality of the dependent variable. If it is not normally distributed, or cannot be normalized with a transformation, it can affect the relationships with all other variables.
To test the normality of the dependent variable, run the script: NormalityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: TOTAL FAMILY INCOME

                                     Statistic   Std. Error
Mean                                     15.67         .349
95% Confidence Interval for Mean
  Lower Bound                            14.98
  Upper Bound                            16.36
5% Trimmed Mean                          15.95
Median                                   17.00
Variance                                27.951
Std. Deviation                           5.287
Minimum                                      1
Maximum                                     23
Range                                       22
Interquartile Range                       8.00
Skewness                                  -.628        .161
Kurtosis                                  -.248        .320
Normality of the dependent variable: total family income
The dependent variable "total family income" [income98] satisfies the criteria for a normal distribution. The skewness (-0.628) and kurtosis (-0.248) were both between -1.0 and +1.0. No transformation is necessary.
Linearity and independent variable: how many in family earned money
To evaluate the linearity of the relationship between number of earners and total family income, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the text box for the dependent variable.
Second, move the independent variable, EARNRS, to the list box for independent variables.
Third, click on the OK button to produce the output.
Correlations (Pearson Correlation / Sig. 2-tailed / N)

                                          TOTAL FAMILY       HOW MANY IN FAMILY
                                          INCOME             EARNED MONEY
TOTAL FAMILY INCOME                       1      (.)    229   .505** (.000) 228
HOW MANY IN FAMILY EARNED MONEY           .505** (.000) 228   1      (.)   269
Logarithm of EARNRS [LG10(1+EARNRS)]      .536** (.000) 228   .959** (.000) 269
Square of EARNRS [(EARNRS)**2]            .376** (.000) 228   .908** (.000) 269
Square Root of EARNRS [SQRT(1+EARNRS)]    .527** (.000) 228   .989** (.000) 269
Inverse of EARNRS [-1/(1+EARNRS)]         .526** (.000) 228   .871** (.000) 269

**. Correlation is significant at the 0.01 level (2-tailed).
Linearity and independent variable: how many in family earned money
The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the two variables was the statistical significance of the correlation coefficient (r = 0.505). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.
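The linearity check described above is a Pearson correlation whose p-value is compared against alpha = 0.01. A hedged Python sketch with synthetic stand-ins for EARNRS and INCOME98 (not the actual GSS data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic data: earner counts 0-5 and an income score linearly related to them.
earnrs = rng.integers(0, 6, 200).astype(float)
income = 10 + 2.5 * earnrs + rng.normal(0, 3, 200)

r, p = stats.pearsonr(earnrs, income)
if p <= 0.01:
    print(f"linear relationship supported (r = {r:.3f}, p = {p:.3g})")
else:
    print("no evidence of a linear relationship at alpha = 0.01")
```

A significant r rejects the null hypothesis that r = 0, which is the slides' evidence for linearity; it does not by itself rule out an additional curvilinear component.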
Normality of independent variable: how many in family earned money
After evaluating the dependent variable, we examine the normality of each metric variable and the linearity of its relationship with the dependent variable.
To test the normality of the number of earners in family, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable EARNRS to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: HOW MANY IN FAMILY EARNED MONEY

                                     Statistic   Std. Error
Mean                                      1.43         .061
95% Confidence Interval for Mean
  Lower Bound                             1.31
  Upper Bound                             1.56
5% Trimmed Mean                           1.37
Median                                    1.00
Variance                                 1.015
Std. Deviation                           1.008
Minimum                                      0
Maximum                                      5
Range                                        5
Interquartile Range                       1.00
Skewness                                  .742        .149
Kurtosis                                 1.324        .296
Normality of independent variable: how many in family earned money
The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality.
In evaluating normality, the skewness (0.742) was between -1.0 and +1.0, but the kurtosis (1.324) was outside the range from -1.0 to +1.0.
Normality of independent variable: how many in family earned money
The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98]. In evaluating normality, the skewness (-0.483) and kurtosis (-0.309) were both within the range of acceptable values from -1.0 to +1.0. The correlation coefficient for the transformed variable is 0.536.
The square root transformation also has values of skewness and kurtosis in the acceptable range. However, by our order of preference for which transformation to use, the logarithm is preferred.
Transformation for how many in family earned money
The independent variable, how many in family
earned money, had a linear relationship to the
dependent variable, total family income.
The logarithmic transformation improves the
normality of "how many in family earned money"
[earnrs] without a reduction in the strength of the
relationship to "total family income" [income98].
We will substitute the logarithmic transformation of
how many in family earned money in the regression
analysis.
Normality of independent variable: respondent's income
After evaluating the dependent variable, we examine the normality of each metric variable and the linearity of its relationship with the dependent variable.
To test the normality of respondent's income, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable RINCOM98 to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: RESPONDENTS INCOME

                                     Statistic   Std. Error
Mean                                     13.35         .419
95% Confidence Interval for Mean
  Lower Bound                            12.52
  Upper Bound                            14.18
5% Trimmed Mean                          13.54
Median                                   15.00
Variance                                29.535
Std. Deviation                           5.435
Minimum                                      1
Maximum                                     23
Range                                       22
Interquartile Range                       8.00
Skewness                                  -.686        .187
Kurtosis                                  -.253        .373
Normality of independent variable: respondent's income
The independent variable "income" [rincom98] satisfies the criteria for both the assumption of normality and the assumption of linearity with the dependent variable "total family income" [income98].
In evaluating normality, the skewness (-0.686) and kurtosis (-0.253) were both within the range of acceptable values from -1.0 to +1.0.
Linearity and independent variable: respondent's income
To evaluate the linearity of the relationship between respondent's income and total family income, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the text box for the dependent variable.
Second, move the independent variable, RINCOM98, to the list box for independent variables.
Third, click on the OK button to produce the output.
Correlations (Pearson Correlation / Sig. 2-tailed / N)

                                              TOTAL FAMILY       RESPONDENTS        Logarithm of RINCOM98
                                              INCOME             INCOME             [LG10(24-RINCOM98)]
TOTAL FAMILY INCOME                           1      (.)    229   .577** (.000) 163  -.595** (.000) 163
RESPONDENTS INCOME                            .577** (.000) 163   1      (.)   168   -.922** (.000) 168
Logarithm of RINCOM98 [LG10(24-RINCOM98)]     -.595** (.000) 163  -.922** (.000) 168  1      (.)   168
Square of RINCOM98 [(RINCOM98)**2]            .613** (.000) 163   .967** (.000) 168  -.976** (.000) 168
Square Root of RINCOM98 [SQRT(24-RINCOM98)]   -.601** (.000) 163  -.985** (.000) 168  .974** (.000) 168
Inverse of RINCOM98 [-1/(24-RINCOM98)]        -.434** (.000) 163  -.602** (.000) 168  .848** (.000) 168

**. Correlation is significant at the 0.01 level (2-tailed).
Linearity and independent variable: respondent's income
The evidence of linearity in the relationship between the independent variable "income" [rincom98] and the dependent variable "total family income" [income98] was the statistical significance of the correlation coefficient (r = 0.577). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.
Homoscedasticity: sex
To evaluate the homoscedasticity of the relationship between sex and total family income, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the text box for the dependent variable.
Second, move the independent variable, SEX, to the list box for independent variables.
Third, click on the OK button to produce the output.
Homoscedasticity: sex
Based on the Levene test, the variance in "total family income" [income98] is homogeneous for the categories of "sex" [sex].
The probability associated with the Levene statistic (0.031) is greater than the level of significance (0.01), so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.
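The same homogeneity-of-variance test can be sketched in Python with `scipy.stats.levene`. The two groups below are synthetic stand-ins for the SEX categories, not the GSS data; `center='mean'` is passed because SPSS's Levene test centers on the group means (scipy's default is the median, the Brown-Forsythe variant):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic income scores for two groups with equal spread.
income_group1 = rng.normal(15, 5, 120)
income_group2 = rng.normal(15, 5, 140)

# Mean-centered Levene test, matching the SPSS convention.
stat, p = stats.levene(income_group1, income_group2, center='mean')
if p > 0.01:
    print(f"fail to reject equal variances (W = {stat:.3f}, p = {p:.3f})")
else:
    print("variances differ at alpha = 0.01; homoscedasticity violated")
```

As on the slide, a p-value above the 0.01 level of significance means we retain the null hypothesis of equal variances.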
Adding a transformed variable
Even though we do not need a transformation for any of the variables in this analysis, we will demonstrate how to use a script, such as the normality script, to add a transformed variable to the data set, e.g. a logarithmic transformation for highest year of school.
First, move the variable that we want to transform to the list box of variables to test.
Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes.
Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable.
Fourth, click on the OK button to produce the output.
The transformed variable in the data editor
If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set.
Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.
The regression to identify outliers
We use the regression procedure to identify both univariate and multivariate outliers.
We start with the same dialog we used for the last analysis, in which income98 was the dependent variable and sex, earnrs, and rincom98 were the independent variables.
First, we substitute the logarithmic transformation of earnrs, logearn, into the list of independent variables.
Second, we change the method of entry from Stepwise to Enter so that all variables will be included in the detection of outliers.
Third, we want to save the calculated values of the outlier statistics to the data set. Click on the Save… button to specify what we want to save.
Saving the measures of outliers
First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set.
Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables.
Third, click on the OK button to complete the specifications.
The variables for identifying outliers
The values for identifying univariate outliers on the dependent variable are in a column which SPSS has named sre_1.
The values for identifying multivariate outliers on the independent variables are in a column which SPSS has named mah_1.
Computing the probability for Mahalanobis D²
To compute the probability of D², we will use an SPSS function in a Compute command.
First, select the Compute… command from the Transform menu.
Formula for probability for Mahalanobis D²
First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score.
Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.
Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.
Third, click on the OK button to signal completion of the compute variable dialog.
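The Compute step above, `p_mah_1 = 1 - CDF.CHISQ(mah_1, 3)`, has a direct Python equivalent. Under multivariate normality, Mahalanobis D² for k independent variables follows a chi-square distribution with k degrees of freedom; `chi2.sf` is the upper-tail probability (1 minus the CDF). The D² scores below are made-up examples, not values from the GSS data:

```python
from scipy.stats import chi2

mah_1 = [4.2, 7.9, 16.9]   # example Mahalanobis D-squared scores (made up)
k = 3                      # number of independent variables, as in the slides

# sf(x, df) = 1 - cdf(x, df): the upper-tail probability of D-squared.
p_mah_1 = [chi2.sf(d2, df=k) for d2 in mah_1]

# A case is flagged as a multivariate outlier when its probability <= 0.001.
outliers = [p <= 0.001 for p in p_mah_1]
print(outliers)   # only the largest D-squared exceeds the 0.001 cutoff
```

The cutoff D² for p = 0.001 at 3 degrees of freedom is about 16.27, so only the third score is flagged.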
Multivariate outliers
Using the probabilities computed in p_mah_1 to identify outliers, scroll down through the list of cases to see if we can find cases with a probability less than 0.001.
There are no outliers for the set of independent variables.
Univariate outliers
Similarly, we can scroll down the values of sre_1, the studentized residual, to see the outliers with values larger than ±3.0.
Based on these criteria, there are 4 outliers. There are 4 cases that have a score on the dependent variable that is sufficiently unusual to be considered outliers (case 20000357: studentized residual = 3.08; case 20000416: studentized residual = 3.57; case 20001379: studentized residual = 3.27; case 20002702: studentized residual = -3.23).
Omitting the outliers
To omit the outliers from the analysis, we select only the cases that are not outliers.
First, select the Select Cases… command from the Data menu.
Specifying the condition to omit outliers
First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.
Second, click on the If… button to specify the criteria for inclusion in the analysis.
The formula for omitting outliers
To eliminate the outliers, we request the cases that are not outliers.
The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3 and the probability for Mahalanobis D² is higher than the level of significance, 0.001.
After typing in the formula, click on the Continue button to close the dialog box.
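The SPSS selection condition (ABS(sre_1) < 3 AND p_mah_1 > 0.001) can be mirrored as a simple row filter. This sketch uses made-up values for sre_1 and p_mah_1:

```python
# Each tuple: (case label, studentized residual, Mahalanobis D-squared probability)
cases = [
    ("A", 0.5, 0.80),    # ordinary case -> kept
    ("B", 3.6, 0.40),    # univariate outlier -> dropped
    ("C", 1.1, 0.0004),  # multivariate outlier -> dropped
]
kept = [name for name, sre_1, p_mah_1 in cases
        if abs(sre_1) < 3 and p_mah_1 > 0.001]
```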
Completing the request for the selection
To complete the request, we click on the OK button.
The omitted multivariate outlier
SPSS identifies the excluded cases by drawing a slash mark through the case number. Most of the slashes are for cases with missing data, but we also see that the case with the low probability for Mahalanobis distance is included in those that will be omitted.
Running the regression without outliers
We run the regression again, excluding the outliers. Select the Regression | Linear command from the Analyze menu.
Opening the save options dialog
We specify the dependent and independent variables, substituting any transformed variables required by assumptions.
On our last run, we instructed SPSS to save studentized residuals and Mahalanobis distance. To prevent these values from being calculated again, click on the Save… button.
When we used regression to detect outliers, we entered all variables. Now we are testing the relationship specified in the problem, so we change the method to Stepwise.
Clearing the request to save outlier data
First, clear the checkbox for Studentized residuals.
Second, clear the checkbox for Mahalanobis distance.
Third, click on the OK button to complete the specifications.
Opening the statistics options dialog
Once we have removed outliers, we need to check the sample size requirement for regression. Since we will need the descriptive statistics for this, click on the Statistics… button.
Requesting descriptive statistics
First, mark the checkbox for Descriptives.
Second, click on the Continue button to complete the specifications.
Requesting the output
Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.
Descriptive Statistics

                                        Mean   Std. Deviation    N
TOTAL FAMILY INCOME                    17.09            4.073  159
RESPONDENTS SEX                         1.55             .499  159
RESPONDENTS INCOME                     13.76            5.133  159
Logarithm of EARNRS [LG10(1+EARNRS)] .424896         .1156559  159
Sample size requirement
The minimum ratio of valid cases to independent variables for stepwise multiple regression is 5 to 1. After removing 4 outliers, there are 159 valid cases and 3 independent variables.
The ratio of cases to independent variables for this analysis is 53.0 to 1, which satisfies the minimum requirement. In addition, the ratio of 53.0 to 1 satisfies the preferred ratio of 50 to 1.
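As a quick sanity check, the ratio arithmetic can be sketched in Python (the counts are the ones reported on this slide):

```python
# Minimum ratio of valid cases to independent variables for
# stepwise regression is 5:1; the preferred ratio is 50:1.
valid_cases = 159        # after removing 4 outliers
independent_vars = 3
ratio = valid_cases / independent_vars
meets_minimum = ratio >= 5
meets_preferred = ratio >= 50
```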
ANOVA(d)

Model                Sum of Squares    df   Mean Square         F   Sig.
1   Regression             1122.398     1      1122.398   117.541  .000a
    Residual               1499.187   157         9.549
    Total                  2621.585   158
2   Regression             1572.722     2       786.361   116.957  .000b
    Residual               1048.863   156         6.723
    Total                  2621.585   158
3   Regression             1623.976     3       541.325    84.107  .000c
    Residual                997.609   155         6.436
    Total                  2621.585   158

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)], RESPONDENTS SEX
d. Dependent Variable: TOTAL FAMILY INCOME
Significance of regression relationship
The probability of the F statistic (84.107) for the regression relationship which includes these variables is <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that there is no relationship between the best subset of independent variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a statistically significant relationship between the best subset of independent variables and the dependent variable.
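The p-value for the model 3 F statistic can be reproduced from the F distribution, using the F value and degrees of freedom from the ANOVA table (3 regression df, 155 residual df):

```python
from scipy.stats import f

# Upper-tail probability of F = 84.107 with df = (3, 155)
p_value = f.sf(84.107, dfn=3, dfd=155)
reject_null = p_value <= 0.01
```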
Model Summary

Model       R   R Square   Adjusted R Square   Std. Error of the Estimate
1       .654a       .428                .424                        3.090
2       .775b       .600                .595                        2.593
3       .787c       .619                .612                        2.537

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)], RESPONDENTS SEX
Increase in proportion of variance
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%.
After transformed variables were substituted to satisfy assumptions and outliers were removed from the sample, the proportion of variance explained by the regression analysis was 61.9%, a difference of 10.8%.
The answer to the question is true with caution.
A caution is added because of the inclusion of ordinal level variables.
Problem 2
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 2 - 1
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.05 as alpha for the regression analysis and the more conservative 0.01 as the alpha in testing assumptions.
Dissecting problem 2 - 2
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
The method for selecting variables is derived from the research question. If we are asked to examine a relationship without any statement about control variables or the best subset of variables, we do a standard multiple regression.
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 2 - 3
The purpose of testing for assumptions and outliers is to identify a stronger model. The main question to be answered in this problem is whether or not the use of transformed variables to satisfy assumptions and the removal of outliers improves the overall relationship between the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether or not the R² for a regression analysis after substituting transformed variables and eliminating outliers is 3.6% higher than a regression analysis using the original format for all variables and including all cases.
R² before transformations or removing outliers
To start out, we run a standard multiple regression analysis with prestg80 as the dependent variable and age, educ, and sex as the independent variables.
R² before transformations or removing outliers
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 27.1%. This is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.
For this particular question, we are not interested in the statistical significance of the overall relationship prior to transformations and removing outliers. In fact, it is possible that the relationship is not statistically significant due to variables that are not normal, relationships that are not linear, and the inclusion of outliers.
Normality of the dependent variable
In evaluating assumptions, the first step is to examine the normality of the dependent variable. If it is not normally distributed, or cannot be normalized with a transformation, it can affect the relationships with all other variables.
To test the normality of the dependent variable, run the script: NormalityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the list box of variables to test.
Second, click on the OK button to produce the output.
Normality of the dependent variable
The dependent variable "occupational prestige score" [prestg80] satisfies the criteria for a normal distribution. The skewness (0.401) and kurtosis (-0.630) were both between -1.0 and +1.0. No transformation is necessary.
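The -1.0 to +1.0 rule of thumb used throughout these slides can be expressed as a small helper. This is just a sketch of the decision rule, not part of the SPSS script:

```python
# A variable is treated as "normal enough" here when both its
# skewness and kurtosis fall between -1.0 and +1.0.
def is_normal_enough(skewness, kurtosis):
    return -1.0 <= skewness <= 1.0 and -1.0 <= kurtosis <= 1.0

# Values reported for prestg80 on this slide
prestige_ok = is_normal_enough(0.401, -0.630)
```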
Normality of independent variable: Age
After evaluating the dependent variable, we examine the normality of each metric variable and the linearity of its relationship with the dependent variable.
To test the normality of age, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable AGE to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: AGE OF RESPONDENT

                                      Statistic   Std. Error
Mean                                      45.99        1.023
95% Confidence     Lower Bound            43.98
Interval for Mean  Upper Bound            48.00
5% Trimmed Mean                           45.31
Median                                    43.50
Variance                                282.465
Std. Deviation                           16.807
Minimum                                      19
Maximum                                      89
Range                                        70
Interquartile Range                       24.00
Skewness                                   .595         .148
Kurtosis                                  -.351         .295
Normality of independent variable: Age
The independent variable "age" [age] satisfies the criteria for the assumption of normality, but does not satisfy the assumption of linearity with the dependent variable "occupational prestige score" [prestg80].
In evaluating normality, the skewness (0.595) and kurtosis (-0.351) were both within the range of acceptable values from -1.0 to +1.0.
Linearity and independent variable: Age
To evaluate the linearity of the relationship between age and occupational prestige, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the text box for the dependent variable.
Second, move the independent variable, AGE, to the list box for independent variables.
Third, click on the OK button to produce the output.
Correlations

(Pearson r / Sig. 2-tailed / N; ** = significant at the 0.01 level, 2-tailed. The columns for Square Root of AGE and Inverse of AGE were cut off in the source.)

                                 PRESTG80    AGE       LG10(AGE)   (AGE)**2
RS OCCUPATIONAL PRESTIGE         1           .024      .059        -.004
SCORE (1980)                     .           .706      .348        .956
                                 255         255       255         255
AGE OF RESPONDENT                .024        1         .979**      .983**
                                 .706        .         .000        .000
                                 255         270       270         270
Logarithm of AGE [LG10(AGE)]     .059        .979**    1           .926**
                                 .348        .000      .           .000
                                 255         270       270         270
Square of AGE [(AGE)**2]         -.004       .983**    .926**      1
                                 .956        .000      .000        .
                                 255         270       270         270
Square Root of AGE [SQRT(AGE)]   .041        .995**    .994**      .960**
                                 .518        .000      .000        .000
                                 255         270       270         270
Inverse of AGE [-1/(AGE)]        .096        .916**    .978**      .832**
                                 .128        .000      .000        .000
                                 255         270       270         270
Linearity and independent variable: Age
The evidence of nonlinearity in the relationship between the independent variable "age" [age] and the dependent variable "occupational prestige score" [prestg80] was the lack of statistical significance of the correlation coefficient (r = 0.024). The probability for the correlation coefficient was 0.706, greater than the level of significance of 0.01. We cannot reject the null hypothesis that r = 0, and cannot conclude that there is a linear relationship between the variables.
Since none of the transformations to improve linearity were successful, it is an indication that the problem may be a weak relationship, rather than a curvilinear relationship correctable by using a transformation. A weak relationship is not a violation of the assumption of linearity, and does not require a caution.
Transformation for Age
The independent variable age satisfied the criteria for normality.
The independent variable age did not have a linear relationship to the dependent variable occupational prestige. However, none of the transformations linearized the relationship.
No transformation will be used - it would not help linearity and is not needed for normality.
Linearity and independent variable: Highest year of school completed
To evaluate the linearity of the relationship between highest year of school and occupational prestige, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the text box for the dependent variable.
Second, move the independent variable, EDUC, to the list box for independent variables.
Third, click on the OK button to produce the output.
Correlations

(Pearson r / Sig. 2-tailed / N; ** = significant at the 0.01 level, 2-tailed. The columns for Square Root of EDUC and Inverse of EDUC were cut off in the source.)

                                     PRESTG80   EDUC      LG10(21-EDUC)  (EDUC)**2
RS OCCUPATIONAL PRESTIGE             1          .495**    -.512**        .528**
SCORE (1980)                         .          .000      .000           .000
                                     255        254       254            254
HIGHEST YEAR OF                      .495**     1         -.920**        .980**
SCHOOL COMPLETED                     .000       .         .000           .000
                                     254        269       269            269
Logarithm of EDUC [LG10(21-EDUC)]    -.512**    -.920**   1              -.969**
                                     .000       .000      .              .000
                                     254        269       269            269
Square of EDUC [(EDUC)**2]           .528**     .980**    -.969**        1
                                     .000       .000      .000           .
                                     254        269       269            269
Square Root of EDUC [SQRT(21-EDUC)]  -.518**    -.982**   .977**         -.997**
                                     .000       .000      .000           .000
                                     254        269       269            269
Inverse of EDUC [-1/(21-EDUC)]       -.423**    -.699**   .915**         -.789**
                                     .000       .000      .000           .000
                                     254        269       269            269
Linearity and independent variable: Highest year of school completed
The independent variable "highest year of school completed" [educ] satisfies the criteria for the assumption of linearity with the dependent variable "occupational prestige score" [prestg80], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the independent variable "highest year of school completed" [educ] and the dependent variable "occupational prestige score" [prestg80] was the statistical significance of the correlation coefficient (r = 0.495). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.
Normality of independent variable: Highest year of school completed
To test the normality of EDUC, highest year of school completed, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable EDUC to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: HIGHEST YEAR OF SCHOOL COMPLETED

                                      Statistic   Std. Error
Mean                                      13.12         .179
95% Confidence     Lower Bound            12.77
Interval for Mean  Upper Bound            13.47
5% Trimmed Mean                           13.14
Median                                    13.00
Variance                                  8.583
Std. Deviation                            2.930
Minimum                                       2
Maximum                                      20
Range                                        18
Interquartile Range                        3.00
Skewness                                  -.137         .149
Kurtosis                                  1.246         .296
Normality of independent variable: Highest year of school completed
In evaluating normality, the skewness (-0.137) was between -1.0 and +1.0, but the kurtosis (1.246) was outside the range from -1.0 to +1.0. None of the transformations for normalizing the distribution of "highest year of school completed" [educ] were effective.
Transformation for highest year of school
The independent variable, highest year of school, had a linear relationship to the dependent variable, occupational prestige.
The independent variable, highest year of school, did not satisfy the criteria for normality. None of the transformations for normalizing the distribution of "highest year of school completed" [educ] were effective.
No transformation will be used - it would not help normality and is not needed for linearity. A caution should be added to any findings.
Homoscedasticity: sex
To evaluate the homoscedasticity of the relationship between sex and occupational prestige, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the text box for the dependent variable.
Second, move the independent variable, SEX, to the list box for independent variables.
Third, click on the OK button to produce the output.
Homoscedasticity: sex
Based on the Levene Test, the variance in "occupational prestige score" [prestg80] is homogeneous for the categories of "sex" [sex]. The probability associated with the Levene Statistic (0.808) is greater than the level of significance, so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.
Even if we violate the assumption, we would not do a transformation since it could impact the relationships of the other independent variables with the dependent variable.
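The Levene test itself is easy to reproduce. This sketch uses constructed groups with identical spread (not the GSS data), so the test clearly fails to reject the null hypothesis of equal variances:

```python
from scipy.stats import levene

# Two groups with different centers but identical spread, standing in
# for prestige scores split by a dichotomous variable such as sex.
group_a = [40, 42, 44, 46, 48] * 10
group_b = [39, 41, 43, 45, 47] * 10

stat, p = levene(group_a, group_b)
homoscedastic = p > 0.01   # fail to reject the null hypothesis
```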
Adding a transformed variable
Even though we do not need a transformation for any of the variables in this analysis, we will demonstrate how to use a script, such as the normality script, to add a transformed variable to the data set, e.g. a logarithmic transformation for highest year of school.
First, move the variable that we want to transform to the list box of variables to test.
Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes.
Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable.
Fourth, click on the OK button to produce the output.
The transformed variable in the data editor
If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set.
Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.
The regression to identify outliers
We can use the regression procedure to identify both univariate and multivariate outliers.
We start with the same dialog we used for the last analysis, in which prestg80 was the dependent variable and age, educ, and sex were the independent variables.
If we need to use any transformed variables, we would substitute them now.
We will save the calculated values of the outlier statistics to the data set. Click on the Save… button to specify what we want to save.
Saving the measures of outliers
First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set.
Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables.
Third, click on the OK button to complete the specifications.
The variables for identifying outliers
The variables for identifying univariate outliers for the dependent variable are in a column which SPSS has named sre_1.
The variables for identifying multivariate outliers for the independent variables are in a column which SPSS has named mah_1.
Computing the probability for Mahalanobis D²
To compute the probability of D², we will use an SPSS function in a Compute command.
First, select the Compute… command from the Transform menu.
Formula for probability for Mahalanobis D²
First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score.
Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.
Third, click on the OK button to signal completion of the Compute Variable dialog.
Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.
The multivariate outlier
Using the probabilities computed in p_mah_1 to identify outliers, scroll down through the list of cases to see the one case with a probability less than 0.001.
There is 1 case that has a combination of scores on the independent variables that is sufficiently unusual to be considered an outlier (case 20001984: Mahalanobis D²=16.97, p=0.0007).
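The whole multivariate screen, from D² scores to the 0.001 probability cutoff, can be sketched end to end. This illustration uses synthetic data with one planted extreme case; it mirrors the logic of the SPSS steps rather than SPSS's internal computation:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_d2(X):
    # Mahalanobis D-squared of each row from the column means of X,
    # using the sample covariance of the independent variables.
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Synthetic data: 100 cases, 3 predictors, one planted extreme case.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[0] = [8, -8, 8]                      # the planted outlier

d2 = mahalanobis_d2(X)
p = chi2.sf(d2, df=3)                  # upper-tail probability
outlier_rows = np.where(p < 0.001)[0]  # the 0.001 rule from the slides
```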
The univariate outlier
Similarly, we can scroll down the values of sre_1, the studentized residual, to see the one outlier with a value larger than 3.0.
There is 1 case that has a score on the dependent variable that is sufficiently unusual to be considered an outlier (case 20000391: studentized residual=4.14).
Omitting the outliers
To omit the outliers from the analysis, we select only the cases that are not outliers.
First, select the Select Cases… command from the Data menu.
Specifying the condition to omit outliers
First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.
Second, click on the If… button to specify the criteria for inclusion in the analysis.
Slide 88
The formula for omitting outliers
To eliminate the outliers, we request the cases that are not outliers.
The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3 and the probability for Mahalanobis D² is higher than the level of significance, 0.001.
After typing in the formula, click on the Continue button to close the dialog box.
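The same inclusion condition (ABS(sre_1) < 3 AND p_mah_1 > 0.001) can be mirrored in pandas. In this sketch, cases 20000391 and 20001984 are the two outliers identified earlier in the slides; the other two case ids, and the values paired with the non-reported scores, are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "caseid":  [20000391, 20001984, 20000100, 20000200],  # last two are hypothetical
    "sre_1":   [4.14, 1.20, -0.35, 0.88],
    "p_mah_1": [0.4500, 0.0007, 0.6100, 0.2200],
})

# Keep the cases that are NOT outliers on either criterion
keep = df[(df["sre_1"].abs() < 3) & (df["p_mah_1"] > 0.001)]
print(keep["caseid"].tolist())  # → [20000100, 20000200]
```

The univariate outlier fails the first condition, the multivariate outlier fails the second, and both are dropped.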
Slide 89
Completing the request for the selection
To complete the request, we click on the OK button.
Slide 90
The omitted multivariate outlier
SPSS identifies the excluded cases by drawing a slash mark through the case number. Most of the slashes are for cases with missing data, but we also see that the case with the low probability for Mahalanobis distance is included in those that will be omitted.
Slide 91
Running the regression without outliers
We run the regression again, excluding the outliers. Select the Regression | Linear command from the Analyze menu.
Slide 92
Opening the save options dialog
First, specify the dependent and independent variables. If we wanted to use any transformed variables, we would substitute them now.
On our last run, we instructed SPSS to save studentized residuals and Mahalanobis distance. To prevent these values from being calculated again, click on the Save… button.
Slide 93
Clearing the request to save outlier data
First, clear the checkbox for Studentized residuals.
Second, clear the checkbox for Mahalanobis distance.
Third, click on the OK button to complete the specifications.
Slide 94
Opening the statistics options dialog
Once we have removed outliers, we need to check the sample size requirement for regression. Since we will need the descriptive statistics for this, click on the Statistics… button.
Slide 95
Requesting descriptive statistics
First, mark the checkbox for Descriptives.
Second, click on the Continue button to complete the specifications.
Slide 96
Requesting the output
Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.
Slide 97
Sample size requirement
The minimum ratio of valid cases to independent variables for multiple regression is 5 to 1. After removing 2 outliers, there are 252 valid cases and 3 independent variables.
The ratio of cases to independent variables for this analysis is 84.0 to 1, which satisfies the minimum requirement. In addition, the ratio of 84.0 to 1 satisfies the preferred ratio of 15 to 1.
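The ratio check described above is simple arithmetic; a small helper (the function name is ours) makes the two thresholds from the slides explicit:

```python
def cases_to_iv_ratio(n_cases, n_ivs, minimum=5.0, preferred=15.0):
    """Ratio of valid cases to independent variables, checked against
    the minimum (5:1) and preferred (15:1) thresholds from the slides."""
    ratio = n_cases / n_ivs
    return ratio, ratio >= minimum, ratio >= preferred

# The worked example: 252 valid cases after removing 2 outliers, 3 IVs
ratio, meets_minimum, meets_preferred = cases_to_iv_ratio(252, 3)
print(ratio, meets_minimum, meets_preferred)  # → 84.0 True True
```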
Slide 98
Significance of regression relationship
The probability of the F statistic (36.639) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.
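The p-value above can be reproduced from the F statistic and its degrees of freedom. The slides do not print the ANOVA table, so the degrees of freedom here are assumed from the sample: 3 for the regression (the number of independent variables) and 252 - 3 - 1 = 248 for the residual:

```python
from scipy.stats import f

f_stat = 36.639
df_regression = 3          # number of independent variables
df_residual = 252 - 3 - 1  # valid cases minus IVs minus the intercept

# f.sf gives the upper-tail probability of the F distribution
p_value = f.sf(f_stat, df_regression, df_residual)
print(p_value < 0.001)     # → True
```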
Slide 99
Increase in proportion of variance
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 27.1%. No transformed variables were substituted to satisfy assumptions, but outliers were removed from the sample.
The proportion of variance explained by the regression analysis after removing outliers was 30.7%, a difference of 3.6%.
The answer to the question is true with caution.
A caution is added because of a violation of regression assumptions.
Slide 100
Impact of assumptions and outliers - 1
The following is a guide to the decision process for answering problems about the impact of assumptions and outliers on analysis:

- Is the dependent variable metric, and are the independent variables metric or dichotomous? If no, this is an inappropriate application of a statistic.
- If yes, run the baseline regression, using the method for including variables identified in the research question, and record R² for future reference.
- Is the ratio of cases to independent variables at least 5 to 1? If no, this is an inappropriate application of a statistic.
Slide 101
Impact of assumptions and outliers - 2
- Is the dependent variable normally distributed? If no, try: (1) a logarithmic transformation, (2) a square root transformation, (3) an inverse transformation. If unsuccessful, add a caution.
- Are the metric independent variables normally distributed and linearly related to the dependent variable? If no, try: (1) a logarithmic transformation, (2) a square root transformation, (3) a square transformation, (4) an inverse transformation. If unsuccessful, add a caution.
- Is the dependent variable homoscedastic for the categories of the dichotomous independent variables? If no, add a caution.
Slide 102
Impact of assumptions and outliers - 3
- Substituting any transformed variables, run the regression using direct entry to include all variables, in order to request the statistics for detecting outliers.
- Are there univariate outliers on the dependent variable or multivariate outliers on the independent variables? If yes, remove the outliers from the data.
- Is the ratio of cases to independent variables still at least 5 to 1? If no, this is an inappropriate application of a statistic.
- If yes, run the regression again, using the transformed variables and eliminating the outliers.
Slide 103
Impact of assumptions and outliers - 4
- Is the probability of the ANOVA test of the regression less than or equal to the level of significance? If no, the answer is false.
- Is the stated increase in R² correct? If no, the answer is false.
- Does the sample satisfy the preferred ratio of cases to independent variables of 15 to 1 (for stepwise entry, 50 to 1)? If no, the answer is true with caution.
Slide 104
Impact of assumptions and outliers - 5
- Were other cautions added for ordinal variables or for violations of assumptions? If yes, the answer is true with caution; if no, the answer is true.