7/16/2019 MultipleRegression_AssumptionsAndOUtliers
http://slidepdf.com/reader/full/multipleregressionassumptionsandoutliers 1/104
SW388R7
Data Analysis &
Computers II
Slide 1
Multiple Regression – Assumptions and Outliers
Multiple Regression and Assumptions
Multiple Regression and Outliers
Strategy for Solving Problems
Practice Problems
Multiple Regression and Assumptions
Multiple regression is most effective at identifying the relationship between a dependent variable and a combination of independent variables when its underlying assumptions are satisfied: each metric variable is normally distributed, the relationships between metric variables are linear, and the relationship between metric and dichotomous variables is homoscedastic.
Failing to satisfy the assumptions does not mean that our answer is wrong; it means that our solution may under-report the strength of the relationships.
Multiple Regression and Outliers
Outliers can distort the regression results. When an
outlier is included in the analysis, it pulls the
regression line towards itself. This can result in a
solution that is more accurate for the outlier, but
less accurate for all of the other cases in the data
set.
We will check for univariate outliers on the
dependent variable and multivariate outliers on the
independent variables.
Relationship between assumptions and outliers
The problems of satisfying assumptions and detecting
outliers are intertwined. For example, if a case has
a value on the dependent variable that is an outlier,
it will affect the skew, and hence, the normality of
the distribution.
Removing an outlier may improve the distribution of
a variable.
Transforming a variable may reduce the likelihood
that the value for a case will be characterized as an
outlier.
Order of analysis is important
The order in which we check assumptions and detect
outliers will affect our results because we may get a
different subset of cases in the final analysis.
In order to maximize the number of cases available
to the analysis, we will evaluate assumptions first.
We will substitute any transformations of variables that enable us to satisfy the assumptions.
We will use any transformed variables that are required in our analysis to detect outliers.
Strategy for solving problems
Our strategy for solving problems about violations of assumptions and outliers will include the following steps:
1. Run the type of regression specified in the problem statement on the variables, using the full data set.
2. Test the dependent variable for normality. If it satisfies the criteria for normality only after transformation, substitute the transformed variable in the remaining tests that call for the use of the dependent variable.
3. Test for normality, linearity, and homoscedasticity using the scripts. Decide which transformations should be used.
4. Substitute transformations and run the regression entering all independent variables, saving studentized residuals and Mahalanobis distance scores. Compute probabilities for D².
5. Remove the outliers (studentized residual greater than 3 or Mahalanobis D² with p <= 0.001), and run the regression with the method and variables specified in the problem.
6. Compare R² for the analysis using transformed variables and omitting outliers (step 5) to the R² obtained for the model using all data and original variables (step 1).
Transforming dependent variables
We will use the following logic to transform variables:
If the dependent variable is not normally distributed:
Try the log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria.
If no transformation satisfies the normality criteria, use the untransformed variable and add a caution for the violation of the assumption.
If a transformation satisfies normality, use the transformed variable in the tests of the independent variables.
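The transformation logic above can be sketched in code. This is an illustrative Python sketch (not the SPSS script the slides use) that applies the slides' rule of thumb for normality, skewness and kurtosis both between -1.0 and +1.0, and tries the transformations in the stated order of preference:

```python
import numpy as np
from scipy import stats

def is_normal_enough(x):
    """Rule of thumb from the slides: skewness and kurtosis in [-1, +1]."""
    return (abs(stats.skew(x, bias=False)) <= 1.0 and
            abs(stats.kurtosis(x, bias=False)) <= 1.0)

def transform_for_normality(x):
    """Return (name, values) for the first candidate that passes the rule."""
    x = np.asarray(x, dtype=float)
    candidates = [
        ("untransformed", x),
        ("log",           np.log10(x + 1)),   # +1 guards against log(0)
        ("square root",   np.sqrt(x + 1)),
        ("inverse",       -1.0 / (x + 1)),
    ]
    for name, t in candidates:
        if is_normal_enough(t):
            return name, t
    # No transformation worked: use the original variable with a caution.
    return "untransformed (caution)", x
```

Note that SPSS reports skewness and kurtosis with bias-corrected estimators, which is why `bias=False` is passed to the scipy functions here.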
Transforming independent variables - 1
If the independent variable is normally distributed and linearly related to the dependent variable, use it as is.
If the independent variable is normally distributed but not linearly related to the dependent variable:
Try the log, square root, square, and inverse transformations. Use the first transformed variable that satisfies the linearity criteria and does not violate the normality criteria.
If no transformation satisfies the linearity criteria without violating the normality criteria, use the untransformed variable and add a caution for the violation of the assumption.
Transforming independent variables - 2
If the independent variable is linearly related to the dependent variable but not normally distributed:
Try the log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria and retains a significant correlation with the dependent variable.
If no transformation satisfies the normality criteria with a significant correlation, use the untransformed variable and add a caution for the violation of the assumption.
Transforming independent variables - 3
If the independent variable is not linearly related to the dependent variable and is not normally distributed:
Try the log, square root, square, and inverse transformations. Use the first transformed variable that satisfies the normality criteria and has a significant correlation.
If no transformation satisfies the normality criteria with a significant correlation, use the untransformed variable and add a caution for the violation of the assumption.
Impact of transformations and omitting outliers
We evaluate the regression assumptions and detect
outliers with a view toward strengthening the
relationship.
This may not happen. The regression may be the same, it may be weaker, or it may be stronger. We cannot be certain of the impact until we run the regression again.
In the end, we may opt not to exclude outliers and
not to employ transformations; the analysis informs
us of the consequences of doing either.
Notes
Whenever you start a new problem, make sure you have removed variables created for a previous analysis and have included all cases back into the data set.
I have added the square transformation to the checkboxes for transformations in the normality script. Since this is an option for linearity, we need to be able to evaluate its impact on normality.
If you change the options for output in pivot tables from labels to names, you will get an error message when you use the linearity script. To solve the problem, change the option for output in pivot tables back to labels.
Problem 1
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 1 - 1
The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.01 as alpha for the regression analysis as well as for testing assumptions.
Dissecting problem 1 - 2
The method for selecting variables is derived from the research question. In this problem we are asked to identify the best subset of predictors, so we do a stepwise multiple regression.
Dissecting problem 1 - 3
The purpose of testing for assumptions and outliers is to identify a stronger model. The main question to be answered in this problem is whether the use of transformed variables to satisfy assumptions and the removal of outliers improves the overall relationship between the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether the R² for a regression analysis after substituting transformed variables and eliminating outliers is 10.8% higher than for a regression analysis using the original format for all variables and including all cases.
R² before transformations or removing outliers
To start out, we run a stepwise multiple regression analysis with income98 as the dependent variable and sex, earnrs, and rincom98 as the independent variables.
We select stepwise as the method to select the best subset of predictors.
R² before transformations or removing outliers
Prior to any transformations of variables to satisfy the assumptions of multiple regression, or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%. This is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.
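The benchmark idea can be sketched outside SPSS as well: fit the full-data model, record R², and later compare against the model fitted after transformations and outlier removal. A minimal Python sketch with plain least squares on synthetic data (not the GSS2000.sav variables):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
# Design matrix: intercept plus three stand-in independent variables.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 2, n)

# Ordinary least squares fit and R-squared of the baseline model.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2_baseline = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(f"baseline R^2 = {r2_baseline:.3f}")
```

After substituting transformed variables and dropping outlier cases, the same computation on the modified data yields the R² that is compared to this baseline.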
R² before transformations or removing outliers
For this particular question, we are not interested in the statistical significance of the overall relationship prior to transformations and removing outliers. In fact, it is possible that the relationship is not statistically significant due to variables that are not normal, relationships that are not linear, and the inclusion of outliers.
Normality of the dependent variable: total family income
In evaluating assumptions, the first step is to examine the normality of the dependent variable. If it is not normally distributed, or cannot be normalized with a transformation, it can affect the relationships with all other variables.
To test the normality of the dependent variable, run the script: NormalityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: TOTAL FAMILY INCOME

                                     Statistic   Std. Error
Mean                                     15.67         .349
95% Confidence Interval for Mean
  Lower Bound                            14.98
  Upper Bound                            16.36
5% Trimmed Mean                          15.95
Median                                   17.00
Variance                                27.951
Std. Deviation                           5.287
Minimum                                      1
Maximum                                     23
Range                                       22
Interquartile Range                       8.00
Skewness                                  -.628        .161
Kurtosis                                  -.248        .320
Normality of the dependent variable: total family income
The dependent variable "total family income" [income98] satisfies the criteria for a normal distribution. The skewness (-0.628) and kurtosis (-0.248) were both between -1.0 and +1.0. No transformation is necessary.
Linearity and independent variable: how many in family earned money
To evaluate the linearity of the relationship between number of earners and total family income, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the text box for the dependent variable.
Second, move the independent variable, EARNRS, to the list box for independent variables.
Third, click on the OK button to produce the output.
Correlations (Pearson Correlation / Sig. 2-tailed / N)

                                          TOTAL FAMILY       HOW MANY IN FAMILY
                                          INCOME             EARNED MONEY
TOTAL FAMILY INCOME                       1      (.)    229   .505** (.000) 228
HOW MANY IN FAMILY EARNED MONEY           .505** (.000) 228   1      (.)   269
Logarithm of EARNRS [LG10(1+EARNRS)]      .536** (.000) 228   .959** (.000) 269
Square of EARNRS [(EARNRS)**2]            .376** (.000) 228   .908** (.000) 269
Square Root of EARNRS [SQRT(1+EARNRS)]    .527** (.000) 228   .989** (.000) 269
Inverse of EARNRS [-1/(1+EARNRS)]         .526** (.000) 228   .871** (.000) 269

**. Correlation is significant at the 0.01 level (2-tailed).
Linearity and independent variable: how many in family earned money
The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the two variables was the statistical significance of the correlation coefficient (r = 0.505). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.
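The linearity check described above is a Pearson correlation whose p-value is compared against alpha = 0.01. A hedged Python sketch with synthetic stand-ins for EARNRS and INCOME98 (not the actual GSS data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic data: earner counts 0-5 and an income score linearly related to them.
earnrs = rng.integers(0, 6, 200).astype(float)
income = 10 + 2.5 * earnrs + rng.normal(0, 3, 200)

r, p = stats.pearsonr(earnrs, income)
if p <= 0.01:
    print(f"linear relationship supported (r = {r:.3f}, p = {p:.3g})")
else:
    print("no evidence of a linear relationship at alpha = 0.01")
```

A significant r rejects the null hypothesis that r = 0, which is the slides' evidence for linearity; it does not by itself rule out an additional curvilinear component.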
Normality of independent variable: how many in family earned money
After evaluating the dependent variable, we examine the normality of each metric variable and the linearity of its relationship with the dependent variable.
To test the normality of the number of earners in family, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable EARNRS to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: HOW MANY IN FAMILY EARNED MONEY

                                     Statistic   Std. Error
Mean                                      1.43         .061
95% Confidence Interval for Mean
  Lower Bound                             1.31
  Upper Bound                             1.56
5% Trimmed Mean                           1.37
Median                                    1.00
Variance                                 1.015
Std. Deviation                           1.008
Minimum                                      0
Maximum                                      5
Range                                        5
Interquartile Range                       1.00
Skewness                                  .742        .149
Kurtosis                                 1.324        .296
Normality of independent variable: how many in family earned money
The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality.
In evaluating normality, the skewness (0.742) was between -1.0 and +1.0, but the kurtosis (1.324) was outside the range from -1.0 to +1.0.
Normality of independent variable: how many in family earned money
The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98]. In evaluating normality, the skewness (-0.483) and kurtosis (-0.309) were both within the range of acceptable values from -1.0 to +1.0. The correlation coefficient for the transformed variable is 0.536.
The square root transformation also has values of skewness and kurtosis in the acceptable range. However, by our order of preference for which transformation to use, the logarithm is preferred.
Transformation for how many in family earned money
The independent variable, how many in family
earned money, had a linear relationship to the
dependent variable, total family income.
The logarithmic transformation improves the
normality of "how many in family earned money"
[earnrs] without a reduction in the strength of the
relationship to "total family income" [income98].
We will substitute the logarithmic transformation of
how many in family earned money in the regression
analysis.
Normality of independent variable: respondent's income
After evaluating the dependent variable, we examine the normality of each metric variable and the linearity of its relationship with the dependent variable.
To test the normality of respondent's income, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable RINCOM98 to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: RESPONDENTS INCOME

                                     Statistic   Std. Error
Mean                                     13.35         .419
95% Confidence Interval for Mean
  Lower Bound                            12.52
  Upper Bound                            14.18
5% Trimmed Mean                          13.54
Median                                   15.00
Variance                                29.535
Std. Deviation                           5.435
Minimum                                      1
Maximum                                     23
Range                                       22
Interquartile Range                       8.00
Skewness                                  -.686        .187
Kurtosis                                  -.253        .373
Normality of independent variable: respondent's income
The independent variable "income" [rincom98] satisfies the criteria for both the assumption of normality and the assumption of linearity with the dependent variable "total family income" [income98].
In evaluating normality, the skewness (-0.686) and kurtosis (-0.253) were both within the range of acceptable values from -1.0 to +1.0.
Linearity and independent variable: respondent's income
To evaluate the linearity of the relationship between respondent's income and total family income, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the text box for the dependent variable.
Second, move the independent variable, RINCOM98, to the list box for independent variables.
Third, click on the OK button to produce the output.
Correlations (Pearson Correlation / Sig. 2-tailed / N)

                                              TOTAL FAMILY       RESPONDENTS        Logarithm of RINCOM98
                                              INCOME             INCOME             [LG10(24-RINCOM98)]
TOTAL FAMILY INCOME                           1      (.)    229   .577** (.000) 163  -.595** (.000) 163
RESPONDENTS INCOME                            .577** (.000) 163   1      (.)   168   -.922** (.000) 168
Logarithm of RINCOM98 [LG10(24-RINCOM98)]     -.595** (.000) 163  -.922** (.000) 168  1      (.)   168
Square of RINCOM98 [(RINCOM98)**2]            .613** (.000) 163   .967** (.000) 168  -.976** (.000) 168
Square Root of RINCOM98 [SQRT(24-RINCOM98)]   -.601** (.000) 163  -.985** (.000) 168  .974** (.000) 168
Inverse of RINCOM98 [-1/(24-RINCOM98)]        -.434** (.000) 163  -.602** (.000) 168  .848** (.000) 168

**. Correlation is significant at the 0.01 level (2-tailed).
Linearity and independent variable: respondent's income
The evidence of linearity in the relationship between the independent variable "income" [rincom98] and the dependent variable "total family income" [income98] was the statistical significance of the correlation coefficient (r = 0.577). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.
Homoscedasticity: sex
To evaluate the homoscedasticity of the relationship between sex and total family income, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the text box for the dependent variable.
Second, move the independent variable, SEX, to the list box for independent variables.
Third, click on the OK button to produce the output.
Homoscedasticity: sex
Based on the Levene test, the variance in "total family income" [income98] is homogeneous for the categories of "sex" [sex].
The probability associated with the Levene statistic (0.031) is greater than the level of significance (0.01), so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.
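The same homogeneity-of-variance test can be sketched in Python with `scipy.stats.levene`. The two groups below are synthetic stand-ins for the SEX categories, not the GSS data; `center='mean'` is passed because SPSS's Levene test centers on the group means (scipy's default is the median, the Brown-Forsythe variant):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic income scores for two groups with equal spread.
income_group1 = rng.normal(15, 5, 120)
income_group2 = rng.normal(15, 5, 140)

# Mean-centered Levene test, matching the SPSS convention.
stat, p = stats.levene(income_group1, income_group2, center='mean')
if p > 0.01:
    print(f"fail to reject equal variances (W = {stat:.3f}, p = {p:.3f})")
else:
    print("variances differ at alpha = 0.01; homoscedasticity violated")
```

As on the slide, a p-value above the 0.01 level of significance means we retain the null hypothesis of equal variances.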
Adding a transformed variable
Even though we do not need a transformation for any of the variables in this analysis, we will demonstrate how to use a script, such as the normality script, to add a transformed variable to the data set, e.g. a logarithmic transformation for highest year of school.
First, move the variable that we want to transform to the list box of variables to test.
Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes.
Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable.
Fourth, click on the OK button to produce the output.
The transformed variable in the data editor
If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set.
Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.
The regression to identify outliers
We use the regression procedure to identify both univariate and multivariate outliers.
We start with the same dialog we used for the last analysis, in which income98 was the dependent variable and sex, earnrs, and rincom98 were the independent variables.
First, we substitute the logarithmic transformation of earnrs, logearn, into the list of independent variables.
Second, we change the method of entry from Stepwise to Enter so that all variables will be included in the detection of outliers.
Third, we want to save the calculated values of the outlier statistics to the data set. Click on the Save… button to specify what we want to save.
Saving the measures of outliers
First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set.
Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables.
Third, click on the OK button to complete the specifications.
The variables for identifying outliers
The values for identifying univariate outliers on the dependent variable are in a column which SPSS has named sre_1.
The values for identifying multivariate outliers on the independent variables are in a column which SPSS has named mah_1.
Computing the probability for Mahalanobis D²
To compute the probability of D², we will use an SPSS function in a Compute command.
First, select the Compute… command from the Transform menu.
Formula for probability for Mahalanobis D²
First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score.
Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.
Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.
Third, click on the OK button to signal completion of the compute variable dialog.
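The Compute step above, `p_mah_1 = 1 - CDF.CHISQ(mah_1, 3)`, has a direct Python equivalent. Under multivariate normality, Mahalanobis D² for k independent variables follows a chi-square distribution with k degrees of freedom; `chi2.sf` is the upper-tail probability (1 minus the CDF). The D² scores below are made-up examples, not values from the GSS data:

```python
from scipy.stats import chi2

mah_1 = [4.2, 7.9, 16.9]   # example Mahalanobis D-squared scores (made up)
k = 3                      # number of independent variables, as in the slides

# sf(x, df) = 1 - cdf(x, df): the upper-tail probability of D-squared.
p_mah_1 = [chi2.sf(d2, df=k) for d2 in mah_1]

# A case is flagged as a multivariate outlier when its probability <= 0.001.
outliers = [p <= 0.001 for p in p_mah_1]
print(outliers)   # only the largest D-squared exceeds the 0.001 cutoff
```

The cutoff D² for p = 0.001 at 3 degrees of freedom is about 16.27, so only the third score is flagged.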
Multivariate outliers
Using the probabilities computed in p_mah_1 to identify outliers, scroll down through the list of cases to see if we can find cases with a probability less than 0.001.
There are no outliers for the set of independent variables.
Univariate outliers
Similarly, we can scroll down the values of sre_1, the studentized residual, to see the outliers with values larger than ±3.0.
Based on these criteria, there are 4 outliers. There are 4 cases that have a score on the dependent variable that is sufficiently unusual to be considered outliers (case 20000357: studentized residual = 3.08; case 20000416: studentized residual = 3.57; case 20001379: studentized residual = 3.27; case 20002702: studentized residual = -3.23).
Omitting the outliers
To omit the outliers from the analysis, we select only the cases that are not outliers.
First, select the Select Cases… command from the Data menu.
Specifying the condition to omit outliers
First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.
Second, click on the If… button to specify the criteria for inclusion in the analysis.
The formula for omitting outliers
To eliminate the outliers, we request the cases that are not outliers.
The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3 and the probability for Mahalanobis D² is higher than the level of significance, 0.001.
After typing in the formula, click on the Continue button to close the dialog box.
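The SPSS selection condition (ABS(sre_1) < 3 AND p_mah_1 > 0.001) can be mirrored as a simple row filter. This sketch uses made-up values for sre_1 and p_mah_1:

```python
# Each tuple: (case label, studentized residual, Mahalanobis D-squared probability)
cases = [
    ("A", 0.5, 0.80),    # ordinary case -> kept
    ("B", 3.6, 0.40),    # univariate outlier -> dropped
    ("C", 1.1, 0.0004),  # multivariate outlier -> dropped
]
kept = [name for name, sre_1, p_mah_1 in cases
        if abs(sre_1) < 3 and p_mah_1 > 0.001]
```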
Completing the request for the selection
To complete the request, we click on the OK button.
The omitted multivariate outlier
SPSS identifies the excluded cases by drawing a slash mark through the case number. Most of the slashes are for cases with missing data, but we also see that the case with the low probability for Mahalanobis distance is included in those that will be omitted.
Running the regression without outliers
We run the regression again, excluding the outliers. Select the Regression | Linear command from the Analyze menu.
Opening the save options dialog
We specify the dependent and independent variables, substituting any transformed variables required by assumptions.
On our last run, we instructed SPSS to save studentized residuals and Mahalanobis distance. To prevent these values from being calculated again, click on the Save… button.
When we used regression to detect outliers, we entered all variables. Now we are testing the relationship specified in the problem, so we change the method to Stepwise.
Clearing the request to save outlier data
First, clear the checkbox for Studentized residuals.
Second, clear the checkbox for Mahalanobis distance.
Third, click on the OK button to complete the specifications.
Opening the statistics options dialog
Once we have removed outliers, we need to check the sample size requirement for regression. Since we will need the descriptive statistics for this, click on the Statistics… button.
Requesting descriptive statistics
First, mark the checkbox for Descriptives.
Second, click on the Continue button to complete the specifications.
Requesting the output
Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.
Descriptive Statistics

                                        Mean   Std. Deviation    N
TOTAL FAMILY INCOME                    17.09            4.073  159
RESPONDENTS SEX                         1.55             .499  159
RESPONDENTS INCOME                     13.76            5.133  159
Logarithm of EARNRS [LG10(1+EARNRS)] .424896         .1156559  159
Sample size requirement
The minimum ratio of valid cases to independent variables for stepwise multiple regression is 5 to 1. After removing 4 outliers, there are 159 valid cases and 3 independent variables.
The ratio of cases to independent variables for this analysis is 53.0 to 1, which satisfies the minimum requirement. In addition, the ratio of 53.0 to 1 satisfies the preferred ratio of 50 to 1.
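As a quick sanity check, the ratio arithmetic can be sketched in Python (the counts are the ones reported on this slide):

```python
# Minimum ratio of valid cases to independent variables for
# stepwise regression is 5:1; the preferred ratio is 50:1.
valid_cases = 159        # after removing 4 outliers
independent_vars = 3
ratio = valid_cases / independent_vars
meets_minimum = ratio >= 5
meets_preferred = ratio >= 50
```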
ANOVA(d)

Model                Sum of Squares    df   Mean Square         F   Sig.
1   Regression             1122.398     1      1122.398   117.541  .000a
    Residual               1499.187   157         9.549
    Total                  2621.585   158
2   Regression             1572.722     2       786.361   116.957  .000b
    Residual               1048.863   156         6.723
    Total                  2621.585   158
3   Regression             1623.976     3       541.325    84.107  .000c
    Residual                997.609   155         6.436
    Total                  2621.585   158

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)], RESPONDENTS SEX
d. Dependent Variable: TOTAL FAMILY INCOME
Significance of regression relationship
The probability of the F statistic (84.107) for the regression relationship which includes these variables is <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that there is no relationship between the best subset of independent variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a statistically significant relationship between the best subset of independent variables and the dependent variable.
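The p-value for the model 3 F statistic can be reproduced from the F distribution, using the F value and degrees of freedom from the ANOVA table (3 regression df, 155 residual df):

```python
from scipy.stats import f

# Upper-tail probability of F = 84.107 with df = (3, 155)
p_value = f.sf(84.107, dfn=3, dfd=155)
reject_null = p_value <= 0.01
```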
Model Summary

Model       R   R Square   Adjusted R Square   Std. Error of the Estimate
1       .654a       .428                .424                        3.090
2       .775b       .600                .595                        2.593
3       .787c       .619                .612                        2.537

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)], RESPONDENTS SEX
Increase in proportion of variance
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%.
After transformed variables were substituted to satisfy assumptions and outliers were removed from the sample, the proportion of variance explained by the regression analysis was 61.9%, a difference of 10.8%.
The answer to the question is true with caution.
A caution is added because of the inclusion of ordinal level variables.
Problem 2
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 2 - 1
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.05 as alpha for the regression analysis and the more conservative 0.01 as the alpha in testing assumptions.
Dissecting problem 2 - 2
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
The method for selecting variables is derived from the research question. If we are asked to examine a relationship without any statement about control variables or the best subset of variables, we do a standard multiple regression.
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 2 - 3
The purpose of testing for assumptions and outliers is to identify a stronger model. The main question to be answered in this problem is whether or not the use of transformed variables to satisfy assumptions and the removal of outliers improves the overall relationship between the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether or not the R² for a regression analysis after substituting transformed variables and eliminating outliers is 3.6% higher than a regression analysis using the original format for all variables and including all cases.
R² before transformations or removing outliers
To start out, we run a standard multiple regression analysis with prestg80 as the dependent variable and age, educ, and sex as the independent variables.
R² before transformations or removing outliers
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 27.1%. This is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.
For this particular question, we are not interested in the statistical significance of the overall relationship prior to transformations and removing outliers. In fact, it is possible that the relationship is not statistically significant due to variables that are not normal, relationships that are not linear, and the inclusion of outliers.
Normality of the dependent variable
In evaluating assumptions, the first step is to examine the normality of the dependent variable. If it is not normally distributed, or cannot be normalized with a transformation, it can affect the relationships with all other variables.
To test the normality of the dependent variable, run the script: NormalityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the list box of variables to test.
Second, click on the OK button to produce the output.
Normality of the dependent variable
The dependent variable "occupational prestige score" [prestg80] satisfies the criteria for a normal distribution. The skewness (0.401) and kurtosis (-0.630) were both between -1.0 and +1.0. No transformation is necessary.
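The -1.0 to +1.0 rule of thumb used throughout these slides can be expressed as a small helper. This is just a sketch of the decision rule, not part of the SPSS script:

```python
# A variable is treated as "normal enough" here when both its
# skewness and kurtosis fall between -1.0 and +1.0.
def is_normal_enough(skewness, kurtosis):
    return -1.0 <= skewness <= 1.0 and -1.0 <= kurtosis <= 1.0

# Values reported for prestg80 on this slide
prestige_ok = is_normal_enough(0.401, -0.630)
```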
Normality of independent variable: Age
After evaluating the dependent variable, we examine the normality of each metric variable and the linearity of its relationship with the dependent variable.
To test the normality of age, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable AGE to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: AGE OF RESPONDENT

                                      Statistic   Std. Error
Mean                                      45.99        1.023
95% Confidence     Lower Bound            43.98
Interval for Mean  Upper Bound            48.00
5% Trimmed Mean                           45.31
Median                                    43.50
Variance                                282.465
Std. Deviation                           16.807
Minimum                                      19
Maximum                                      89
Range                                        70
Interquartile Range                       24.00
Skewness                                   .595         .148
Kurtosis                                  -.351         .295
Normality of independent variable: Age
The independent variable "age" [age] satisfies the criteria for the assumption of normality, but does not satisfy the assumption of linearity with the dependent variable "occupational prestige score" [prestg80].
In evaluating normality, the skewness (0.595) and kurtosis (-0.351) were both within the range of acceptable values from -1.0 to +1.0.
Linearity and independent variable: Age
To evaluate the linearity of the relationship between age and occupational prestige, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the text box for the dependent variable.
Second, move the independent variable, AGE, to the list box for independent variables.
Third, click on the OK button to produce the output.
Correlations

(Pearson r / Sig. 2-tailed / N; ** = significant at the 0.01 level, 2-tailed. The columns for Square Root of AGE and Inverse of AGE were cut off in the source.)

                                 PRESTG80    AGE       LG10(AGE)   (AGE)**2
RS OCCUPATIONAL PRESTIGE         1           .024      .059        -.004
SCORE (1980)                     .           .706      .348        .956
                                 255         255       255         255
AGE OF RESPONDENT                .024        1         .979**      .983**
                                 .706        .         .000        .000
                                 255         270       270         270
Logarithm of AGE [LG10(AGE)]     .059        .979**    1           .926**
                                 .348        .000      .           .000
                                 255         270       270         270
Square of AGE [(AGE)**2]         -.004       .983**    .926**      1
                                 .956        .000      .000        .
                                 255         270       270         270
Square Root of AGE [SQRT(AGE)]   .041        .995**    .994**      .960**
                                 .518        .000      .000        .000
                                 255         270       270         270
Inverse of AGE [-1/(AGE)]        .096        .916**    .978**      .832**
                                 .128        .000      .000        .000
                                 255         270       270         270
Linearity and independent variable: Age
The evidence of nonlinearity in the relationship between the independent variable "age" [age] and the dependent variable "occupational prestige score" [prestg80] was the lack of statistical significance of the correlation coefficient (r = 0.024). The probability for the correlation coefficient was 0.706, greater than the level of significance of 0.01. We cannot reject the null hypothesis that r = 0, and cannot conclude that there is a linear relationship between the variables.
Since none of the transformations to improve linearity were successful, it is an indication that the problem may be a weak relationship, rather than a curvilinear relationship correctable by using a transformation. A weak relationship is not a violation of the assumption of linearity, and does not require a caution.
Transformation for Age
The independent variable age satisfied the criteria for normality.
The independent variable age did not have a linear relationship to the dependent variable occupational prestige. However, none of the transformations linearized the relationship.
No transformation will be used - it would not help linearity and is not needed for normality.
Linearity and independent variable: Highest year of school completed
To evaluate the linearity of the relationship between highest year of school and occupational prestige, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the text box for the dependent variable.
Second, move the independent variable, EDUC, to the list box for independent variables.
Third, click on the OK button to produce the output.
Correlations

(Pearson r / Sig. 2-tailed / N; ** = significant at the 0.01 level, 2-tailed. The columns for Square Root of EDUC and Inverse of EDUC were cut off in the source.)

                                     PRESTG80   EDUC      LG10(21-EDUC)  (EDUC)**2
RS OCCUPATIONAL PRESTIGE             1          .495**    -.512**        .528**
SCORE (1980)                         .          .000      .000           .000
                                     255        254       254            254
HIGHEST YEAR OF                      .495**     1         -.920**        .980**
SCHOOL COMPLETED                     .000       .         .000           .000
                                     254        269       269            269
Logarithm of EDUC [LG10(21-EDUC)]    -.512**    -.920**   1              -.969**
                                     .000       .000      .              .000
                                     254        269       269            269
Square of EDUC [(EDUC)**2]           .528**     .980**    -.969**        1
                                     .000       .000      .000           .
                                     254        269       269            269
Square Root of EDUC [SQRT(21-EDUC)]  -.518**    -.982**   .977**         -.997**
                                     .000       .000      .000           .000
                                     254        269       269            269
Inverse of EDUC [-1/(21-EDUC)]       -.423**    -.699**   .915**         -.789**
                                     .000       .000      .000           .000
                                     254        269       269            269
Linearity and independent variable: Highest year of school completed
The independent variable "highest year of school completed" [educ] satisfies the criteria for the assumption of linearity with the dependent variable "occupational prestige score" [prestg80], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the independent variable "highest year of school completed" [educ] and the dependent variable "occupational prestige score" [prestg80] was the statistical significance of the correlation coefficient (r = 0.495). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.
Normality of independent variable: Highest year of school completed
To test the normality of EDUC, highest year of school completed, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable EDUC to the list box of variables to test.
Second, click on the OK button to produce the output.
Descriptives: HIGHEST YEAR OF SCHOOL COMPLETED

                                      Statistic   Std. Error
Mean                                      13.12         .179
95% Confidence     Lower Bound            12.77
Interval for Mean  Upper Bound            13.47
5% Trimmed Mean                           13.14
Median                                    13.00
Variance                                  8.583
Std. Deviation                            2.930
Minimum                                       2
Maximum                                      20
Range                                        18
Interquartile Range                        3.00
Skewness                                  -.137         .149
Kurtosis                                  1.246         .296
Normality of independent variable: Highest year of school completed
In evaluating normality, the skewness (-0.137) was between -1.0 and +1.0, but the kurtosis (1.246) was outside the range from -1.0 to +1.0. None of the transformations for normalizing the distribution of "highest year of school completed" [educ] were effective.
Transformation for highest year of school
The independent variable, highest year of school, had a linear relationship to the dependent variable, occupational prestige.
The independent variable, highest year of school, did not satisfy the criteria for normality. None of the transformations for normalizing the distribution of "highest year of school completed" [educ] were effective.
No transformation will be used - it would not help normality and is not needed for linearity. A caution should be added to any findings.
Homoscedasticity: sex
To evaluate the homoscedasticity of the relationship between sex and occupational prestige, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the text box for the dependent variable.
Second, move the independent variable, SEX, to the list box for independent variables.
Third, click on the OK button to produce the output.
Homoscedasticity: sex
Based on the Levene Test, the variance in "occupational prestige score" [prestg80] is homogeneous for the categories of "sex" [sex]. The probability associated with the Levene Statistic (0.808) is greater than the level of significance, so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.
Even if we violate the assumption, we would not do a transformation since it could impact the relationships of the other independent variables with the dependent variable.
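The Levene test itself is easy to reproduce. This sketch uses constructed groups with identical spread (not the GSS data), so the test clearly fails to reject the null hypothesis of equal variances:

```python
from scipy.stats import levene

# Two groups with different centers but identical spread, standing in
# for prestige scores split by a dichotomous variable such as sex.
group_a = [40, 42, 44, 46, 48] * 10
group_b = [39, 41, 43, 45, 47] * 10

stat, p = levene(group_a, group_b)
homoscedastic = p > 0.01   # fail to reject the null hypothesis
```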
Adding a transformed variable
Even though we do not need a transformation for any of the variables in this analysis, we will demonstrate how to use a script, such as the normality script, to add a transformed variable to the data set, e.g. a logarithmic transformation for highest year of school.
First, move the variable that we want to transform to the list box of variables to test.
Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes.
Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable.
Fourth, click on the OK button to produce the output.
The transformed variable in the data editor
If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set.
Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.
The regression to identify outliers
We can use the regression procedure to identify both univariate and multivariate outliers.
We start with the same dialog we used for the last analysis, in which prestg80 was the dependent variable and age, educ, and sex were the independent variables.
If we need to use any transformed variables, we would substitute them now.
We will save the calculated values of the outlier statistics to the data set. Click on the Save… button to specify what we want to save.
Saving the measures of outliers
First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set.
Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables.
Third, click on the OK button to complete the specifications.
The variables for identifying outliers
The variables for identifying univariate outliers for the dependent variable are in a column which SPSS has named sre_1.
The variables for identifying multivariate outliers for the independent variables are in a column which SPSS has named mah_1.
Computing the probability for Mahalanobis D²
To compute the probability of D², we will use an SPSS function in a Compute command.
First, select the Compute… command from the Transform menu.
Formula for probability for Mahalanobis D²
First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score.
Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.
Third, click on the OK button to signal completion of the Compute Variable dialog.
Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.
The multivariate outlier
Using the probabilities computed in p_mah_1 to identify outliers, scroll down through the list of cases to see the one case with a probability less than 0.001.
There is 1 case that has a combination of scores on the independent variables that is sufficiently unusual to be considered an outlier (case 20001984: Mahalanobis D²=16.97, p=0.0007).
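The whole multivariate screen, from D² scores to the 0.001 probability cutoff, can be sketched end to end. This illustration uses synthetic data with one planted extreme case; it mirrors the logic of the SPSS steps rather than SPSS's internal computation:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_d2(X):
    # Mahalanobis D-squared of each row from the column means of X,
    # using the sample covariance of the independent variables.
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Synthetic data: 100 cases, 3 predictors, one planted extreme case.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[0] = [8, -8, 8]                      # the planted outlier

d2 = mahalanobis_d2(X)
p = chi2.sf(d2, df=3)                  # upper-tail probability
outlier_rows = np.where(p < 0.001)[0]  # the 0.001 rule from the slides
```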
The univariate outlier
Similarly, we can scroll down the values of sre_1, the studentized residual, to see the one outlier with a value larger than 3.0.
There is 1 case that has a score on the dependent variable that is sufficiently unusual to be considered an outlier (case 20000391: studentized residual=4.14).
Omitting the outliers
To omit the outliers from the analysis, we select only the cases that are not outliers.
First, select the Select Cases… command from the Data menu.
Specifying the condition to omit outliers
First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.
Second, click on the If… button to specify the criteria for inclusion in the analysis.
Slide 88
The formula for omitting outliers
To eliminate the outliers, we request the cases that are not outliers.
The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3 and the probability for Mahalanobis D² is higher than the level of significance, 0.001.
After typing in the formula, click on the Continue button to close the dialog box.
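The same inclusion condition (ABS(sre_1) < 3 AND p_mah_1 > 0.001) can be mirrored in pandas. In this sketch, cases 20000391 and 20001984 are the two outliers identified earlier in the slides; the other two case ids, and the values paired with the non-reported scores, are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "caseid":  [20000391, 20001984, 20000100, 20000200],  # last two are hypothetical
    "sre_1":   [4.14, 1.20, -0.35, 0.88],
    "p_mah_1": [0.4500, 0.0007, 0.6100, 0.2200],
})

# Keep the cases that are NOT outliers on either criterion
keep = df[(df["sre_1"].abs() < 3) & (df["p_mah_1"] > 0.001)]
print(keep["caseid"].tolist())  # → [20000100, 20000200]
```

The univariate outlier fails the first condition, the multivariate outlier fails the second, and both are dropped.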
Slide 89
Completing the request for the selection
To complete the request, we click on the OK button.
Slide 90
The omitted multivariate outlier
SPSS identifies the excluded cases by drawing a slash mark through the case number. Most of the slashes are for cases with missing data, but we also see that the case with the low probability for Mahalanobis distance is included in those that will be omitted.
Slide 91
Running the regression without outliers
We run the regression again, excluding the outliers. Select the Regression | Linear command from the Analyze menu.
Slide 92
Opening the save options dialog
First, specify the dependent and independent variables. If we wanted to use any transformed variables, we would substitute them now.
On our last run, we instructed SPSS to save studentized residuals and Mahalanobis distance. To prevent these values from being calculated again, click on the Save… button.
Slide 93
Clearing the request to save outlier data
First, clear the checkbox for Studentized residuals.
Second, clear the checkbox for Mahalanobis distance.
Third, click on the OK button to complete the specifications.
Slide 94
Opening the statistics options dialog
Once we have removed outliers, we need to check the sample size requirement for regression. Since we will need the descriptive statistics for this, click on the Statistics… button.
Slide 95
Requesting descriptive statistics
First, mark the checkbox for Descriptives.
Second, click on the Continue button to complete the specifications.
Slide 96
Requesting the output
Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.
Slide 97
Sample size requirement
The minimum ratio of valid cases to independent variables for multiple regression is 5 to 1. After removing 2 outliers, there are 252 valid cases and 3 independent variables.
The ratio of cases to independent variables for this analysis is 84.0 to 1, which satisfies the minimum requirement. In addition, the ratio of 84.0 to 1 satisfies the preferred ratio of 15 to 1.
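The ratio check described above is simple arithmetic; a small helper (the function name is ours) makes the two thresholds from the slides explicit:

```python
def cases_to_iv_ratio(n_cases, n_ivs, minimum=5.0, preferred=15.0):
    """Ratio of valid cases to independent variables, checked against
    the minimum (5:1) and preferred (15:1) thresholds from the slides."""
    ratio = n_cases / n_ivs
    return ratio, ratio >= minimum, ratio >= preferred

# The worked example: 252 valid cases after removing 2 outliers, 3 IVs
ratio, meets_minimum, meets_preferred = cases_to_iv_ratio(252, 3)
print(ratio, meets_minimum, meets_preferred)  # → 84.0 True True
```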
Slide 98
Significance of regression relationship
The probability of the F statistic (36.639) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.
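The p-value above can be reproduced from the F statistic and its degrees of freedom. The slides do not print the ANOVA table, so the degrees of freedom here are assumed from the sample: 3 for the regression (the number of independent variables) and 252 - 3 - 1 = 248 for the residual:

```python
from scipy.stats import f

f_stat = 36.639
df_regression = 3          # number of independent variables
df_residual = 252 - 3 - 1  # valid cases minus IVs minus the intercept

# f.sf gives the upper-tail probability of the F distribution
p_value = f.sf(f_stat, df_regression, df_residual)
print(p_value < 0.001)     # → True
```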
Slide 99
Increase in proportion of variance
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 27.1%. No transformed variables were substituted to satisfy assumptions, but outliers were removed from the sample.
The proportion of variance explained by the regression analysis after removing outliers was 30.7%, a difference of 3.6%.
The answer to the question is true with caution.
A caution is added because of a violation of regression assumptions.
Slide 100
Impact of assumptions and outliers - 1
The following is a guide to the decision process for answering problems about the impact of assumptions and outliers on analysis:

- Is the dependent variable metric, and are the independent variables metric or dichotomous? If no, this is an inappropriate application of a statistic.
- If yes, run the baseline regression, using the method for including variables identified in the research question, and record R² for future reference.
- Is the ratio of cases to independent variables at least 5 to 1? If no, this is an inappropriate application of a statistic.
Slide 101
Impact of assumptions and outliers - 2
- Is the dependent variable normally distributed? If no, try: (1) a logarithmic transformation, (2) a square root transformation, (3) an inverse transformation. If unsuccessful, add a caution.
- Are the metric independent variables normally distributed and linearly related to the dependent variable? If no, try: (1) a logarithmic transformation, (2) a square root transformation, (3) a square transformation, (4) an inverse transformation. If unsuccessful, add a caution.
- Is the dependent variable homoscedastic for the categories of the dichotomous independent variables? If no, add a caution.
Slide 102
Impact of assumptions and outliers - 3
- Substituting any transformed variables, run the regression using direct entry to include all variables, in order to request the statistics for detecting outliers.
- Are there univariate outliers on the dependent variable or multivariate outliers on the independent variables? If yes, remove the outliers from the data.
- Is the ratio of cases to independent variables still at least 5 to 1? If no, this is an inappropriate application of a statistic.
- If yes, run the regression again, using the transformed variables and eliminating the outliers.
Slide 103
Impact of assumptions and outliers - 4
- Is the probability of the ANOVA test of the regression less than or equal to the level of significance? If no, the answer is false.
- Is the stated increase in R² correct? If no, the answer is false.
- Does the sample satisfy the preferred ratio of cases to independent variables of 15 to 1 (for stepwise entry, 50 to 1)? If no, the answer is true with caution.
Slide 104
Impact of assumptions and outliers - 5
- Were other cautions added for ordinal variables or for violations of assumptions? If yes, the answer is true with caution; if no, the answer is true.