Date post: | 24-Nov-2014 |
Upload: | fabrizio-palmucci |
1
Regression Analysis with SPSS
Robert A. Yaffee, Ph.D.
Statistics, Mapping and Social Science Group
Academic Computing Services
Information Technology Services
New York University
Office: 75 Third Ave Level C3
Tel: 212.998.3402
E-mail: [email protected]
February 04
2
Outline
1. Conceptualization
2. Schematic Diagrams of Linear Regression processes
3. Using SPSS, we plot and test relationships for linearity
4. Nonlinear relationships are transformed to linear ones
5. General Linear Model
6. Derivation of Sums of Squares and ANOVA. Derivation of intercept and regression coefficients
7. The Prediction Interval and its derivation
8. Model Assumptions
   1. Explanation
   2. Testing
   3. Assessment
9. Alternatives when assumptions are unfulfilled
3
Conceptualization of Regression Analysis
• Hypothesis testing
• Path Analytical Decomposition of effects
4
Hypothesis Testing
• For example: hypothesis 1: X is statistically significantly related to Y.
– The relationship is positive (as X increases, Y increases) or negative (as X increases, Y decreases).
– The magnitude of the relationship is small, medium, or large. If the magnitude is small, then a unit change in X is associated with a small change in Y.
5
Regression Analysis
Have a clear notion of what you can and cannot do with regression analysis
• Conceptualization
– A Path Model of a Regression Analysis
[Figure: Path Diagram of a Linear Regression Analysis. Nodes X1, X2, X3, and an error term, each with a path to Y.]
$$Y_i = k + b_1 x_1 + b_2 x_2 + b_3 x_3 + e_i$$
6
A Path Analysis
Decomposition of Effects into Direct, Indirect, Spurious, and Total Effects
[Figure: path diagram with nodes X1, X2, Y1, Y2, and Y3, paths labeled A through F, and error terms on the endogenous variables.]
Direct Effects: Paths C, E, F
Indirect Effects: Paths AC, BE, DF
Total Effects: Sum of Direct and Indirect Effects
Spurious effects are due to common (antecedent) causes
In a path analysis, Yi is endogenous. It is the outcome of several paths.
Direct effects on Y3: C, E, F
Indirect effects on Y3: BF, BDF
Total Effects = Direct + Indirect effects
7
Interaction Analysis
[Figure: path diagram in which X1, X2, and their product X1*X2 point to Y along paths A, B, and C.]
$$Y = K + A X_1 + B X_2 + C X_1 X_2$$
Interaction coefficient: C
X1 and X2 must be in model for interaction to be properly specified.
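As a hedged illustration of what the interaction coefficient C measures, the Python sketch below assumes a hypothetical 2x2 design in which X1 and X2 are 0/1 dummies. The four cell means of Y are made-up numbers, and the function name is ours. In that saturated case the least squares coefficients reduce to differences of cell means, and C is a difference of differences.

```python
def interaction_coefficients(y00, y10, y01, y11):
    """Coefficients of Y = K + A*X1 + B*X2 + C*X1*X2 for 0/1 predictors.

    yab is the mean of Y in the cell X1=a, X2=b (hypothetical inputs).
    """
    k = y00                          # intercept: both dummies are 0
    a = y10 - y00                    # effect of X1 when X2 = 0
    b = y01 - y00                    # effect of X2 when X1 = 0
    c = (y11 - y01) - (y10 - y00)    # how the X1 effect changes with X2
    return k, a, b, c


k, a, b, c = interaction_coefficients(y00=10.0, y10=12.0, y01=13.0, y11=18.0)
print(k, a, b, c)
```

If C were 0, the effect of X1 would be the same at both levels of X2, and the main-effects model would suffice.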
8
A Precursor to Modeling with Regression
• Data Exploration: Run a scatterplot matrix and search for linear relationships with the dependent variable.
9
Click on graphs and then on scatter
10
When the scatterplot dialog box appears, select
Matrix
11
A Matrix of Scatterplots will appear
Search for distinct linear relationships
12
13
14
Decomposition of the Sums of Squares
15
Graphical Decomposition of Effects
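[Figure: Decomposition of Effects. Plot of Y against X with the fitted line $\hat{y} = a + bx$ and the mean $\bar{y}$. For a case $i$: $y_i - \hat{y}_i$ = error, $\hat{y}_i - \bar{y}$ = regression effect, $y_i - \bar{y}$ = total effect.]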
16
Decomposition of the sum of squares
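Per case i, the total effect is the sum of the error effect and the regression model effect:
$$(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y}) \quad \text{per case } i$$
total effect = error effect + regression model effect
Squaring and summing over the data set (the cross-product term sums to zero under least squares) gives:
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \quad \text{for data set}$$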
17
Decomposition of the sum of squares
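• Total SS = model SS + error SS
• The degrees of freedom decompose the same way: (n - 1) = (k - 1) + (n - k), where k is the number of estimated parameters (intercept included).
• If we divide each sum of squares by its df, this yields the Variance Decomposition: the total variance, the model variance, and the error variance.
$$\text{total variance} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}, \qquad \text{error variance} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-k}, \qquad \text{model variance} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{k-1}$$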
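The decomposition can be checked numerically. The sketch below is a minimal illustration on made-up data (the x and y values are ours, not from the slides): it fits a bivariate least squares line and confirms that the total sum of squares equals the model plus error sums of squares, then forms the mean squares.

```python
# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept for the bivariate case.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)

k = 2  # parameters estimated: intercept and slope
ms_model = ss_model / (k - 1)
ms_error = ss_error / (n - k)

print(ss_total, ss_model + ss_error)
print(ms_model, ms_error)
```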
18
F test for significance and R2 for magnitude of effect
• R2 = Model var/total var
• F test for model significance = Model Var/Error Var
$$F_{(k-1,\,n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2/(k-1)}{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2/(n-k)}$$
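A small sketch of the two ways to reach the same F, assuming hypothetical sums of squares (3.6 for the model, 2.4 for error) from a bivariate fit on five cases. Here R2 is computed as model SS over total SS, and k counts all estimated parameters, intercept included; the function names are ours.

```python
def r_squared(ss_model, ss_total):
    """Proportion of the total sum of squares explained by the model."""
    return ss_model / ss_total


def f_statistic(r2, n, k):
    """F with (k - 1, n - k) df; k counts all parameters, intercept included."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))


# Hypothetical values, chosen only for illustration.
ss_model, ss_error, n, k = 3.6, 2.4, 5, 2

r2 = r_squared(ss_model, ss_model + ss_error)
f_from_r2 = f_statistic(r2, n, k)
f_from_ms = (ss_model / (k - 1)) / (ss_error / (n - k))  # model var / error var

print(r2, f_from_r2, f_from_ms)
```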
19
ANOVA tests the significance of the Regression Model
20
The Multiple Regression Equation
$$Y_i = a + b_1 x_1 + b_2 x_2 + e_i$$
• We proceed to the derivation of its components:
– The intercept: a
– The regression parameters, b1 and b2
21
Derivation of the Intercept
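$$y_i = a + b x_i + e_i$$
$$e_i = y_i - a - b x_i$$
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a - b \sum_{i=1}^{n} x_i$$
Because by definition $\sum_{i=1}^{n} e_i = 0$,
$$0 = \sum_{i=1}^{n} y_i - na - b \sum_{i=1}^{n} x_i$$
$$na = \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i$$
$$a = \bar{y} - b\bar{x}$$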
22
Derivation of the Regression Coefficient
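Given:
$$y_i = a + b x_i + e_i$$
$$e_i = y_i - a - b x_i$$
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a - b x_i)$$
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
With $x_i$ and $y_i$ measured as deviations from their means (so that $\sum x_i = 0$ and the term in $a$ drops out), we differentiate with respect to $b$:
$$\frac{\partial \sum_{i=1}^{n} e_i^2}{\partial b} = -2 \sum_{i=1}^{n} x_i y_i + 2b \sum_{i=1}^{n} x_i x_i$$
Setting the derivative to zero:
$$0 = -2 \sum_{i=1}^{n} x_i y_i + 2b \sum_{i=1}^{n} x_i^2$$
$$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$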
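The two derivations translate directly into code. This is a minimal sketch on made-up data (the function name `fit_line` is ours): the slope is the sum of cross-products of deviations over the sum of squared x deviations, and the intercept follows from the means.

```python
def fit_line(x, y):
    """Bivariate least squares: b = sum(xy) / sum(x^2) in deviation form, a = y_bar - b * x_bar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)

# The residuals of a least squares fit with an intercept sum to zero.
residual_sum = sum(yi - (a + b * xi) for xi, yi in zip(x, y))
print(a, b, residual_sum)
```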
23
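• If we recall that the formula for the correlation coefficient can be expressed as follows:
24
$$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}$$
where
$$x_i = X_i - \bar{X}, \qquad y_i = Y_i - \bar{Y}$$
from which it can be seen that the regression coefficient b is a function of r:
$$b_j = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
$$b_j = r \cdot \frac{sd_y}{sd_x}$$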
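A quick numerical check of the relation between b and r, on made-up data. `statistics.stdev` is the sample standard deviation from the Python standard library; the variable names are ours.

```python
import math
import statistics

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)                             # correlation coefficient
b_direct = sxy / sxx                                       # slope from the derivation
b_from_r = r * statistics.stdev(y) / statistics.stdev(x)   # slope as a function of r

print(r, b_direct, b_from_r)
```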
25
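Extending the bivariate case to the multiple linear regression case
26
$$b_{yx_1 . x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{1 - r^2_{x_1 x_2}} \cdot \frac{sd_y}{sd_{x_1}} \qquad (6)$$
$$b_{yx_2 . x_1} = \frac{r_{yx_2} - r_{yx_1}\, r_{x_2 x_1}}{1 - r^2_{x_1 x_2}} \cdot \frac{sd_y}{sd_{x_2}} \qquad (7)$$
It is also easy to extend the bivariate intercept to the multivariate case as follows.
$$a = \bar{Y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 \qquad (8)$$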
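Equations (6) to (8) can be tried out on a small example. The sketch below assumes made-up data with two predictors and hypothetical helper names (`dev`, `cross`, `corr`, `sd`). It computes the coefficients from the correlations and standard deviations, and for comparison solves the two normal equations directly by Cramer's rule.

```python
import math


def dev(v):
    """Deviations from the mean."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def cross(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def corr(u, v):
    du, dv = dev(u), dev(v)
    return cross(du, dv) / math.sqrt(cross(du, du) * cross(dv, dv))


def sd(v):
    d = dev(v)
    return math.sqrt(cross(d, d) / (len(v) - 1))


# Made-up data for illustration only.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [2, 4, 5, 4, 5]

r_y1, r_y2, r_12 = corr(y, x1), corr(y, x2), corr(x1, x2)
b1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2) * sd(y) / sd(x1)   # equation (6)
b2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2) * sd(y) / sd(x2)   # equation (7)
a = sum(y) / len(y) - b1 * sum(x1) / len(x1) - b2 * sum(x2) / len(x2)  # equation (8)

# Cross-check: solve the normal equations in deviation form by Cramer's rule.
d1, d2, dy = dev(x1), dev(x2), dev(y)
s11, s22, s12 = cross(d1, d1), cross(d2, d2), cross(d1, d2)
s1y, s2y = cross(d1, dy), cross(d2, dy)
det = s11 * s22 - s12 ** 2
b1_check = (s1y * s22 - s12 * s2y) / det
b2_check = (s11 * s2y - s12 * s1y) / det

print(b1, b2, a)
```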
27
Significance Tests for the Regression Coefficients
1. We find the significance of the parameter estimates by using the F or t test.
2. The R2 is the proportion of variance explained.
3. The adjusted R2 is
$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{(n-1)}{(n-p-1)}$$
where n = sample size and p = number of parameters in model.
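As a small worked example, assuming hypothetical values R2 = 0.6, n = 5, and p = 1 (the function name is ours):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical inputs for illustration only.
adj = adjusted_r2(r2=0.6, n=5, p=1)
print(adj)
```

With so few cases the penalty is large; as n grows relative to p, the adjusted value approaches the unadjusted R2.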
28
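F and T tests for significance for overall model
$$F = \frac{\text{Model variance}}{\text{error variance}} = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}$$
where p = number of parameters and n = sample size.
In the bivariate case:
$$t = \sqrt{F} = \sqrt{\frac{(n-2)\, r^2}{1 - r^2}}$$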
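A brief check that t squared equals F when there is a single predictor, assuming hypothetical values r2 = 0.6 and n = 5 (the function names are ours):

```python
import math


def f_from_r2(r2, n, p):
    """F = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


def t_from_r(r, n):
    """Bivariate t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical inputs for illustration only.
r, n = math.sqrt(0.6), 5
t = t_from_r(r, n)
f = f_from_r2(r ** 2, n, p=1)
print(t, f)
```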
29
Significance tests
• If we are using a type II sum of squares, we are dealing with the ballantine. DV Variance explained = a + b
30
Significance tests
T tests for statistical significance
$$t = \frac{a - 0}{se_a}$$
$$t = \frac{b - 0}{se_b}$$
31
Significance tests
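Standard Error of intercept (bivariate case)
$$SE_a = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2} \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]}$$
Standard error of regression coefficient
$$SE_b = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n} x_i^2}}$$
where $\hat{\sigma}$ = std dev of residual, with $x_i$ in deviation form and
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2}$$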
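A minimal sketch of these standard errors and the resulting t tests for a bivariate fit, on made-up data. It uses the textbook bivariate forms: residual variance = sum of squared residuals / (n - 2), SE_b = sqrt(residual variance / sum of squared x deviations), and SE_a = sqrt(residual variance * (1/n + x_bar^2 / sum of squared x deviations)).

```python
import math

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sigma2 = sse / (n - 2)                                  # residual variance

se_b = math.sqrt(sigma2 / sxx)                          # SE of the slope
se_a = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))   # SE of the intercept

t_a = (a - 0) / se_a
t_b = (b - 0) / se_b
print(se_a, se_b, t_a, t_b)
```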
32
Programming Protocol
After invoking SPSS, proceed to File, Open, Data
33
Select a Data Set (we choose employee.sav)
and click on open
34
We open the data set
35
To inspect the variable formats, click on variable
view on the lower left
36
Because gender is a string variable, we need to
recode gender into a numeric format
37
We autorecode gender by clicking on transform and
then autorecode
38
We select gender and move it into the variable
box on the right
39
Give the variable a new name and click on add
new name
40
Click on ok and the numeric variable sex is
created
It has values 1 for female and 2 for male, and those value labels are inserted.
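The recoding step can be mimicked outside SPSS. The sketch below is not SPSS's own routine; it assumes the string codes are "f" and "m" (hypothetical sample values) and assigns consecutive integers in sorted order, keeping the original strings as value labels.

```python
def autorecode(values):
    """Map distinct string values to 1, 2, ... in ascending sorted order."""
    codes = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    recoded = [codes[v] for v in values]
    labels = {code: v for v, code in codes.items()}   # numeric code -> original string
    return recoded, labels


# Hypothetical sample of the string variable.
gender = ["m", "f", "f", "m"]
sex, value_labels = autorecode(gender)
print(sex, value_labels)
```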
41
To invoke Regression analysis,
Click on Analyze
42
Click on Regression and then linear
43
Select the dependent variable: Current Salary
44
Enter it in the dependent variable box
45
Entering independent variables
• These variables are entered in blocks. First, the potentially confounding covariates have to be entered.
• We enter time on job, beginning salary, and previous experience.
46
After entering the covariates, we click on
next
47
We now enter the hypotheses we wish to test
• We are testing for minority or sex differences in salary after controlling for the time on job, previous experience, and beginning salary.
• We enter minority and numeric gender (sex)
48
After entering these variables, click on
statistics
49
We select the following statistics from the dialog box and click on continue
50
Click on plots to obtain the plots dialog box
51
We click on OK to run the regression analysis
52
Navigation window (left) and output window (right)
This shows that SPSS is reading the variables correctly
53
Variables Entered and Model Summary
54
Omnibus ANOVA
Significance Tests for the Model at each stage of the
analysis
55
Full Model Coefficients
Dependent variable: CurSal (the signs of the terms are not shown)
Constant: 12036.3
BeginSal: 1.83
Jobtime: 165.17
Exper: 23.64
gender: 2882.84
Minority: 1419.7
56
We omit insignificant variables and rerun the analysis to obtain trimmed model coefficients
Dependent variable: CurSal (the signs of the terms are not shown)
Constant: 12126.5
BeginSal: 1.85
Jobtime: 163.20
Exper: 24.36
gender: 2694.30
57
Beta weights
• These are standardized regression coefficients used to compare the contribution to the explanation of the variance of the dependent variable within the model.
58
T tests and signif.
• These are the tests of significance for each parameter estimate.
• The significance levels have to be less than .05 for the parameter to be statistically significant.
59
Assumptions of the Linear Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
60
Explanation of the Assumptions
1. Linear Functional form
   1. Does not detect curvilinear relationships
2. Independent observations
   1. Representative samples
   2. Autocorrelation inflates the t, r, and F statistics and warps the significance tests
3. Normality of the residuals
   1. Permits proper significance testing
4. Equality of variance
   1. Heteroskedasticity precludes generalization and external validity
   2. This also warps the significance tests
5. Multicollinearity prevents proper parameter estimation. It may also preclude computation of the parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: If outliers have high influence and the sample is not large enough, then they may seriously bias the parameter estimates
61
Diagnostic Tests for the Regression Assumptions
1. Linearity tests: Regression curve fitting
   1. No level shifts: One regime
2. Independence of observations: Runs test
3. Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov Test
4. Homogeneity of variance of the residuals: White’s General Specification test
5. No autocorrelation of residuals: Durbin Watson or ACF or PACF of residuals
6. Multicollinearity: Correlation matrix of independent variables. Condition index or condition number
7. No serious outlier influence: tests of additive outliers: Pulse dummies.
   1. Plot residuals and look for high leverage of residuals
   2. Lists of Standardized residuals
   3. Lists of Studentized residuals
   4. Cook’s distance or leverage statistics
62
Explanation of Diagnostics
1. Plots show linearity or nonlinearity of relationship
2. Correlation matrix shows whether the independent variables are collinear and correlated.
3. A representative sample is obtained through probability sampling
63
Explanation of Diagnostics
Tests for Normality of the residuals. The residuals are saved and then subjected to a normality test such as the Kolmogorov-Smirnov Test, which compares the theoretical cumulative normal distribution with the cumulative distribution of your residuals.
In SPSS: Nonparametric Tests, then 1 sample K-S test
64
Collinearity Diagnostics
$$\text{Tolerance} = 1 - R^2$$
Small tolerances imply problems
$$\text{Variance Inflation Factor (VIF)} = \frac{1}{\text{Tolerance}}$$
Small intercorrelations among indep vars means VIF ≈ 1
VIF > 10 signifies problems
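A minimal sketch of tolerance and VIF for the simplest case, two predictors, on made-up data. It assumes that the R2 in the tolerance formula comes from regressing one independent variable on the others; with only two predictors that R2 is the squared correlation between them. The function names are ours.

```python
import math


def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    suu = sum((ui - mu) ** 2 for ui in u)
    svv = sum((vi - mv) ** 2 for vi in v)
    return suv / math.sqrt(suu * svv)


def tolerance_and_vif(x1, x2):
    """Tolerance = 1 - R^2 and VIF = 1 / Tolerance for a two-predictor model."""
    r2 = corr(x1, x2) ** 2
    tol = 1 - r2
    return tol, 1 / tol


# Made-up predictors for illustration only.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
tol, vif = tolerance_and_vif(x1, x2)
print(tol, vif, vif > 10)
```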
65
More Collinearity Diagnostics
condition number = maximum eigenvalue/minimum eigenvalue.
If condition numbers are between 100 and 1000, there is moderate to strong collinearity
$$\text{condition index} = \sqrt{k}$$
where k = condition number
If Condition index > 30 then there is strong collinearity
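For intuition only, the sketch below works with the 2x2 correlation matrix of two standardized predictors, whose eigenvalues are 1 + |r| and 1 - |r|. This is a simplification: statistical packages typically compute these diagnostics from a scaled cross-products matrix that also includes the intercept. The correlation 0.8 is a made-up value and the function name is ours.

```python
import math


def condition_number_two_predictors(r):
    """Max/min eigenvalue of the correlation matrix [[1, r], [r, 1]]."""
    lam_max = 1 + abs(r)
    lam_min = 1 - abs(r)
    return lam_max / lam_min


r = 0.8                                   # hypothetical correlation between predictors
k = condition_number_two_predictors(r)    # condition number
index = math.sqrt(k)                      # condition index
print(k, index, index > 30)
```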
66
Outlier Diagnostics
1. Residuals.
   1. The actual value minus the predicted value. This is otherwise known as the error.
2. Studentized Residuals
   1. The residuals divided by their standard errors without the ith observation
3. Leverage, called the Hat diag
   1. This is the measure of influence of each observation
4. Cook’s Distance:
   1. The change in the statistics that results from deleting the observation. Watch this if it is much greater than 1.0.
67
Outlier detection
• Outlier detection involves determining whether the residual (error = actual – predicted) is an extreme negative or positive value.
• After running the regression, we may plot the residuals versus the fitted values to determine which errors are large.
68
Create Standardized Residuals
• A standardized residual is a residual divided by its standard deviation.
$$resid_{standardized} = \frac{y_i - \hat{y}_i}{s}$$
where s = std dev of residuals
69
Limits of Standardized Residuals
If the standardized residuals have values above 3.5 or below -3.5, they are outliers.
If the absolute values are less than 3.5, as these are, then there are no outliers
While outliers by themselves only distort mean prediction when the sample size is small enough, it is important to gauge the influence of outliers.
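A minimal sketch of the standardized residuals and the 3.5 limit, on made-up data. It assumes s is the residual standard deviation with n - 2 degrees of freedom (bivariate fit); other software may use a different divisor.

```python
import math

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]   # actual minus predicted
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))   # std dev of residuals

standardized = [e / s for e in residuals]
outliers = [i for i, z in enumerate(standardized) if abs(z) > 3.5]
print(standardized, outliers)
```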
70
Outlier Influence
• Suppose we had a different data set with two outliers.
• We tabulate the standardized residuals and obtain the following output:
71
Outlier a does not distort and outlier b does.
72
Studentized Residuals
• Alternatively, we could form studentized residuals. These are distributed as a t distribution with df=n-p-1, though they are not quite independent. Therefore, we can approximately determine if they are statistically significant or not.
• Belsley et al. (1980) recommended the use of studentized residuals.
73
Studentized Residual
$$e_i^{s} = \frac{e_i}{\sqrt{s_{(i)}^2\,(1 - h_i)}}$$
where
$e_i^{s}$ = studentized residual
$s_{(i)}$ = standard deviation where ith obs is deleted
$h_i$ = leverage statistic
These are useful in estimating the statistical significance of a particular observation, of which a dummy variable indicator is formed. The t value of the studentized residual will indicate whether or not that observation is a significant outlier.
The command (in Stata) to generate studentized residuals, called rstudt, is:
predict rstudt, rstudent
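To make the role of the deleted-case standard deviation concrete, the sketch below computes studentized residuals for a bivariate fit by literally refitting the line without each observation. The data are made up, the function names are ours, and the leverage uses the simple-regression form h_i = 1/n + (x_i - x_bar)^2 / sum of squared x deviations.

```python
import math


def fit(x, y):
    """Bivariate least squares; returns (intercept, slope)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b * xb, b


def studentized_residuals(x, y):
    n = len(x)
    a, b = fit(x, y)
    xb = sum(x) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    out = []
    for i in range(n):
        e_i = y[i] - (a + b * x[i])
        h_i = 1 / n + (x[i] - xb) ** 2 / sxx          # leverage of case i
        # Refit without observation i to get the deleted-case variance.
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        ad, bd = fit(xd, yd)
        sse_d = sum((yj - (ad + bd * xj)) ** 2 for xj, yj in zip(xd, yd))
        s2_del = sse_d / (len(xd) - 2)
        out.append(e_i / math.sqrt(s2_del * (1 - h_i)))
    return out


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
rstudent = studentized_residuals(x, y)
print(rstudent)
```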
74
Influence of Outliers
1. Leverage is measured by the diagonal components of the hat matrix.
2. The hat matrix comes from the formula for the regression of Y.
$$\hat{Y} = X(X'X)^{-1}X'Y$$
where $X(X'X)^{-1}X'$ = the hat matrix, $H$.
Therefore,
$$\hat{Y} = HY$$
75
Leverage and the Hat matrix
1. The hat matrix transforms Y into the predicted scores.
2. The diagonals of the hat matrix indicate which values will be outliers or not.
3. The diagonals are therefore measures of leverage.
4. Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has.
5. The trace of the hat matrix = the number of parameters in the model (the independent variables plus the intercept).
6. When the leverage > 2p/n then there is high leverage according to Belsley et al. (1980) cited in Long, J.F. Modern Methods of Data Analysis (p.262). For smaller samples, Velleman and Welsch (1981) suggested that 3p/n is the criterion.
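A small sketch of leverage for a bivariate regression, where the hat diagonals have the closed form h_i = 1/n + (x_i - x_bar)^2 / sum of squared x deviations. The x values are made up, with one deliberately extreme case, and p = 2 counts the intercept and the slope.

```python
def leverages(x):
    """Diagonal of the hat matrix for a regression of y on x with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Made-up predictor values; the last case lies far from the others.
x = [1, 2, 3, 4, 10]
h = leverages(x)

p = 2                          # parameters: intercept and slope
cutoff = 2 * p / len(x)        # the 2p/n rule
high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]

print(h, sum(h), high_leverage)
```

The leverages depend only on the predictor values, so a case can have high leverage without being an outlier in Y.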
76
Cook’s D
1. Another measure of influence.
2. This is a popular one. The formula for it is:
$$\text{Cook's } D_i = \frac{1}{p} \cdot \frac{h_i}{1 - h_i} \cdot \frac{e_i^2}{s^2 (1 - h_i)}$$
Cook and Weisberg (1982) suggested that values of D that exceed the 50th percentile of the F distribution (df = p, n-p) are large.
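The formula can be applied case by case. The sketch below uses made-up data and a bivariate fit, takes p = 2 (intercept and slope) and s^2 = SSE / (n - p), and flags cases with the rule-of-thumb cutoff 4/n; these choices are assumptions for illustration.

```python
# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, p = len(x), 2

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

e = [yi - (a + b * xi) for xi, yi in zip(x, y)]       # residuals
h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]     # leverages
s2 = sum(ei ** 2 for ei in e) / (n - p)               # residual variance

cooks_d = [
    (1 / p) * (hi / (1 - hi)) * (ei ** 2 / (s2 * (1 - hi)))
    for ei, hi in zip(e, h)
]
flagged = [i for i, d in enumerate(cooks_d) if d > 4 / n]
print(cooks_d, flagged)
```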
77
Using Cook’s D in SPSS
• Cook is the option /R
• Finding the influential outliers
• List cook, if cook > 4/n
• Belsley suggests 4/(n-k-1) as a cutoff
78
DFbeta
• One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate if that observation is deleted.
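DFbeta_ij is based on the change in the coefficient, $b_j - b_{(i)j}$ (the estimate of $b_j$ with and without observation i), scaled using $u_j$ and the leverage $h$,
where $u_j$ = residuals of the regression of $x_j$ on the remaining xs.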
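The idea behind DFbeta can be shown by brute force. The sketch below computes only the raw, unscaled change in the slope when each observation is deleted in turn from a bivariate fit on made-up data; the scaling used by statistical packages is deliberately left out, and the function names are ours.

```python
def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    return sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx


def raw_dfbeta(x, y):
    """b - b_(i): change in the slope when observation i is deleted."""
    b_full = slope(x, y)
    changes = []
    for i in range(len(x)):
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        changes.append(b_full - slope(xd, yd))
    return changes


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
dfb = raw_dfbeta(x, y)
print(dfb)
```

In this example the first case moves the slope most, while the middle case, sitting at the mean of x, does not move it at all.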
79
Programming Diagnostic Tests
Testing homoskedasticity
Select histogram, normal probability plot, and insert *zresid in Y and *zpred in X
Then click on continue
80
Click on Save to obtain the Save dialog box
81
We select the following
Then we click on continue, go back to the Main Regression Menu and click on OK
82
Check for linear Functional Form
• Run a matrix plot of the dependent variable against each independent variable to be sure that the relationship is linear.
83
Move the variables to be graphed into the box on the upper right, and
click on OK
84
Residual Autocorrelation check
Durbin Watson d tests first order autocorrelation of residuals
$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$
See significance tables for this statistic
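A minimal sketch of the Durbin Watson d, assuming the residuals are already saved in time order. The residual values are made up and the function name is ours.

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals in time order.
e = [-0.8, 0.6, 1.0, -0.6, -0.2]
d = durbin_watson(e)
print(d)
```

Values of d near 2 suggest little first order autocorrelation; values toward 0 point to positive and values toward 4 to negative autocorrelation.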
85
Run the autocorrelation function from the Trends Module for a better analysis
86
Testing for Homogeneity of variance
87
Normality of residuals can be visually inspected from the histogram with the superimposed normal curve.
Here we check the skewness for symmetry and the kurtosis for peakedness
88
Kolmogorov Smirnov Test: An objective test of normality
89
90
91
Multicollinearity test with the correlation matrix
92
93
94
Alternatives to Violations of Assumptions
• 1. Nonlinearity: Transform to linearity or run a nonlinear regression
• 2. Nonnormality: Run a least absolute deviations regression or a median regression (available in other packages) or generalized linear models (SPLUS glm, STATA glm, or SAS PROC MODEL or PROC GENMOD).
• 3. Heteroskedasticity: weighted least squares regression (SPSS) or White estimator (SAS, Stata, SPLUS). One can use a robust regression procedure (SAS, STATA, or SPLUS) to downweight outlier effects in the estimation.
• 4. Autocorrelation: Run AREG in SPSS Trends module or either Prais or Newey-West procedure in STATA.
• 5. Multicollinearity: components regression or ridge regression or proxy variables. 2sls in SPSS or ivreg in stata or SAS proc model or proc syslin.
95
Model Building Strategies
• Specific to General: Cohen and Cohen
• General to Specific: Hendry and Richard
• Extreme Bounds analysis: E. Leamer.
96
Nonparametric Alternatives
1. If there is nonlinearity, transform to linearity first.
2. If there is heteroskedasticity, use robust standard errors with STATA or SAS or SPLUS.
3. If there is non-normality, use quantile regression with bootstrapped standard errors in STATA or SPLUS.
4. If there is autocorrelation of residuals, use Newey-West autoregression or First order autocorrelation correction with Areg. If there is higher order autocorrelation, use Box Jenkins ARIMA modeling.