Autocorrelation in Regression Analysis
• Tests for Autocorrelation
• Examples
• Durbin-Watson Tests
• Modeling Autoregressive Relationships
What causes autocorrelation?
• Misspecification
• Data Manipulation
  – Before receipt
  – After receipt
• Event Inertia
• Spatial ordering
Checking for Autocorrelation
• Test: Durbin-Watson statistic:
  d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}, for n and K - 1 d.f.
The d-statistic is read against five zones:
• 0 to d-lower: positive autocorrelation (autocorrelation is clearly evident)
• d-lower to d-upper: zone of indecision (ambiguous – cannot rule out autocorrelation)
• d-upper to 4 − d-upper: autocorrelation is not evident
• 4 − d-upper to 4 − d-lower: zone of indecision (ambiguous – cannot rule out autocorrelation)
• 4 − d-lower to 4: negative autocorrelation (autocorrelation is clearly evident)
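The statistic itself is easy to compute by hand. Below is a minimal Python sketch of what Stata's dwstat computes from the OLS residuals; the residual series here are illustrative, not the lecture's ice-cream data.

```python
def durbin_watson(resid):
    """d = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2.
    d near 2: no first-order autocorrelation; near 0: positive; near 4: negative."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# A smoothly decaying (positively autocorrelated) series pushes d toward 0,
# while a sign-alternating (negatively autocorrelated) series pushes d toward 4.
smooth = [0.95 ** t for t in range(100)]
choppy = [(-1) ** t for t in range(100)]
print(durbin_watson(smooth), durbin_watson(choppy))
```

Successive residuals that barely change make the numerator tiny (d near 0), which is exactly what the d = .21 in the output below signals.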
Consider the following regression:
Because this is time-series data, we should consider the possibility of autocorrelation. To run the Durbin-Watson test, we first declare the data as time series with the tsset command; then we use the dwstat command.
Durbin-Watson d-statistic( 3, 328) = .2109072
      Source |       SS       df       MS              Number of obs =     328
-------------+------------------------------           F(  2,   325) =   52.63
       Model |  .354067287     2  .177033643           Prob > F      =  0.0000
    Residual |  1.09315071   325  .003363541           R-squared     =  0.2447
-------------+------------------------------           Adj R-squared =  0.2400
       Total |    1.447218   327  .004425743           Root MSE      =    .058

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ice |    .060075    .006827     8.80   0.000     .0466443    .0735056
    quantity |  -2.27e-06   2.91e-07    -7.79   0.000    -2.84e-06   -1.69e-06
       _cons |   .2783773   .0077177    36.07   0.000     .2631944    .2935602
------------------------------------------------------------------------------
Find the d-upper and d-lower
• Check a Durbin-Watson table for the values of d-upper and d-lower.
• http://hadm.sph.sc.edu/courses/J716/Dw.html
• For n = 20, k = 2, α = .05 the values are:
  – Lower = 1.643
  – Upper = 1.704
Durbin's alternative test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |      1292.509                1                    0.0000
---------------------------------------------------------------------------
 H0: no serial correlation
Alternatives to the d-statistic
• The d-statistic is not valid in models with a lagged dependent variable
  – In the case of a lagged LHS variable you must use the Durbin alternative test (the command is durbina in Stata)
• Also, the d-statistic tests only for first-order autocorrelation. In other instances you may use the Durbin alternative test
  – Why would you suspect other than 1st-order autocorrelation?
The Runs Test
• An alternative to the D-W test is a formalized examination of the signs of the residuals. We would expect the signs of the residuals to be random in the absence of autocorrelation.
• The first step is to estimate the model and predict the residuals.
Runs continued
• Next, order the signs of the residuals against time (or spatial ordering in the case of cross-sectional data) and see if there are excessive “runs” of positives or negatives. Alternatively, you can graph the residuals and look for the same trends.
Runs test continued
Where n = the number of observations, n_1 = the number of + signs, n_2 = the number of − signs, and k = the number of runs:

  E(k) = \frac{2 n_1 n_2}{n_1 + n_2} + 1

  \sigma_k^2 = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}

The final step is to use the expected mean and deviation in a standard t-test.
Stata does this automatically with the runtest command!
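A minimal Python sketch of the test statistic runtest reports, using the expected-runs and variance formulas for the runs test (function and variable names are mine; the residual series are illustrative):

```python
import math

def runs_z(resid):
    """Runs test: compare the observed number of sign runs k with its
    expected value E(k) and variance under the hypothesis of randomness."""
    signs = [e > 0 for e in resid if e != 0]  # drop exact zeros
    n1 = sum(signs)                           # number of + signs
    n2 = len(signs) - n1                      # number of - signs
    k = 1 + sum(signs[i] != signs[i - 1] for i in range(1, len(signs)))
    n = n1 + n2
    e_k = 2 * n1 * n2 / n + 1
    var_k = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / (n ** 2 * (n - 1))
    return (k - e_k) / math.sqrt(var_k)

# Long streaks of same-signed residuals (positive autocorrelation) produce
# far fewer runs than expected, so the statistic is strongly negative.
streaky = [1, 1, 1, 1, -1, -1, -1, -1] * 10
print(runs_z(streaky))
```

For the streaky series there are only 20 runs against an expected 41, so the statistic falls well below −2 and randomness is rejected.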
Visual diagnosis of autocorrelation (in a single series)
• A correlogram is a good tool to identify if a series is autocorrelated
[Correlogram: autocorrelations of price, lags 0–40, with Bartlett's formula for MA(q) 95% confidence bands]
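What the correlogram plots at each lag is the sample autocorrelation of the series. A bare-bones Python sketch (the AR(1) data here are simulated for illustration, not the price series):

```python
import random

def acf(x, lag):
    """Sample autocorrelation of series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    c_lag = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return c_lag / c0

# Simulate an AR(1) series with rho = 0.9: the autocorrelations start
# high and decay geometrically across lags, the classic correlogram
# signature of an autocorrelated series.
random.seed(1)
y = [0.0]
for _ in range(499):
    y.append(0.9 * y[-1] + random.gauss(0, 1))
print([round(acf(y, k), 2) for k in (1, 5, 10)])
```

A series with no autocorrelation would instead show lag correlations bouncing near zero, inside the confidence bands.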
Dealing with autocorrelation
• D-W is not appropriate for auto-regressive (AR) models, where:

  Y_t = b_0 + b_1 Y_{t-1} + b_2 X_t + \ldots

• In this case, we use the Durbin alternative test
• For AR models, we need to explicitly estimate the correlation between Y_t and Y_{t-1} as a model parameter
• Techniques:
  – AR1 models (closest to regression; 1st order only)
  – ARIMA (any order)
Dealing with Autocorrelation
• There are several approaches to resolving problems of autocorrelation:
  – Lagged dependent variables
  – Differencing the dependent variable
  – GLS
  – ARIMA
Lagged dependent variables
• The most common solution
  – Simply create a new variable that equals Y at t−1, and use it as a RHS variable
• To do this in Stata, simply use the generate command with the new variable equal to L.variable
  – gen lagy = L.y
  – gen laglagy = L2.y
• This correction should be based on a theoretical belief about the specification
• May cause more problems than it solves
• Also costs a degree of freedom (one lost observation)
  – There are several advanced techniques for dealing with this as well
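Stata's lag operator does the bookkeeping for you; the same bookkeeping by hand looks like this (a sketch on a plain Python list rather than a tsset dataset):

```python
def lag(series, k=1):
    """Like Stata's L.y / L2.y: shift the series back k periods.
    The first k entries are None -- these are the 'lost observations'."""
    return [None] * k + list(series[:-k])

y = [2.0, 2.1, 2.5, 2.4, 2.8]
print(lag(y))      # lagy:    [None, 2.0, 2.1, 2.5, 2.4]
print(lag(y, 2))   # laglagy: [None, None, 2.0, 2.1, 2.5]
```

The None entries make the degree-of-freedom cost visible: each lag order drops one usable observation from the front of the sample.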
Differencing
• Differencing is simply the act of subtracting the previous observation value from the current observation.
• To do this in Stata, again use the generate command with a capital D (instead of the L for lags)
– This process is effective; however, it is an EXPENSIVE correction
  – This technique “throws away” long-term trends
  – Assumes that rho = 1 exactly
  D.x = x_t - x_{t-1}
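By hand, differencing is one subtraction per observation (a sketch of what Stata's D. operator does internally):

```python
def first_difference(series):
    """Like Stata's D.x: x_t - x_{t-1}; the first observation is lost."""
    return [None] + [series[t] - series[t - 1] for t in range(1, len(series))]

x = [10, 12, 11, 15]
print(first_difference(x))  # [None, 2, -1, 4]
```

Note how any long-run level in x disappears: only period-to-period changes survive, which is the sense in which differencing "throws away" long-term trends.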
GLS and ARIMA
• GLS approaches use maximum likelihood to estimate rho and correct the model
  – These are good corrections, and can be replicated in OLS
• ARIMA is an acronym for Autoregressive Integrated Moving Average
  – This process is a univariate “filter” used to cleanse variables of a variety of pathologies before analysis
Corrections based on Rho
• There are several ways to estimate rho, the simplest being to calculate it from the residuals:
  \hat{\rho} = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}

We then estimate the regression by transforming the regressors so that:

  x_t^* = x_t - \hat{\rho} x_{t-1}   and   y_t^* = y_t - \hat{\rho} y_{t-1}

This gives the regression:

  y^* = \beta_0 (1 - \hat{\rho}) + \beta_1 x^* + \varepsilon
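Both steps, estimating rho from the residuals and quasi-differencing the variables, are short enough to sketch directly (function names are mine; the residuals are an illustrative geometric series, not model output):

```python
def estimate_rho(resid):
    """rho_hat = sum_{t>=2} e_t * e_{t-1} / sum_t e_t^2."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

def quasi_difference(series, rho):
    """x*_t = x_t - rho * x_{t-1}; the first observation is dropped."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]

# Residuals that decay like 0.9^t give rho_hat close to 0.9, and
# quasi-differencing with the true rho leaves essentially no remainder.
resid = [0.9 ** t for t in range(200)]
rho_hat = estimate_rho(resid)
print(round(rho_hat, 3))
```

Running the regression on the transformed x* and y* then removes the AR(1) component from the errors.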
High tech solutions
• Stata also offers the option of estimating the model with AR(1) errors (with multiple ways of estimating rho). There is also what is known as a Prais-Winsten regression, which generates a value for the lost observation
• For the truly adventurous, there is also the option of doing a full ARIMA model
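The Prais-Winsten trick for keeping the observation that plain quasi-differencing would drop is to rescale the first data point by sqrt(1 − rho²) so its error variance matches the transformed errors. A sketch (names mine):

```python
import math

def prais_winsten_transform(series, rho):
    """Quasi-difference the series but retain observation 1,
    rescaled as x*_1 = sqrt(1 - rho^2) * x_1."""
    first = math.sqrt(1 - rho ** 2) * series[0]
    rest = [series[t] - rho * series[t - 1] for t in range(1, len(series))]
    return [first] + rest

x = [1.0, 2.0, 3.0]
print(prais_winsten_transform(x, 0.5))
```

Unlike the Cochrane-Orcutt-style transform above, the output has the same length as the input, so no degree of freedom is lost.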
Prais-Winsten regression
Prais-Winsten AR(1) regression -- iterated estimates

      Source |       SS       df       MS              Number of obs =     328
-------------+------------------------------           F(  2,   325) =   15.39
       Model |  .012722308     2  .006361154           Prob > F      =  0.0000
    Residual |  .134323736   325  .000413304           R-squared     =  0.0865
-------------+------------------------------           Adj R-squared =  0.0809
       Total |  .147046044   327  .000449682           Root MSE      =  .02033

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ice |   .0098603   .0059994     1.64   0.101    -.0019422    .0216629
    quantity |  -1.11e-07   1.70e-07    -0.66   0.512    -4.45e-07    2.22e-07
       _cons |   .2517135   .0195727    12.86   0.000     .2132082    .2902188
-------------+----------------------------------------------------------------
         rho |   .9436986
------------------------------------------------------------------------------
Durbin-Watson statistic (original)    0.210907
Durbin-Watson statistic (transformed) 1.977062
ARIMA
• The ARIMA model allows us to test the hypothesis of autocorrelation and remove it from the data.
• This is an iterative process akin to the purging we did when creating the ystar variable.
The model
(The significant AR lag coefficient in the output below is the estimate of rho.)
ARIMA regression
Sample: 1 to 328                               Number of obs   =      328
                                               Wald chi2(1)    =  3804.80
Log likelihood = 811.6018                      Prob > chi2     =   0.0000

------------------------------------------------------------------------------
             |                 OPG
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
       _cons |   .2558135   .0207937    12.30   0.000     .2150587    .2965683
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .9567067     .01551    61.68   0.000     .9263076    .9871058
-------------+----------------------------------------------------------------
      /sigma |   .0203009    .000342    59.35   0.000     .0196305    .0209713
------------------------------------------------------------------------------
The residuals of the ARIMA model
There are a few significant lags farther back. Generally we should expect some, but this pattern is probably an indicator of a seasonal trend (well beyond the scope of this lecture)!
[Correlogram: autocorrelations of the ARIMA residuals e, lags 0–40, with Bartlett's formula for MA(q) 95% confidence bands]
ARIMA with a covariate
ARIMA regression

Sample: 1 to 328                               Number of obs   =      328
                                               Wald chi2(3)    =  3569.57
Log likelihood = 812.9607                      Prob > chi2     =   0.0000

------------------------------------------------------------------------------
             |                 OPG
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
         ice |   .0095013   .0064945     1.46   0.143    -.0032276    .0222303
    quantity |  -1.04e-07   1.22e-07    -0.85   0.393    -3.43e-07    1.35e-07
       _cons |   .2531552   .0220777    11.47   0.000     .2098838    .2964267
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .9542692     .01628    58.62   0.000     .9223611    .9861773
-------------+----------------------------------------------------------------
      /sigma |   .0202185   .0003471    58.25   0.000     .0195382    .0208988
------------------------------------------------------------------------------
Final thoughts
• Each correction has a “best” application.
  – If we wanted to evaluate a mean shift (a dummy-variable-only model), calculating rho would not be a good choice; there we would want to use the lagged dependent variable.
  – Also, where we want to test the effect of inertia, it is probably better to use the lag.
Final Thoughts Continued
– In small N, calculating rho tends to be more accurate.
– ARIMA is one of the best options; however, it is very complicated!
– When dealing with time, the number of time periods and the spacing of the observations are VERY IMPORTANT!
– When using estimates of rho, a good rule of thumb is to make sure you have 25–30 time points at a minimum, and more if the observations are too closely spaced for the process you are observing!
Next Time:
• Review for Exam
  – Plenary Session
• Exam Posting
  – Available after class Wednesday