Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation
Standard error/deviation is the
square root of the variance
$S_x^2 = \frac{SS_x}{n-1} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}, \qquad S_x = \sqrt{S_x^2}$
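As a quick check of the variance and standard deviation formulas, here is a minimal numpy sketch; the sample `x` is made up purely for illustration.

```python
import numpy as np

# Made-up sample, used only to illustrate the formulas.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

ss_x = np.sum((x - x.mean()) ** 2)   # sum of squared deviations, SS_x
var_x = ss_x / (n - 1)               # sample variance, S_x^2
sd_x = np.sqrt(var_x)                # sample standard deviation, S_x

# numpy's ddof=1 option uses the same n - 1 divisor.
assert np.isclose(var_x, np.var(x, ddof=1))
assert np.isclose(sd_x, np.std(x, ddof=1))
```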
(It is very close to a typical departure of x from the mean: 'standard' = 'typical'; 'deviation/error' = departure from the mean.)
$S_x \approx \frac{\sum_{i=1}^{n}|x_i-\bar{x}|}{n}$
Fill In the Signs of Deviations from Means for Different Quadrants
(Scatterplot split at $\bar{x}$ and $\bar{y}$ into quadrants I–IV.)
Quadrant I: $x_i-\bar{x}>0$, $y_i-\bar{y}>0$
Quadrant II: $x_i-\bar{x}<0$, $y_i-\bar{y}>0$
Quadrant III: $x_i-\bar{x}<0$, $y_i-\bar{y}<0$
Quadrant IV: $x_i-\bar{x}>0$, $y_i-\bar{y}<0$
The Products are Positive in I and III
Quadrants I and III: $(x_i-\bar{x})(y_i-\bar{y})>0$
(In I both deviations are positive; in III both are negative, so the product is positive.)
The Products are Negative in II and IV
Quadrants II and IV: $(x_i-\bar{x})(y_i-\bar{y})<0$
(One deviation is positive and the other negative, so the product is negative.)
Sample Covariance, Sxy, describes
the Relationship between X and Y
$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$
If Sxy > 0 most data lies in I and III:
This concurs with our visual common sense because
it looks like a positive relationship
If Sxy < 0 most data lies in II and IV
This concurs with our visual common sense because
it looks like a negative relationship
If Sxy = 0 the data is 'evenly spread' across I–IV
Correlation, rXY, is a Measure of
Relationship that is Unit-less
It can be proved that it lies between -1 and 1.
It has the same sign as SXY so ….
$r_{XY} = \frac{S_{XY}}{S_X S_Y}, \qquad -1 \le r_{XY} \le 1$
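A small numpy sketch of the covariance and correlation formulas above; the `x`, `y` sample is invented for illustration.

```python
import numpy as np

# Invented sample for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(x)

s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # sample covariance
r_xy = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))      # unit-free correlation

assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])   # matches numpy's built-in
assert -1.0 <= r_xy <= 1.0
```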
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes?
What is a good guess of Y if X =25?
What does correlation = -.2264 mean anyway?
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation
What is Simple Regression?
Simple regression allows us to answer all three questions:
“How much does Y change when X changes?”
“What is a good guess of Y if X =25?”
“What does correlation = -.2264 mean anyway?”
…by fitting a straight line to data
on two variables, Y and X.
$\hat{Y} = b_0 + b_1 X$
We Get our Guessed Line Using '(Ordinary) Least Squares' [OLS]
OLS minimises the squared difference between a
regression line and the observations.
We can view these squared differences as squares.
This task then becomes the minimisation of the
area of the squares.
Applet: http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html
Measures of Fit
(Section 4.3)
The regression R2 can be seen from the applet
http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html
It is the proportional reduction in the sum of squares as one moves from modeling Y by a constant (with LS estimator being a sample mean, and sum of squares equal to the 'total sum of squares') to a line: R2 = [TSS – 'sum of squares'] / TSS
If the model fits perfectly, 'sum of squares' = 0 and R2 = 1
If the model does no better than a constant, 'sum of squares' = TSS and R2 = 0
The standard error of the regression (SER) measures the
magnitude of a typical regression residual in the units of Y.
The Standard Error of the
Regression (SER)
The SER measures the spread of the distribution of u. The SER
is (almost) the sample standard deviation of the OLS residuals:
$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(\hat{u}_i-\bar{\hat{u}})^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2}$
(the second equality holds because $\bar{\hat{u}} = \frac{1}{n}\sum_{i=1}^{n}\hat{u}_i = 0$).
$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2}$
The SER:
has the units of u, which are the units of Y
measures the average “size” of the OLS residual (the average
“mistake” made by the OLS regression line)
Don't worry about the n-2 (instead of n-1 or n): the reason is too technical, and it doesn't matter if n is large.
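The n-2 formula can be sketched in numpy; the data is made up, and `u_hat` holds the OLS residuals.

```python
import numpy as np

# Made-up data for the sketch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)        # OLS residuals

assert np.isclose(u_hat.sum(), 0.0)            # residuals sum (and average) to zero
ser = np.sqrt(np.sum(u_hat ** 2) / (n - 2))    # note the n - 2 divisor
```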
The OLS Line has a Small Negative Slope
Estimated slope = $\hat{\beta}_1$ = –2.28
Estimated intercept = $\hat{\beta}_0$ = 698.9
Estimated regression line: TestScore = 698.9 – 2.28 STR
Interpretation of the estimated slope and intercept
Test score= 698.9 – 2.28 STR
Districts with one more student per teacher on average have
test scores that are 2.28 points lower.
That is, ΔTest score / ΔSTR = –2.28
The intercept (taken literally) means that, according to this
estimated line, districts with zero students per teacher would
have a (predicted) test score of 698.9.
This interpretation of the intercept makes no sense – it
extrapolates the line outside the range of the data – here, the
intercept is not economically meaningful.
Remember Calculus? Test score = 698.9 – 2.28 STR
Differentiation gives d(Test score)/d(STR) = –2.28
'd' means 'infinitely small change', but for a 'very small change' called 'Δ' it will still be pretty close to the truth. So, an approximation is:
ΔTest score / ΔSTR = –2.28
How to interpret this? Take the denominator over to the other side:
ΔTest score = –2.28 ΔSTR
So, if STR goes up by one, Test score falls by 2.28.
If STR goes up by, say, 20, Test score falls by 2.28 × 20 = 45.6.
Predicted values & residuals:
One of the districts in the data set is Antelope, CA, for which
STR = 19.33 and Test Score = 657.8
predicted value: $\hat{Y}_{Antelope}$ = 698.9 – 2.28 × 19.33 = 654.8
residual: $\hat{u}_{Antelope}$ = 657.8 – 654.8 = 3.0
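The Antelope arithmetic above can be reproduced in a few lines of Python, using the estimated intercept and slope from the text.

```python
b0, b1 = 698.9, -2.28        # estimated intercept and slope from the text
str_antelope = 19.33
score_antelope = 657.8

y_hat = b0 + b1 * str_antelope          # predicted test score
residual = score_antelope - y_hat       # actual minus predicted

assert round(y_hat, 1) == 654.8
assert round(residual, 1) == 3.0
```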
R2 and SER evaluate the Model
TestScore = 698.9 – 2.28 STR, R2 = .05, SER = 18.6
By using STR, you only reduce the sum of squares by 5% compared with just 'modeling' Test score by its average. That is, STR only explains a small fraction of the variation in test scores.
The typical residual is about 19 points, which looks large.
Seeing R2 and the SER in EVIEWs Dependent Variable: TESTSCR
Method: Least Squares
Sample: 1 420
Included observations: 420
TESTSCR=C(1)+C(2)*STR
Coefficient Std. Error t-Statistic Prob.
C(1) 698.9330 9.467491 73.82451 0.0000
C(2) -2.279808 0.479826 -4.751327 0.0000
R-squared 0.051240 Mean dependent var 654.1565
Adjusted R-squared 0.048970 S.D. dependent var 19.05335
S.E. of regression 18.58097 Akaike info criterion 8.686903
Sum squared resid 144315.5 Schwarz criterion 8.706143
Log likelihood -1822.250 Durbin-Watson stat 0.129062
Fitted line: TestScore = 698.9 – 2.28 STR
Recap
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes?
What is a good guess of Y if X =25?
What does correlation = -.2264 mean anyway?
Recap
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes? b1·Δx
What is a good guess of Y if X = 25? b0 + b1(25)
What does correlation = -.2264 mean anyway?
Outline
1. Scatterplots are Pictures of 1→1 association
2. Correlation gives a number for 1→1 association
3. Simple Regression is Better than Correlation
But…
How much does Y change when X changes? b1·Δx
What is a good guess of Y if X = 25? b0 + b1(25)
What does correlation = -.2264 mean anyway?
Surprise: R2 = rXY2
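The "surprise" identity R2 = rXY2 can be checked numerically. A minimal sketch with invented data; `b1` and `b0` are the OLS slope and intercept.

```python
import numpy as np

# Invented data; any (x, y) sample shows the same identity.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Simple OLS slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r_squared = 1.0 - ssr / tss

r_xy = np.corrcoef(x, y)[0, 1]
assert np.isclose(r_squared, r_xy ** 2)   # R2 equals the squared correlation
```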
What is Simple Regression?
We've used Simple regression as a means of
describing an apparent relationship between two
variables. This is called descriptive statistics.
Simple regression also allows us to estimate, and
make inferences, under the OLS assumptions,
about the slope coefficients of an underlying
model. We do this, as before, by fitting a straight
line to data on two variables, Y and X. This is
called inferential statistics.
The Underlying Model
(or 'Population Regression Function')
Yi = β0 + β1Xi + ui, i = 1,…, n
X is the independent variable or regressor
Y is the dependent variable
β0 = intercept
β1 = slope
ui = the regression error
The regression error consists of omitted factors, or possibly measurement error in the measurement of Y. In general, these omitted factors are other factors, other than the variable X, that influence Y.
What Does it Look Like in This Case?
Yi = β0 + β1Xi + ui, i = 1,…, n
X is the STR
Y is the Test score
β0 = intercept
β1 = ΔTest score / ΔSTR = change in test score for a unit change in STR
If we also guess β0 we can also predict Test score when STR has a particular value.
Clearly, we want good guesses (estimates) of β0 and β1.
From Now on we Use 'b' or 'β̂' to Signify our Estimates of the Slope or Intercept, and 'û' for Guesses of u. We Never See the True Line or the u's.
(Figure: the fitted line b0 + b1x, with residuals û1 and û2 marked against the data points.)
Our Estimators are Really Random
Least squares estimators have a distribution; they are different every time you take a different sample (like an average of 5 heights, or 7 exam marks).
The estimators are Random Variables. A random variable generates numbers with a central measure called a mean and a volatility called the standard error.
Least squares estimators b0 & b1 have means β0 & β1.
Hypothesis testing:
E.g. How do we test if the slope is zero, or -37?
Confidence intervals:
E.g. What is a reasonable range of guesses for the slope β1?
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals
Outline
1. OLS Assumptions (Very Technical)
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals
Outline
1. OLS Assumptions (When will OLS be ‘good’?)
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals
Estimator Distributions Depend on
Least Squares Assumptions
A key part of the model is the assumptions made
about the residuals ut for t=1,2….n.
1. E(ut) = 0
2. E(ut²) = σ² = SER² (note: σ², not σt²: the variance is invariant over t)
3. E(ut·us) = 0 for t ≠ s
4. Cov(Xt, ut) = 0
5. ut ~ Normal
SW has different assumptions;
Use mine for any Discussions
The conditional distribution of u given X has
mean zero, that is, E(u|X = x) = 0. (a combination
of 1. and 4.)
(Xi,Yi), i =1,…,n, are i.i.d. (unnecessary in many
applications)
Large outliers in X and/or Y are rare. (technical
assumption)
How reasonable
are these assumptions?
To answer, we need to understand them.
1. E(ut) = 0
2. E(ut²) = σ² = SER² (note: σ², not σt²: invariant over t)
3. E(ut·us) = 0 for t ≠ s
4. Cov(Xt, ut) = 0
5. ut ~ Normal
It's All About u
u is everything left out of the model
E(ut) = 0
1. E(ut²) = σ² = SER² (note: σ², not σt²: invariant over t)
2. E(ut·us) = 0 for t ≠ s
3. Cov(Xt, ut) = 0
4. ut ~ Normal
1. E(ut)=0 is not a big deal
Providing the model has a constant, this is not a restrictive assumption.
If 'all the other influences' don't have a zero mean, the estimated constant will just adjust to the point where u does have a zero mean.
Really, β0 + u could be thought of as everything else that affects y apart from x.
2. E(ut²) = σ² = SER² is Controversial
If this assumption holds, the errors are said to be
homoskedastic
If it is violated, the errors are said to be
heteroskedastic (hetero for short)
There are many conceivable forms of hetero, but
perhaps the most common is when the variance
depends upon the value of x
3. E(ut·us) = 0 for t ≠ s
A violation of this is called autocorrelation.
If the underlying model generates data for a time series, it is highly likely that 'left out' variables will be autocorrelated (i.e. z depends on lagged z; most time series are like this), and so u will be too.
But if the model describes a cross-section, assumption 3 is likely to hold.
Aside: Hetero and Auto are not a Disaster
Hetero plagues cross-sectional data; Auto plagues time series.
Remarkably, Heteroskedasticity and
Autocorrelation do not bias the Least Squares
Estimators.
This is a very strange result!
If You Could See the True Line You'd Realize Hetero is Bad for OLS
(Figure: scatterplots of y against x under (a) homoskedasticity and (b) heteroskedasticity.)
But OLS is Still Unbiased!
In case (b), OLS is still unbiased because the next draw is just as likely to find the third error above the true line, pulling up the (negative) slope of the least squares line. On average, the true line would be revealed with many samples.
But we will make an adjustment to our analysis later: OLS is no longer 'best', which means minimum variance.
Conquer Hetero and Auto with Just
One Click
SW recommend you correct standard errors for hetero and auto. In EVIEWs you do this by:
estimate/options/heteroskedasticity consistent coefficient covariance/ input: leave white ticked if only worried about hetero. Tick Newey West to correct for both.
Because OLS is unbiased, the correction only occurs for the standard errors.
Sometimes, we will use standard errors corrected in this way
4. Cov(Xt, ut)=0
This will be discussed extensively next lecture
When there is only one variable in a regression, it is highly likely that that variable will be correlated with a variable that is left out of the model, which is implicitly in the error term.
Before proceeding with assumption 5, it is worth stating that 1.–4. are all that are required to prove the so-called Gauss-Markov Theorem: OLS is the Best Linear Unbiased Estimator (SW Section 5.5)
5. ut~Normal
With this assumption, OLS is minimum volatility
estimator among all consistent estimators.
Many variables are Normal
http://rba.gov.au/Statistics/AlphaListing/index.html
The assumption 'delivers' a known distribution of the OLS estimators (a 't' distribution) if n is small. But if n is large (>30) the OLS estimators become Normal, so it is unnecessary. This is due to the Central Limit Theorem.
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
http://www.rand.org/statistics/applets/clt.html
Assessment of Assumptions
1. E(ut) = 0: harmless if the model has a constant
2. E(ut²) = σ² = SER²: not too serious, since
3. E(ut·us) = 0, t ≠ s: OLS is still unbiased
4. Cov(Xt, ut) = 0: serious (see next lecture)
5. ut ~ Normal: a nice property to have, but if the sample size is big it doesn't matter
We assume 2. and 3. hold, or just adjust the standard errors. We'll also assume n is large and we'll always keep a constant, so 5. and 1. are not relevant. This lecture, we assume 4. holds.
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals
EViews output gives us SE(β̂1)
Dependent Variable: TESTSCR
Method: Least Squares
Date: 06/04/08 Time: 22:13
Sample: 1 420
Included observations: 420 Coefficient Std. Error t-Statistic Prob.
C 698.9330 9.467491 73.82451 0.0000
STR -2.279808 0.479826 -4.751327 0.0000
R-squared 0.051240 Mean dependent var 654.1565
Adjusted R-squared 0.048970 S.D. dependent var 19.05335
S.E. of regression 18.58097 Akaike info criterion 8.686903
Sum squared resid 144315.5 Schwarz criterion 8.706143
Log likelihood -1822.250 Hannan-Quinn criter. 8.694507
F-statistic 22.57511 Durbin-Watson stat 0.129062
Prob(F-statistic) 0.000003
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals
EVIEWs Output Can be Summarized
in Two Lines
Put standard errors in parentheses below the estimated
coefficients to which they apply.
TestScore = 698.9 – 2.28 STR, R2 = .05, SER = 18.6
(10.4) (0.52)
This expression gives a lot of information
The estimated regression line is
TestScore = 698.9 – 2.28 STR
The standard error of β̂0 is 10.4 (EViews reports 9.4675 without the heteroskedasticity correction)
The standard error of β̂1 is 0.52 (EViews reports 0.4798)
The R2 is .05; the standard error of the regression is 18.6
We Only Need Two Numbers For
Hypothesis Testing
Put standard errors in parentheses below the estimated
coefficients to which they apply.
TestScore = 698.9 – 2.28 STR, R2 = .05, SER = 18.6
(10.4) (0.52)
This expression gives a lot of information
The estimated regression line is TestScore = 698.9 – 2.28 STR
The standard error of β̂0 is 10.4
The standard error of β̂1 is 0.52
The R2 is .05; the standard error of the regression is 18.6
Remember Hypothesis Testing?
1. H0 = null hypothesis = 'status quo' belief = what you believe without good reason to doubt it.
2. H1 = alternative hypothesis = what you believe if you reject H0.
3. Collect evidence and compute a calculated Test Statistic.
4. Decide on a significance level = test size = Prob(type I error) = α.
5. The test size defines a rejection region and a critical value (the changeover point).
6. Reject H0 if the Test Statistic lies in the Rejection Region.
Hypothesis Testing and the Standard Error of β̂1 (Section 5.1)
The objective is to test a hypothesis, like β1 = 0, using data: to reach a tentative conclusion whether the (null) hypothesis is correct or incorrect.
General setup
Null hypothesis and two-sided alternative:
H0: β1 = β1,0 vs. H1: β1 ≠ β1,0
where β1,0 is the hypothesized value under the null.
Null hypothesis and one-sided alternative:
H0: β1 = β1,0 vs. H1: β1 < β1,0
General approach: construct the t-statistic, and compute the p-value (or compare to the N(0,1) critical value)
In general: $t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}$
where the SE of the estimator is the square root of an estimator of the variance of the estimator.
For testing β1: $t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$
where SE(β̂1) = the square root of an estimator of the variance of the sampling distribution of β̂1.
Comparing the distance between the estimate and your hypothesized value is obvious; doing it in units of volatility is less so.
Summary: To test H0: β1 = β1,0 vs. H1: β1 ≠ β1,0,
Construct the t-statistic $t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$
Reject at the 5% significance level if |t| > 1.96
This procedure relies on the large-n approximation; typically n = 30 is large enough for the approximation to be excellent.
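The summary above in code, using the figures from the test-score example.

```python
b1_hat = -2.28       # estimated slope (from the slides)
se_b1 = 0.52         # its standard error
beta_null = 0.0      # hypothesized value under H0

t = (b1_hat - beta_null) / se_b1
reject_at_5pct = abs(t) > 1.96        # two-sided 5% rule

assert round(t, 2) == -4.38
assert reject_at_5pct
```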
p-values are another method
See textbook pp 72-81, and look up p-value in the index
1. What values of the test statistic would make you more
determined to reject the null than you are now?
2. If the null is true, what is the probability of obtaining
those values? This is the p-value.
"The p-value, also called the significance probability [not in QBA], is the probability of drawing a statistic at least as adverse to the null hypothesis as the one you actually computed in your sample, assuming the null hypothesis is correct." (p. 73)
p-values are another method
See textbook pp 72-81, and look it up in the index
For a two-sided test, the p-value is p = Pr[|t| > |t_act|] = the probability in the tails of the normal outside |t_act|.
You reject at the 5% significance level if the p-value is < 5% (or < 1% or < 10%, depending on test size).
REJECT H0 IF P-VALUE < α
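A sketch of the two-sided p-value calculation, using the standard normal CDF built from `math.erf` (no external packages); the t-statistic is the one from the test-score example.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

t_act = -4.38                                   # calculated t-statistic from the example
p_value = 2.0 * (1.0 - normal_cdf(abs(t_act)))  # two-sided p-value

assert p_value < 0.05      # reject H0 at the 5% level
```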
Example: Test Scores and STR,
California data
Estimated regression line: TestScore = 698.9 – 2.28 STR
Regression software reports the standard errors: SE(β̂0) = 10.4, SE(β̂1) = 0.52
t-statistic testing β1,0 = 0: $t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} = \frac{-2.28 - 0}{0.52} = -4.38$
The 1% two-sided critical value is 2.58, so we reject the null at the 1% significance level.
Alternatively, we can compute the p-value…
(The standard errors are corrected for heteroskedasticity)
The p-value based on the large-n standard normal approximation to the t-statistic is 0.00001 (10⁻⁵).
Hypothesis Testing Can be Tricky
Dependent Variable: TESTSCR
Method: Least Squares
Date: 06/05/08 Time: 22:35
Sample: 1 420
Included observations: 420 Coefficient Std. Error t-Statistic Prob.
C 655.1223 1.126888 581.3553 0.0000
COMPUTER -0.003183 0.002106 -1.511647 0.1314
'Prob' only equals the p-value for a two-sided test
Try These Hypotheses
(a) H0: B1 = 0, H1: B1 > 0, with α = .05, using the critical-values approach
(b) H0: B1 = 0, H1: B1 < 0, with α = .05, using the critical-values approach
(c) H0: B1 = 0, H1: B1 ≠ 0, with α = .05, using the critical-values approach
(d) H0: B1 = 0, H1: B1 > 0, with α = .05, using the p-value approach
(e) H0: B1 = 0, H1: B1 < 0, with α = .05, using the p-value approach
(f) H0: B1 = 0, H1: B1 ≠ 0, with α = .05, using the p-value approach
(g) H0: B1 = -.05, H1: B1 < -.05, with α = .10
(d) H0: B1 = 0, H1: B1 > 0, with α = .05, using the p-value approach
This is very hard to do with p-values!
Outline
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals
95% Confidence Intervals Catch the True Parameter 95% of the Time
$\Pr\!\left(-1.96 \le \frac{\hat{\beta}_1-\beta_1}{SE(\hat{\beta}_1)} \le 1.96\right) = .95$
$\Pr\!\left(-1.96\,SE(\hat{\beta}_1) \le \hat{\beta}_1-\beta_1 \le 1.96\,SE(\hat{\beta}_1)\right) = .95$
$\Pr\!\left(\hat{\beta}_1-1.96\,SE(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1+1.96\,SE(\hat{\beta}_1)\right) = .95$
So β1 will be captured by the random interval $\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)$ with probability 0.95.
http://bcs.whfreeman.com/bps4e/content/cat_010/applets/confidenceinterval.html
Confidence Intervals are Reasonable Ranges
If we cannot reject H0: β1 = β1,0 in favour of H1 at, say, 5%, then
$|t| = \frac{|\hat{\beta}_1-\beta_{1,0}|}{SE(\hat{\beta}_1)} \le 1.96$, i.e. $-1.96 \le \frac{\hat{\beta}_1-\beta_{1,0}}{SE(\hat{\beta}_1)} \le 1.96$
implying $\hat{\beta}_1-1.96\,SE(\hat{\beta}_1) \le \beta_{1,0} \le \hat{\beta}_1+1.96\,SE(\hat{\beta}_1)$
But this just says β1,0 must lie in a 95% CI.
Going the other way, we can define a (1 – α) Confidence Interval as the range of values that could not be rejected as nulls in a two-sided significance test with size α.
Confidence interval example: Test Scores and STR
Estimated regression line: TestScore = 698.9 – 2.28 STR
SE(β̂0) = 10.4, SE(β̂1) = 0.52
95% confidence interval for β1:
{β̂1 ± 1.96 SE(β̂1)} = {–2.28 ± 1.96 × 0.52} = (–3.30, –1.26)
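The slide's interval can be reproduced directly from the estimate and its standard error.

```python
b1_hat, se_b1 = -2.28, 0.52      # estimate and standard error from the slide

lower = b1_hat - 1.96 * se_b1
upper = b1_hat + 1.96 * se_b1

assert round(lower, 2) == -3.3   # slide reports (-3.30, -1.26)
assert round(upper, 2) == -1.26
```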
If You Make 1→1 Associations, Use
Simple Regression, not Correlation
1. OLS Assumptions
2. OLS Sampling Distribution
3. Hypothesis Testing
4. Confidence Intervals
But be careful of the Simple Regression assumption
Cov(Xt, ut)=0
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator
It's all about u
(SW Section 6.1)
The error u arises because of factors that influence Y but are not included in the regression function; so, there are always omitted variables.
Sometimes, the omission of those variables can lead to bias in the OLS estimator. This occurs because assumption 4, Cov(Xt, ut) = 0, is violated.
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator
Omitted variable bias=OVB
The bias in the OLS estimator that occurs as a result of an
omitted factor is called omitted variable bias.
Let y = β0 + β1x + u and let u = f(Z).
Omitted variable bias is a problem if the omitted factor "Z" is:
1. A determinant of Y (i.e. Z is part of u); and
2. Correlated with the regressor X (i.e. corr(Z,X) ≠ 0)
Both conditions must hold for the omission of Z to result in omitted variable bias.
What Causes Long Life?
Gapminder (http://www.gapminder.org/world) is an online
applet that contains demographic information about each
country in the world.
Suppose that we are interested in predicting life
expectancy, and think that both income per capita and the
number of physicians per 1000 people would make good
indicators.
Our first step would be to graph these predictors against
life expectancy
We find that both are positively correlated with life expectancy
…Doctors or Income or Both?
Simple Linear Regression only allows us to use
one of these predictors to estimate life expectancy.
But income per capita is correlated with the
number of physicians per 1000 people. Suppose
the truth is:
Life=B0+B1Income+B2Doctors+u but you run
Life=B0+B1Income+u* (u*=B2Doctors+u)
OVB = 'Double Counting'
B1 is the impact of Income on Life, holding
everything else constant including the residual
But if correlation exists between the Doctors (in
the residual) and income (rIncDoct≠0 ), and, if the
true impact of Doctors (B2≠0) is non-zero, then B1
counts both effects – it „double counts‟
Life=B0+B1Income+u* (u*=B2Doctors+u)
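The double-counting story can be illustrated with a small simulation. Everything here (the coefficients, the sample size, the variable names) is hypothetical; the point is only that the short regression's slope lands well above the true value of 2.0 because the omitted variable's effect is absorbed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process: z affects y AND is correlated with x.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)                    # corr(x, z) != 0
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)    # true slope on x is 2.0

# "Short" regression of y on x alone: z is pushed into the error term.
b1_short = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Double counting: the estimate absorbs part of z's effect
# (plim is 2.0 + 3.0 * cov(x, z) / var(x) = 2.0 + 2.4 / 1.64, about 3.46).
assert 3.0 < b1_short < 4.0
```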
Our Test score Reg has OVB
In the test score example:
1. English language deficiency (whether the student is learning
English) plausibly affects standardized test scores: Z is a
determinant of Y.
2. Immigrant communities tend to be less affluent and thus
have smaller school budgets – and higher STR: Z is
correlated with X.
Accordingly, β̂1 is biased.
What is the bias? We have a formula
STR is larger for those classes with a higher PctEL (both being a feature of poorer areas), so the correlation between STR and PctEL will be positive.
PctEL appears in u with a negative sign in front of it: higher PctEL leads to lower scores. Therefore the correlation between STR and u [minus PctEL] must be negative (ρXu < 0).
Here is the formula (standard deviations are always positive):
$\text{Bias} = \hat{\beta}_1 - \beta_1 \approx \rho_{Xu}\,\frac{\sigma_u}{\sigma_X} < 0$
The volatility of the error and of the included variable matter.
So the coefficient of student-teacher ratio is negatively biased by the exclusion of the percentage of English learners. It is 'too big' in absolute value.
Including PctEL Solves Problem
Some ways to overcome omitted variable bias
1. Run a randomized controlled experiment in which treatment
(STR) is randomly assigned: then PctEL is still a determinant
of TestScore, but PctEL is uncorrelated with STR. (But this is
unrealistic in practice.)
2. Adopt the “cross tabulation” approach, with finer gradations
of STR and PctEL – within each group, all classes have the
same PctEL, so we control for PctEL (But soon we will run
out of data, and what about other determinants like family
income and parental education?)
3. Use a regression in which the omitted variable (PctEL) is no
longer omitted: include PctEL as an additional regressor in a
multiple regression.
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator
The Population Multiple Regression
Model (SW Section 6.2)
Consider the case of two regressors:
Yi = β0 + β1X1i + β2X2i + ui, i = 1,…,n
Y is the dependent variable
X1, X2 are the two independent variables (regressors)
(Yi, X1i, X2i) denote the ith observation on Y, X1, and X2.
β0 = unknown population intercept
β1 = effect on Y of a change in X1, holding X2 constant
β2 = effect on Y of a change in X2, holding X1 constant
ui = the regression error (omitted factors)
Partial Derivatives in Multiple
Regression = Cet. Par. in Economics
Yi = β0 + β1X1i + β2X2i + ui, i = 1,…,n
We can use calculus to interpret the coefficients:
β1 = ∂Y/∂X1, holding X2 constant = Ceteris Paribus
β2 = ∂Y/∂X2, holding X1 constant = Ceteris Paribus
β0 = predicted value of Y when X1 = X2 = 0.
The OLS Estimator in Multiple
Regression (SW Section 6.3)
With two regressors, the OLS estimator solves:
$\min_{b_0,b_1,b_2}\ \sum_{i=1}^{n}\left[Y_i - (b_0 + b_1X_{1i} + b_2X_{2i})\right]^2$
The OLS estimator minimizes the average squared difference between the actual values of Yi and the prediction (predicted value) based on the estimated line.
This minimization problem is solved using calculus.
This yields the OLS estimators of β0, β1 and β2.
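A sketch of the two-regressor minimization using `numpy.linalg.lstsq`, which solves exactly this least-squares problem; the data-generating coefficients are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Made-up two-regressor model for the sketch.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 4.0 - 1.5 * x1 + 0.5 * x2 + rng.normal(size=n)

# Design matrix with a constant column; lstsq minimizes the
# sum of squared prediction errors over (b0, b1, b2).
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

b0, b1, b2 = b
# Estimates land near the data-generating coefficients.
assert abs(b0 - 4.0) < 0.2 and abs(b1 + 1.5) < 0.2 and abs(b2 - 0.5) < 0.2
```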
Multiple regression in EViews
TestScore = 686.0 – 1.10 STR – 0.65 PctEL
More on this printout later…
Dependent Variable: TESTSCR
Method: Least Squares
Sample: 1 420
Included observations: 420
White Heteroskedasticity-Consistent Standard Errors & Covariance
TESTSCR=C(1)+C(2)*STR+C(3)*EL_PCT
Coefficient Std. Error t-Statistic Prob.
C(1) 686.0322 8.728224 78.59930 0.0000
C(2) -1.101296 0.432847 -2.544307 0.0113
C(3) -0.649777 0.031032 -20.93909 0.0000
R-squared 0.426431 Mean dependent var 654.1565
Adjusted R-squared 0.423680 S.D. dependent var 19.05335
S.E. of regression 14.46448 Akaike info criterion 8.188387
Sum squared resid 87245.29 Schwarz criterion 8.217246
Log likelihood -1716.561 Durbin-Watson stat 0.685575
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator
Measures of Fit for Multiple
Regression (SW Section 6.4)
R2 now becomes the square of the correlation coefficient
between y and predicted y.
It is still the proportional reduction in the residual sum of
squares as we move from modeling y with just a sample
mean, to modeling it with a group of variables.
R2 and R̄2
The R2 is the fraction of the variance explained: the same definition as in regression with a single regressor:
$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$
where $ESS = \sum_{i=1}^{n}(\hat{Y}_i-\bar{Y})^2$, $SSR = \sum_{i=1}^{n}\hat{u}_i^2$, $TSS = \sum_{i=1}^{n}(Y_i-\bar{Y})^2$.
The R2 always increases when you add another regressor (why?): a bit of a problem for a measure of "fit".
R2 and R̄2
The R̄2 (the "adjusted R2") corrects this problem by "penalizing" you for including another regressor: the R̄2 does not necessarily increase when you add another regressor.
Adjusted R2:
$\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\cdot\frac{SSR}{TSS} = 1 - \frac{n-1}{n-k-1}\,(1-R^2)$
Note that R̄2 < R2; however, if n is large the two will be very close.
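The adjusted-R2 formula can be checked against the EViews printout figures shown earlier (R2 = 0.426431, n = 420, k = 2 regressors).

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k regressors (excluding the constant)."""
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)

# Figures from the two-regressor test-score printout.
r2_bar = adjusted_r2(0.426431, n=420, k=2)

assert round(r2_bar, 3) == 0.424     # matches "Adjusted R-squared" in the output
assert r2_bar < 0.426431             # the penalty always pulls it below R2
```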
Measures of fit, ctd.
Test score example:
(1) TestScore = 698.9 – 2.28 STR, R2 = .05, SER = 18.6
(2) TestScore = 686.0 – 1.10 STR – 0.65 PctEL, R2 = .426, R̄2 = .424, SER = 14.5
What, precisely, does this tell you about the fit of regression (2) compared with regression (1)?
Why are the R2 and the R̄2 so close in (2)?
Outline
1. Omitted variable bias
2. Multiple regression and OLS
3. Measures of fit
4. Sampling distribution of the OLS estimator
Sampling Distribution Depends on
Least Squares Assumptions (SW Section 6.5)
yi = B0 + B1x1i + B2x2i + … + Bkxki + ui
• E(ut) = 0
1. E(ut²) = σ² = SER² (note: σ², not σt²: invariant over t)
2. E(ut·us) = 0 for t ≠ s
3. Cov(Xt, ut) = 0
4. ut ~ Normal, plus
5. There is no perfect multicollinearity
Assumption #4: There is no perfect multicollinearity
Perfect multicollinearity is when one of the regressors is an
exact linear function of the other regressors.
Example: Suppose you accidentally include STR twice:
Perfect multicollinearity is when one of the regressors is an
exact linear function of the other regressors.
In such a regression (where STR is included twice), β1 is the effect on TestScore of a unit change in STR, holding STR constant (???)
The Standard Errors become Infinite when perfect
multicollinearity exists
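A sketch of what "including STR twice" does to the design matrix. The simulated data is hypothetical; the point is the lost rank, not the numbers.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
str_col = rng.normal(20.0, 2.0, size=n)   # hypothetical STR data

# Include STR twice: one regressor is an exact linear function of another.
X = np.column_stack([np.ones(n), str_col, str_col])

# The design matrix loses a rank, so X'X is singular and the OLS
# normal equations have no unique solution ("infinite" standard errors).
rank = np.linalg.matrix_rank(X)
cond = np.linalg.cond(X.T @ X)

assert rank == 2          # 3 columns, but rank only 2
assert cond > 1e6         # effectively singular
```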
OLS Wonder Equation
$SE(b_i) \approx \frac{S_{\hat{u}}}{S_{x_i}\,\sqrt{n}\,\sqrt{1-R^2_{x_i\ \text{on other}\ x\text{'s}}}}$
• Multicollinearity increases $R^2_{x_i\ \text{on other}\ x\text{'s}}$ and therefore increases the variance of bi
• Perfect multicollinearity (R2 = 1) makes regression impossible
• Expect a higher standard error for bi the more regressors you add: the more you add, the higher the R-squared in the denominator becomes, because it always rises with extra variables.
Quality of Slope Estimate (with R2 and $S_{\hat{u}}$ fixed)
(Figure: three scatterplots. With the same spread of x, n = 6 gives a high SE(bi) while n = 20 gives a low SE(bi); with n = 6 fixed, a larger spread of x (bigger $S_{x_i}^2$) also gives a low SE(bi).)
The Sampling Distribution of the
OLS Estimator (SW Section 6.6)
Under the Least Squares Assumptions,
The exact (finite sample) distribution of β̂1 has mean β1, and var(β̂1) is inversely proportional to n; so too for β̂2.
Other than its mean and variance, the exact (finite-n) distribution of β̂1 is very complicated; but for large n…
β̂1 is consistent: $\hat{\beta}_1 \xrightarrow{p} \beta_1$ (law of large numbers)
$\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)}$ is approximately distributed N(0,1) (CLT)
So too for β̂2, …, β̂k
Conceptually, there is nothing new here!
Multicollinearity, Perfect and
Imperfect (SW Section 6.7)
Some more examples of perfect multicollinearity
The example from earlier: you include STR twice.
Second example: regress TestScore on a constant, D, and Bel, where Di = 1 if STR ≤ 20, = 0 otherwise, and Beli = 1 if STR > 20, = 0 otherwise. So Beli = 1 – Di and there is perfect multicollinearity, because Bel + D = 1 (the '1' variable for the constant).
To fix this, drop the constant.
Perfect multicollinearity, ctd.
Perfect multicollinearity usually reflects a mistake in the
definitions of the regressors, or an oddity in the data
If you have perfect multicollinearity, your statistical software
will let you know – either by crashing or giving an error
message or by “dropping” one of the variables arbitrarily
The solution to perfect multicollinearity is to modify your list
of regressors so that you no longer have perfect
multicollinearity.
Imperfect multicollinearity
Imperfect and perfect multicollinearity are quite different despite
the similarity of the names.
Imperfect multicollinearity occurs when two or more regressors
are very highly correlated.
Why this term? If two regressors are very highly
correlated, then their scatterplot will pretty much look like a
straight line – they are collinear – but unless the correlation
is exactly 1, that collinearity is imperfect.
145
146
Imperfect multicollinearity, ctd.
Imperfect multicollinearity implies that one or more of the
regression coefficients will be imprecisely estimated.
Intuition: the coefficient on X1 is the effect of X1 holding X2
constant; but if X1 and X2 are highly correlated, there is very
little variation in X1 once X2 is held constant – so the data are
pretty much uninformative about what happens when X1
changes but X2 doesn‟t, so the variance of the OLS estimator
of the coefficient on X1 will be large.
Imperfect multicollinearity (correctly) results in large
standard errors for one or more of the OLS coefficients as
described by the OLS wonder equation
Next topic: hypothesis tests and confidence intervals…
146
Portion of X that “explains” Y
[Venn diagram: circles for Y and X with a large overlap (high R²). For any two circles, the overlap tells the size of the R².]
147
Portion of X that “explains” Y
[Venn diagram: circles for Y and X with a small overlap (low R²). For any two circles, the overlap tells the size of the R².]
148
Imperfect (but high)
multicollinearity
[Venn diagram: circles for Y, X1, and X2, with X1 and X2 overlapping heavily.]
Since X2 and X1 share a lot of the same information, adding X2 allows us to work out independent effects better, but we realize we don't have much information (area) to do this with. Larger n makes all circles bigger and, as before, the overlap tells the size of R²

SE(b₁) = S_û / [ √n · S_x1 · √(1 − R²_{x1 on x2}) ]
150
Multiple Coefficients Tests?
We know how to obtain estimates of the
coefficients, and each one is a ceteris paribus („all
other things equal‟) effect
Why would we want to do hypothesis tests about
groups of coefficients?
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
152
Multiple Coefficients Tests
Example 1: Consider the statement that „this whole model is worthless‟.
One way of making that statement mathematically formal is to say
β1 = β2 = … = βk = 0
because if this is true then none of the variables x1, x2, …, xk helps explain y
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
153
Multiple Coefficients Tests
Example 2: Suppose y is the share of the population that votes for the ruling party and x1 and x2 are the spending on TV and radio advertising.
The Prime Minister might want to know if TV is more effective than radio, as measured by the impact on the share of the popular vote of an extra dollar spent on each. The way to write this mathematically is
β1 > β2
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
154
Multiple Coefficients Tests
Example 3: Suppose y is the growth in GDP, x1 and x2 are the
cash rate one and two quarters ago, and that all the other X‟s are different macroeconomic variables.
Suppose we are interested in testing the effectiveness of monetary policy. One way of doing this is asking if the cash rate at any lag has an impact on GDP growth. Mathematically, this is
β1 = β2 = 0
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
155
Multiple Coefficients Tests
In each case, we are interested in making statements about groups of coefficients.
What about just looking at the estimates?
Same problem as in t-testing. You ought to care about reliability.
What about sequential testing?
errors compound, even if possible (SW Sect. 7.2)
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
156
Multiple Coefficients Tests
The so-called F-test can do all of these
restrictions, except for example 2.
Before turning to the F-test, let‟s do example 2,
which can be done with a t-test
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
157
Example 2 Solution
Define δ = β1 − β2, so that β1 = δ + β2. Substituting this in:
yt = β0 + (δ + β2)x1t + β2x2t + … + et
= β0 + δx1t + β2(x1t + x2t) + … + et
So, to test β1 > β2, just run a new regression including x1 + x2 instead of x2 (everything else is left the same) and do a t-test for H0: δ = 0 vs. H1: δ > 0. Naturally, if you accept H1: δ > 0, this implies β1 − β2 > 0, which implies β1 > β2
This technique is called reparameterization
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
158
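A sketch of the reparameterization trick on simulated data (my own toy example, not from the slides: the true coefficients are β1 = 2 and β2 = 1, so δ = β1 − β2 = 1). The coefficient on x1 in the reparameterized regression is δ, which an ordinary t-test can handle:

```python
import numpy as np

# Simulated data with beta1 = 2, beta2 = 1, so delta = 1. Regress y on x1
# and (x1 + x2): the coefficient on x1 is then delta = beta1 - beta2.
rng = np.random.default_rng(2)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x1 + x2])   # reparameterized regressors
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])
cov = s2 * np.linalg.inv(X.T @ X)
t_delta = beta[1] / np.sqrt(cov[1, 1])
print(beta[1], t_delta)  # delta-hat near 1, with a clearly significant t
```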
Restricted Regressions
One more thing before we do the F-test, we
must define a „restricted regression‟. This is
just the model you get when a hypothesis is
assumed true
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
159
Restricted Regression: Example 1
Example 1: Consider the statement that "this whole model is worthless".
If β1 = β2 = … = βk = 0 then the model is
yt = β0 + et
and the restricted regression would be an OLS regression of y on a constant. The estimate for the constant will just be the sample mean of y.
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
160
Restricted Regression: Example 3
If β1 = β2 = 0 then the model is
yt = β0 + β3x3t + … + βkxkt + et
and the restricted regression is an OLS regression of y on a constant and x3 to xk
yt = β0 + β1x1t + β2x2t + … + βkxkt + et
161
Properties of Restricted
Regressions
Imposing a restriction always increases the residual sum of squares, since you are forcing the estimates to take the values implied by the restriction, rather than letting OLS choose the values of the estimates to minimize the SSR
If the SSR increases a lot, it implies that the restriction is relatively „unbelievable‟. That is, the model fits a lot worse with the restriction imposed.
This last point is the basic intuition of the F-test – impose the restriction and see if SSR goes up "too much". http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html
162
The F-test
To test a restriction we need to run the restricted regression as well as the unrestricted regression (i.e. the original regression). Let q be the number of restrictions.
Intuitively, we want to know if the change in SSR is big enough to suggest the restriction is wrong:

F = [ (SSRr − SSRur) / q ] / [ SSRur / (n − k − 1) ]

where r is restricted and ur is unrestricted
163
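A sketch of this recipe on simulated data (my own toy model, numpy OLS rather than EViews): the true slopes are nonzero, so restricting them to zero inflates the SSR and yields a large F.

```python
import numpy as np

# Simulated model with nonzero slopes; test the q = 2 restriction
# beta1 = beta2 = 0 by comparing restricted and unrestricted SSRs.
rng = np.random.default_rng(3)
n, k = 300, 2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 0.8 * x1 - 0.5 * x2 + rng.normal(size=n)

def ssr(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

ssr_ur = ssr(np.column_stack([np.ones(n), x1, x2]))  # unrestricted
ssr_r = ssr(np.ones((n, 1)))                         # restricted: intercept only
q = 2
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
print(F)  # large F: the restricted model fits much worse
```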
The F statistic
The F statistic is always positive, since the SSR from the restricted model can‟t be less than the SSR from the unrestricted
Essentially the F statistic is measuring the relative increase in SSR when moving from the unrestricted to restricted model
q = number of restrictions
164
The F statistic (cont)
To decide if the increase in SSR when we move to
a restricted model is “big enough” to reject the
restrictions, we need to know about the sampling
distribution of our F stat
Not surprisingly, F ~ Fq,n-k-1, where q is referred to
as the numerator degrees of freedom and n – k-1 as
the denominator degrees of freedom
165
The R2 form of the F statistic
Because the SSR‟s may be large and unwieldy, an
alternative form of the formula is useful
We use the fact that SSR = TSS(1 – R²) for any regression, so we can substitute in for SSRr and SSRur:

F = [ (R²ur − R²r) / q ] / [ (1 − R²ur) / (n − k − 1) ]

where, again, r is restricted and ur is unrestricted
168
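A quick check on simulated toy data (invented for illustration) that the SSR form and the R² form agree, as they must since SSR = TSS(1 − R²) for each model:

```python
import numpy as np

# Fit unrestricted and restricted models (dropping q regressors) and compute
# the F statistic both ways; the two forms are algebraically identical.
rng = np.random.default_rng(4)
n, k, q = 200, 3, 2
X_ur = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(k)])
y = X_ur @ np.array([1.0, 0.5, -0.4, 0.3]) + rng.normal(size=n)
X_r = X_ur[:, :k + 1 - q]  # restricted: drop the last q regressors

def fit(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ssr, 1 - ssr / tss

ssr_ur, r2_ur = fit(X_ur)
ssr_r, r2_r = fit(X_r)
F_ssr = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
F_r2 = ((r2_ur - r2_r) / q) / ((1 - r2_ur) / (n - k - 1))
print(F_ssr, F_r2)  # identical up to floating-point error
```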
Overall Significance (example 1)
A special case of exclusion restrictions is to test H0: β1 = β2 = … = βk = 0
R² = 0 for a model with only an intercept
This is because the OLS estimator is then just the sample mean, implying TSS = SSR
The F statistic is then

F = [ R² / k ] / [ (1 − R²) / (n − k − 1) ]
169
Dependent Variable: TESTSCR
Method: Least Squares
Date: 06/05/08 Time: 15:29
Sample: 1 420
Included observations: 420
Coefficient   Std. Error   t-Statistic   Prob.
C 675.6082 5.308856 127.2606 0.0000
MEAL_PCT -0.396366 0.027408 -14.46148 0.0000
AVGINC 0.674984 0.083331 8.100035 0.0000
STR -0.560389 0.228612 -2.451272 0.0146
EL_PCT -0.194328 0.031380 -6.192818 0.0000
R-squared 0.805298 Mean dependent var 654.1565
Adjusted R-squared 0.803421 S.D. dependent var 19.05335
S.E. of regression 8.447723 Akaike info criterion 7.117504
Sum squared resid 29616.07 Schwarz criterion 7.165602
Log likelihood -1489.676 Hannan-Quinn criter. 7.136515
F-statistic 429.1152 Durbin-Watson stat 1.545766
Prob(F-statistic) 0.000000
F = [ R² / k ] / [ (1 − R²) / (n − k − 1) ]
= [0.8053/4] / [(1 − 0.8053)/(420 − 5)] ≈ 429
170
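The arithmetic can be checked directly from the EViews output above:

```python
# Reproducing the overall-significance F statistic from the regression
# output: R^2 = 0.8053, k = 4 regressors, n = 420 observations.
R2, k, n = 0.8053, 4, 420
F = (R2 / k) / ((1 - R2) / (n - k - 1))
print(round(F))  # 429, matching the reported F-statistic of 429.1
```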
General Linear Restrictions
The basic form of the F statistic will work for any
set of linear restrictions
First estimate the unrestricted model and then
estimate the restricted model
In each case, make note of the SSR
Imposing the restrictions can be tricky – will likely
have to redefine variables again
171
F Statistic Summary
Just as with t statistics, p-values can be calculated by looking up the percentile in the appropriate F distribution
If only one exclusion is being tested, then F = t², and the p-values will be the same
F-tests are done mechanically – you don't have to do the restricted regressions (though you have to understand how to do them for this course).
172
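The F = t² claim for a single exclusion can be verified on simulated toy data (my own example, plain numpy OLS):

```python
import numpy as np

# For a single exclusion restriction (q = 1), the F statistic equals the
# square of the t statistic on the excluded coefficient.
rng = np.random.default_rng(7)
n, k = 150, 2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.7 * x2 + rng.normal(size=n)

X_ur = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X_ur, y, rcond=None)
resid = y - X_ur @ beta
s2 = resid @ resid / (n - k - 1)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X_ur.T @ X_ur)))
t2 = (beta[2] / se[2]) ** 2               # squared t statistic on x2

X_r = np.column_stack([np.ones(n), x1])   # restricted model: drop x2
beta_r, *_ = np.linalg.lstsq(X_r, y, rcond=None)
ssr_r = np.sum((y - X_r @ beta_r) ** 2)
ssr_ur = resid @ resid
F = (ssr_r - ssr_ur) / (ssr_ur / (n - k - 1))
print(F, t2)  # equal up to floating-point error
```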
F-tests are Easy in EViews
To test hypotheses like these in EViews, use the Wald test. After you run your regression, type "View, Coefficient tests, Wald"
Try testing a single restriction (which you can use a t-test for) and see that t2=F, and, that the p-values are the same.
Try testing all the coefficients except the intercept are zero, and compare it with the F-test automatically calculated in EVIEWs.
SW discusses the shortcomings of F-tests at length. They crucially depend upon the assumption of homoskedasticity.
173
Start Big and Go Small
General to Specific Modeling relies upon the fact that omitted variable bias is a serious problem.
Start with a very big model to avoid OVB
Do t-tests on individual coefficients. Delete the most
insignificant, run the model again, delete the most
insignificant variable, run the model again, and so
on….until every individual coefficient is significant.
Finally, do an F-test on the original model excluding all
the coefficients required to get to your final model at once.
If the null is accepted, you have verified the model.
Test for Hetero, and correct for it if need be.
174
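A minimal sketch of this procedure on simulated data (invented example, plain numpy rather than EViews; |t| ≥ 1.96 is used as the cutoff in place of exact p-values):

```python
import numpy as np

# General-to-specific sketch: x1 and x2 matter, x3 and x4 are pure noise.
# Repeatedly drop the least significant regressor (smallest |t|) until
# every remaining coefficient is individually significant.
rng = np.random.default_rng(5)
n = 400
X_all = rng.normal(size=(n, 4))
y = 1.0 + 1.2 * X_all[:, 0] - 0.9 * X_all[:, 1] + rng.normal(size=n)
names = ["x1", "x2", "x3", "x4"]
cols = list(range(4))

while True:
    X = np.column_stack([np.ones(n), X_all[:, cols]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    t = np.abs(beta[1:] / se[1:])        # skip the intercept
    if t.min() >= 1.96 or len(cols) == 1:
        break
    cols.pop(int(np.argmin(t)))          # drop the least significant variable

kept = [names[c] for c in cols]
print(kept)  # x1 and x2 should survive; the noise variables usually do not
```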
176
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables
1. Nonlinear regression functions – general comments
2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions
176
177
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables
1. Nonlinear regression functions – general comments
2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions
177
178
Nonlinear Regression Population Regression
Functions – General Ideas (SW Section 8.1)
If a relation between Y and X is nonlinear:
The effect on Y of a change in X depends on the value of X –
that is, the marginal effect of X is not constant
A linear regression is mis-specified – the functional form is
wrong
The estimator of the effect on Y of X is biased – it needn‟t
even be right on average.
The solution to this is to estimate a regression function that is
nonlinear in X
178
179
Nonlinear Functions of a Single
Independent Variable (SW Section 8.2)
We‟ll look at two complementary approaches:
1. Polynomials in X
The population regression function is approximated by a
quadratic, cubic, or higher-degree polynomial
2. Logarithmic transformations
Y and/or X is transformed by taking its logarithm
this gives a “percentages” interpretation that makes sense
in many applications
179
180
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables
1. Nonlinear regression functions – general comments
2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions
180
181
2. Polynomials in X
Approximate the population regression function by a polynomial:
Yi = β0 + β1Xi + β2Xi^2 + … + βrXi^r + ui
This is just the linear multiple regression model – except that
the regressors are powers of X!
Estimation, hypothesis testing, etc. proceeds as in the
multiple regression model using OLS
The coefficients are difficult to interpret, but the regression
function itself is interpretable
181
182
Example: the TestScore – Income relation
Incomei = average district income in the i-th district (thousands of dollars per capita)
Quadratic specification:
TestScorei = β0 + β1Incomei + β2(Incomei)² + ui
Cubic specification:
TestScorei = β0 + β1Incomei + β2(Incomei)² + β3(Incomei)³ + ui
182
183
Estimation of the quadratic
specification in EViews
Test the null hypothesis of linearity against the alternative that
the regression function is a quadratic….
Dependent Variable: TESTSCR
Method: Least Squares
Sample: 1 420
Included observations: 420
White Heteroskedasticity-Consistent Standard Errors & Covariance
TESTSCR=C(1)+C(2)*AVGINC + C(3)*AVGINC*AVGINC
Coefficient Std. Error t-Statistic Prob.
C(1) 607.3017 2.901754 209.2878 0.0000
C(2) 3.850995 0.268094 14.36434 0.0000
C(3) -0.042308 0.004780 -8.850509 0.0000
R-squared 0.556173 Mean dependent var 654.1565
Adjusted R-squared 0.554045 S.D. dependent var 19.05335
S.E. of regression 12.72381 Akaike info criterion 7.931944
Sum squared resid 67510.32 Schwarz criterion 7.960803
Log likelihood -1662.708 Durbin-Watson stat 0.951439
Create a quadratic regressor
183
184
Interpreting the estimated
regression function:
(a) Plot the predicted values
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2
(2.9) (0.27) (0.0048)
184
185
Interpreting the estimated
regression function, ctd:
(b) Compute "effects" for different values of X
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2
(2.9) (0.27) (0.0048)
Predicted change in TestScore for a change in income from $5,000 per capita to $6,000 per capita:
ΔTestScore = [607.3 + 3.85×6 – 0.0423×6²] – [607.3 + 3.85×5 – 0.0423×5²]
= 3.4
185
186
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2
Predicted “effects” for different values of X:
Change in Income ($1000 per capita)    ΔTestScore
from 5 to 6 3.4
from 25 to 26 1.7
from 45 to 46 0.0
The “effect” of a change in income is greater at low than high
income levels (perhaps, a declining marginal benefit of an
increase in school budgets?)
Caution! What is the effect of a change from 65 to 66?
Don’t extrapolate outside the range of the data!
186
187
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2
Predicted “effects” for different values of X:
Change in Income ($1000 per capita)    ΔTestScore
from 5 to 6 3.4
from 25 to 26 1.7
from 45 to 46 0.0
Alternatively, dTestScore/dIncome = 3.85 − 0.0846×Income gives the same numbers (approximately)
187
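The table's numbers follow directly from the fitted equation; a quick numeric check:

```python
# Numeric check of the slide's quadratic effects, using the fitted
# coefficients TestScore = 607.3 + 3.85*Income - 0.0423*Income^2.
def predict(inc):
    return 607.3 + 3.85 * inc - 0.0423 * inc ** 2

for a, b in [(5, 6), (25, 26), (45, 46)]:
    print(a, "->", b, round(predict(b) - predict(a), 1))  # 3.4, 1.7, 0.0

# The derivative 3.85 - 0.0846*Income at each midpoint gives the same numbers
for inc in (5.5, 25.5, 45.5):
    print(inc, round(3.85 - 0.0846 * inc, 1))  # 3.4, 1.7, 0.0
```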
188
Summary: polynomial regression functions
Yi = β0 + β1Xi + β2Xi^2 + … + βrXi^r + ui
Estimation: by OLS after defining new regressors
Coefficients have complicated interpretations
To interpret the estimated regression function:
plot predicted values as a function of x
compute predicted ΔY/ΔX at different values of x
Hypotheses concerning degree r can be tested by t- and F-
tests on the appropriate (blocks of) variable(s).
Choice of degree r
plot the data; t- and F-tests, check sensitivity of estimated
effects; judgment.
188
A Final Warning: Polynomials
Can Fit Too Well
When fitting a polynomial regression function, we
need to be careful not to fit too many terms, despite
the fact that a higher order polynomial will always
fit better.
If we do fit too many terms, then any prediction
may become unrealistic.
The following applet lets us explore fitting different polynomials to some data:
http://www.scottsarra.org/math/courses/na/nc/polyRegression.html
189
3. Are Polynomials Enough?
We can investigate the appropriateness of a
regression function by graphing the regression
function over the top of the scatterplot.
For some models, we may need to transform the
data
For example, take logs of the response variable
The site below allows us to do this, exploring some common regression functions:
http://www.ruf.rice.edu/%7Elane/stat_sim/transformations/index.html
190
191
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables
1. Nonlinear regression functions – general comments
2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions
191
192
3. Logarithmic functions of Y and/or X
ln(X) = the natural logarithm of X
Logarithmic transforms permit modeling relations in
“percentage” terms (like elasticities), rather than linearly.
Here’s why:
Numerically:
ln(1.01) − ln(1) = 0.00995 − 0 = 0.00995 (correct proportional change: 0.01)
ln(40) − ln(45) = 3.6889 − 3.8067 = −0.1178 (correct proportional change: −0.1111)
In general, since d ln(x)/dx = 1/x,
ln(x + Δx) − ln(x) ≅ Δx/x = the proportional change in x (when Δx/x is small)
192
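Checking the slide's two numbers directly:

```python
import math

# ln(x + dx) - ln(x) is close to the proportional change dx/x when that
# change is small, and drifts away from it when the change is large.
print(round(math.log(1.01) - math.log(1), 5))  # 0.00995, vs exact 0.01
print(round(math.log(40) - math.log(45), 4))   # -0.1178, vs exact -0.1111
```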
193
Three log regression specifications:
Case Population regression function
I. linear-log    Yi = β0 + β1ln(Xi) + ui
II. log-linear   ln(Yi) = β0 + β1Xi + ui
III. log-log     ln(Yi) = β0 + β1ln(Xi) + ui
The interpretation of the slope coefficient differs in each case.
The interpretation is found by applying the general “before
and after” rule: “figure out the change in Y for a given change
in X.”
193
194
Summary: Logarithmic
transformations
Three cases, differing in whether Y and/or X is transformed
by taking logarithms.
The regression is linear in the new variable(s) ln(Y) and/or
ln(X), and the coefficients can be estimated by OLS.
Hypothesis tests and confidence intervals are now
implemented and interpreted “as usual.”
The interpretation of β1 differs from case to case.
Choice of specification should be guided by judgment (which
interpretation makes the most sense in your application?),
tests, and plotting predicted values
194
195
„Linear‟ Regression = Linear in
Parameters, Not Nec. Variables
1. Nonlinear regression functions – general comments
2. Polynomials
3. Logs
4. Nonlinear functions of two variables: interactions
195
Regression when X is Binary
(Section 5.3)
Sometimes a regressor is binary:
X = 1 if small class size, = 0 if not
X = 1 if female, = 0 if male
X = 1 if treated (experimental drug), = 0 if not
Binary regressors are sometimes called “dummy” variables.
So far, β1 has been called a "slope," but that doesn't make sense
if X is binary.
How do we interpret regression with a binary regressor?
196
Interpreting regressions with a
binary regressor
Yi = β0 + β1Xi + ui, where X is binary (Xi = 0 or 1):
When Xi = 0, Yi = β0 + ui
the mean of Yi is β0
that is, E(Yi|Xi=0) = β0
When Xi = 1, Yi = β0 + β1 + ui
the mean of Yi is β0 + β1
that is, E(Yi|Xi=1) = β0 + β1
so:
β1 = E(Yi|Xi=1) – E(Yi|Xi=0)
= population difference in group means
197
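This identity is easy to see on simulated data (an invented toy example, numpy OLS): with a single binary regressor, the OLS slope equals the difference in sample means between the two groups.

```python
import numpy as np

# Simulated data: y = 5 + 3*x + u with x a 0/1 dummy. The OLS slope on the
# dummy equals the difference in group sample means exactly.
rng = np.random.default_rng(6)
x = rng.integers(0, 2, size=200).astype(float)
y = 5.0 + 3.0 * x + rng.normal(size=200)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
mean_diff = y[x == 1].mean() - y[x == 0].mean()
print(beta[1], mean_diff)  # identical up to floating-point error
```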
198
Interactions Between Independent
Variables (SW Section 8.3)
Perhaps a class size reduction is more effective in some
circumstances than in others…
Perhaps smaller classes help more if there are many English
learners, who need individual attention
That is, ΔTestScore/ΔSTR might depend on PctEL
More generally, ΔY/ΔX1 might depend on X2
How to model such “interactions” between X1 and X2?
We first consider binary X‟s, then continuous X‟s
198
199
(a) Interactions between two binary
variables
Yi = β0 + β1D1i + β2D2i + ui
D1i, D2i are binary
β1 is the effect of changing D1 = 0 to D1 = 1. In this specification, this effect doesn't depend on the value of D2.
To allow the effect of changing D1 to depend on D2, include the "interaction term" D1i×D2i as a regressor:
Yi = β0 + β1D1i + β2D2i + β3(D1i×D2i) + ui
199
200
Interpreting the coefficients:
Yi = β0 + β1D1i + β2D2i + β3(D1i×D2i) + ui
It can be shown that
ΔY/ΔD1 = β1 + β3D2
The effect of D1 depends on D2 (what we wanted)
β3 = increment to the effect of D1 from a unit change in D2
200
201
Example: TestScore, STR, English learners
Let HiSTR = 1 if STR ≥ 20 (0 if STR < 20), and HiEL = 1 if PctEL ≥ 10 (0 if PctEL < 10)

TestScore = 664.1 – 18.2HiEL – 1.9HiSTR – 3.5(HiSTR×HiEL)
(1.4) (2.3) (1.9) (3.1)

"Effect" of HiSTR when HiEL = 0 is –1.9
"Effect" of HiSTR when HiEL = 1 is –1.9 – 3.5 = –5.4
Class size reduction is estimated to have a bigger effect when the percent of English learners is large
This interaction isn't statistically significant: t = 3.5/3.1 ≈ 1.13
201
202
(b) Interactions between continuous
and binary variables
Yi = β0 + β1Xi + β2Di + ui
Di is binary, X is continuous
As specified above, the effect on Y of X (holding constant D) = β1, which does not depend on D
To allow the effect of X to depend on D, include the "interaction term" Di×Xi as a regressor:
Yi = β0 + β1Xi + β2Di + β3(Di×Xi) + ui
202
203
Binary-continuous interactions: the
two regression lines
Yi = β0 + β1Xi + β2Di + β3(Di×Xi) + ui
Observations with Di = 0 (the "D = 0" group):
Yi = β0 + β1Xi + ui   (the D=0 regression line)
Observations with Di = 1 (the "D = 1" group):
Yi = β0 + β1Xi + β2 + β3Xi + ui
= (β0 + β2) + (β1 + β3)Xi + ui   (the D=1 regression line)
203
204
Binary-continuous interactions, ctd.
[Figure: three panels of the two regression lines. Panel 1 (β3 = 0): different intercepts, same slope. Panel 2 (all βi non-zero): different intercepts and slopes. Panel 3 (β2 = 0): same intercept, different slopes.]
204
205
Interpreting the coefficients:
Yi = β0 + β1Xi + β2Di + β3(Xi×Di) + ui
Or, using calculus,
ΔY/ΔX = β1 + β3D
The effect of X depends on D (what we wanted)
β3 = increment to the effect of X from a change in the level of D from D=0 to D=1
205
206
Example: TestScore, STR, HiEL
(= 1 if PctEL ≥ 10)
TestScore = 682.2 – 0.97STR + 5.6HiEL – 1.28(STR HiEL)
(11.9) (0.59) (19.5) (0.97)
When HiEL = 0:
TestScore = 682.2 – 0.97STR
When HiEL = 1,
TestScore = 682.2 – 0.97STR + 5.6 – 1.28STR
= 687.8 – 2.25STR
Two regression lines: one for each HiEL group.
Class size reduction is estimated to have a larger effect when
the percent of English learners is large.
206
207
Example, ctd: Testing hypotheses
TestScore = 682.2 – 0.97STR + 5.6HiEL – 1.28(STR HiEL)
(11.9) (0.59) (19.5) (0.97)
The two regression lines have the same slope ⟺ the coefficient on STR×HiEL is zero: t = –1.28/0.97 = –1.32
The two regression lines have the same intercept ⟺ the coefficient on HiEL is zero: t = 5.6/19.5 = 0.29
The two regression lines are the same ⟺ the population coefficient on HiEL = 0 and the population coefficient on STR×HiEL = 0: F = 89.94 (p-value < .001) !!
We reject the joint hypothesis but neither individual
hypothesis (how can this be?)
207
208
Summary: Nonlinear Regression
Functions
Using functions of the independent variables, such as ln(X)
or X1×X2, allows recasting a large family of nonlinear
regression functions as multiple regression.
Estimation and inference proceed in the same way as in
the linear multiple regression model.
Interpretation of the coefficients is model-specific, but the
general rule is to compute effects by comparing different
cases (different value of the original X‟s)
Many nonlinear specifications are possible, so you must
use judgment:
What nonlinear effect do you want to analyze?
What makes sense in your application?
208
Statistics Means Description and
Inference
Descriptive Statistics is about describing datasets.
Various visual tricks can distort these descriptions
Inferential Statistics is about statistical inference.
You know something about tricks to distort inference (e.g. putting in lots of variables to raise R², or lowering the significance level to get a variable you want into the model).
210
Pitfalls of Analysis
There are several ways that misleading statistics can occur (which affect both inferential and descriptive statistics):
Obtaining flawed data
Not understanding the data
Not choosing appropriate displays of data
Fitting an inappropriate model
Drawing incorrect conclusions from analysis.
211
How to Display Data
• The golden rule for displaying data in a graph is to keep it simple
• Graphs should not have any chart junk
– "minimise the ratio of ink to data" (Tufte)
• Axes should be chosen so they do not inflate or deflate the differences between observations
– Where possible, start the Y-axis at 0
– If this is not possible then you should consider graphing the change in the observation from one period to the next
• Some general tips on how to properly display data can be found at http://lilt.ilstu.edu/gmklass/pos138/datadisplay/sections/goodcharts.htm
215
Incorrect Conclusions: Causality
Excess money supply (%) Increase in prices two years
later (%)
1965 4.7 1967 2.5
1966 1.9 1968 4.7
1967 7.8 1969 5.4
1968 4.0 1970 6.4
1969 1.3 1971 9.4
1970 7.8 1972 7.1
1971 11.4 1973 9.2
1972 23.4 1974 16.1
1973 22.2 1975 24.2
Correlation: 0.848
Source: Grenville and Macfarlane (1988) 217
Accompanying Letter
Sir,
Professor Lord Kaldor today (March 31) states that
“there is no historical evidence whatever” that the money
supply determines the future movement of prices with
a time lag of two years. May I refer Professor Kaldor to
your article in The Times of July 13, 1976.
Data
If one calculates the correlation between these two sets
of figures the coefficient r=0.848 and since there are seven
degrees of freedom the P value is less than 0.01. If Mr
Rees-Mogg‟s figures are correct, this would appear to a
biologist to be a highly significant correlation, for it means
that the probability of the correlation occurring by chance
is less than one in a hundred. Most betting men would
think that those were impressive odds.
Until Professor Kaldor can show a fallacy in the figures,
I think Mr Rees-Mogg has fully established his point.
Yours faithfully,
IVOR H. MILLS,
University of Cambridge Clinical School,
Department of Medicine,
218
Response
Sir,
Professor Mills today (April 4) uses correlation
analysis in your columns to attempt to resolve the
theoretical dispute over the cause(s) of inflation. He
cites a correlation coefficient of 0.848 between the rate
of inflation and the rate of change of “excess” money
supply two years before.
We were rather puzzled by this for we have always
believed that it was Scottish Dysentery that kept prices
down (with a one-year lag, of course). To reassure
ourselves, we calculated the correlation between the
following sets of figures:
219
Incorrect Conclusions: Causality
Correlation: -0.868
Cases of Dysentery in Scotland ('000)   Increase in prices one year later (%)
1966 4.3 1967 2.5
1967 4.5 1968 4.7
1968 3.7 1969 5.4
1969 5.3 1970 6.4
1970 3.0 1971 9.4
1971 4.1 1972 7.1
1972 3.2 1973 9.2
1973 1.6 1974 16.1
1974 1.5 1975 24.2
Source: Grenville and Macfarlane (1988) 220
A Final Warning
We have to inform you that the correlation coefficient
is -0.868 (which is statistically slightly more significant
than that obtained by Professor Mills). Professor Mills says
that “Until … a fallacy in the figures [can be shown], I
think Mr Rees-Mogg has fully established his point.” By
the same argument, so have we.
Yours faithfully.
G. E. J. LLEWELLYN, R. M. WITCOMB.
Faculty of Economics and Politics,
221
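The dysentery correlation quoted in the exchange can be recomputed directly from the table above (a quick numpy check using the figures as printed on the slide):

```python
import numpy as np

# Cases of dysentery in Scotland ('000) vs the price increase one year later,
# from the table: a strong, and of course entirely spurious, correlation.
dysentery = [4.3, 4.5, 3.7, 5.3, 3.0, 4.1, 3.2, 1.6, 1.5]
inflation = [2.5, 4.7, 5.4, 6.4, 9.4, 7.1, 9.2, 16.1, 24.2]
r = np.corrcoef(dysentery, inflation)[0, 1]
print(round(r, 2))  # about -0.87, close to the -0.868 quoted in the letter
```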