
Chapter 4

Linear Regression with

One Regressor

('Simple Regression')

Outline

1. Scatterplots are Pictures of 1→1 association

2. Correlation gives a number for 1→1 association

3. Simple Regression is Better than Correlation

2

Outline

1. Scatterplots are Pictures of 1→1 association

2. Correlation gives a number for 1→1 association

3. Simple Regression is Better than Correlation

3

Does Having Too Many Students

Per Teacher Lower Test Marks?

?

4

Scatterplots are Pictures of 1→1

Association

5

Is there a Number for This

Relationship?

6

What about Mean? Variance?

7

Treat this as a Dataset on Student-

teacher ratio (STR), called „X‟

8

Treat this as a Dataset on Student-

teacher ratio (STR), called „X‟

Imagine Falling Rain

9

Collapse onto „X‟ (horizontal) axis

10

Ignore „Y‟ (vertical)

11

Sample Mean

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

12

Sample Variance

$S_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

13

Standard error/deviation is the

square root of the variance

$S_x = \sqrt{S_x^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

14

(It is very close to a typical departure of x from the mean:

'standard' = 'typical'; 'deviation/error' = departure from the mean.)

$S_x \approx \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$

15
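As a quick illustration of the three formulas above, here is a minimal Python sketch; the STR-like numbers are made up for illustration and are not the lecture's dataset.

```python
import numpy as np

# Hypothetical STR-like values, for illustration only (not the lecture's data)
x = np.array([19.5, 21.0, 18.7, 22.3, 20.1, 17.9])

n = len(x)
x_bar = x.sum() / n                         # sample mean
s2_x = ((x - x_bar) ** 2).sum() / (n - 1)   # sample variance (divide by n - 1)
s_x = np.sqrt(s2_x)                         # standard deviation

print(x_bar, s2_x, s_x)
```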

Treat this as a Dataset on Test

score, called „Y‟

16

Collapse onto Y axis

17

Calculate Mean and Variance

y

2

yS

18

Is there a Number for This

Relationship? Not Yet

19

Break up All Observations

into 4 Quadrants

[Scatterplot split into quadrants I-IV about the point (x-bar, y-bar)]

20

Fill In the Signs of Deviations from

Means for Different Quadrants

Quadrant I: $x_i - \bar{x} > 0$, $y_i - \bar{y} > 0$

Quadrant II: $x_i - \bar{x} < 0$, $y_i - \bar{y} > 0$

Quadrant III: $x_i - \bar{x} < 0$, $y_i - \bar{y} < 0$

Quadrant IV: $x_i - \bar{x} > 0$, $y_i - \bar{y} < 0$

22

The Products are Positive in I and III

In quadrants I and III: $(x_i - \bar{x})(y_i - \bar{y}) > 0$

23

The Products are Negative in II and IV

In quadrants II and IV: $(x_i - \bar{x})(y_i - \bar{y}) < 0$

24

Sample Covariance, Sxy, describes

the Relationship between X and Y

$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

If Sxy > 0, most data lie in I and III:

this concurs with our visual common sense because

it looks like a positive relationship.

If Sxy < 0, most data lie in II and IV:

this concurs with our visual common sense because

it looks like a negative relationship.

If Sxy = 0, the data are 'evenly spread' across I-IV.

25

What About Our Data?


26

Large Negative Sxy

27

Large Positive Sxy

y

28

Zero Sxy

y

29

Our Data has a Mild Negative

Covariance Sxy<0


30

Correlation, rXY, is a Measure of

Relationship that is Unit-less

$r_{XY} = \frac{S_{XY}}{S_X S_Y}$

It can be proved that it lies between -1 and 1: $-1 \le r_{XY} \le 1$.

It has the same sign as SXY, so ...

31
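A small sketch of the covariance and correlation formulas above, again on made-up (x, y) pairs rather than the lecture data; np.corrcoef is only used as a cross-check.

```python
import numpy as np

# Hypothetical paired data (STR-like x, test-score-like y), for illustration only
x = np.array([19.5, 21.0, 18.7, 22.3, 20.1, 17.9])
y = np.array([660.0, 648.0, 672.0, 641.0, 655.0, 668.0])

n = len(x)
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)   # sample covariance
r_xy = s_xy / (x.std(ddof=1) * y.std(ddof=1))              # unit-free correlation

print(s_xy, r_xy)
print(np.corrcoef(x, y)[0, 1])   # cross-check: should equal r_xy
```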

Mild Negative Correlation

rXY= -.2264


32

Outline

1. Scatterplots are Pictures of 1→1 association

2. Correlation gives number for 1→1 association

3. Simple Regression is Better than Correlation

33

Outline

1. Scatterplots are Pictures of 1→1 association

2. Correlation gives number for 1→1 association

3. Simple Regression is Better than Correlation

But…

How much does Y change when X changes?

What is a good guess of Y if X =25?

What does correlation = -.2264 mean anyway?

34

Outline

1. Scatterplots are Pictures of 1→1 association

2. Correlation gives a number for 1→1 association

3. Simple Regression is Better than Correlation

35

What is Simple Regression?

Simple regression allows us to answer all three

questions:

“How much does Y change when X changes?”

“What is a good guess of Y if X =25?”

“What does correlation = -.2264 mean anyway?”

…by fitting a straight line to data

on two variables, Y and X.

$\hat{Y} = b_0 + b_1 X$

36

$\hat{Y} = b_0 + b_1 X$

37

We Get our Guessed Line Using

„(Ordinary) Least Squares [OLS]‟

OLS minimises the squared difference between a

regression line and the observations.

We can view these squared differences as squares.

This task then becomes the minimisation of the

area of the squares.

Applet: http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html

38
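For concreteness, a minimal sketch of the least squares calculation the applet animates, using the closed-form OLS formulas on the same made-up pairs as before (not the California data).

```python
import numpy as np

# Closed-form OLS slope and intercept on made-up (x, y) pairs
x = np.array([19.5, 21.0, 18.7, 22.3, 20.1, 17.9])
y = np.array([660.0, 648.0, 672.0, 641.0, 655.0, 668.0])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

u_hat = y - (b0 + b1 * x)        # residuals; OLS minimises the sum of their squares
print(b0, b1, (u_hat ** 2).sum())
```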

Measures of Fit

(Section 4.3)

The regression R2 can be seen from the applet

http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html

It is the proportional reduction in the sum of squares as one

moves from modeling Y by a constant (with LS estimator

being the sample mean, and sum of squares equal to the 'total sum

of squares') to a line: R2 = [TSS - 'sum of squares'] / TSS.

If the model fits perfectly, 'sum of squares' = 0 and R2 = 1.

If the model does no better than a constant, it equals TSS and R2 = 0.

The standard error of the regression (SER) measures the

magnitude of a typical regression residual in the units of Y.

39

The Standard Error of the

Regression (SER)

The SER measures the spread of the distribution of u. The SER

is (almost) the sample standard deviation of the OLS residuals:

SER = $\sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (\hat{u}_i - \bar{\hat{u}})^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat{u}_i^2}$

(the second equality holds because $\bar{\hat{u}} = \frac{1}{n}\sum_{i=1}^{n} \hat{u}_i = 0$).

40

SER = $\sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat{u}_i^2}$

The SER:

has the units of u, which are the units of Y

measures the average "size" of the OLS residual (the average

"mistake" made by the OLS regression line)

Don't worry about the n-2 (instead of n-1 or n) – the reason is

too technical, and doesn't matter if n is large.

41
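A short sketch tying R2 and the SER to the residuals of the fitted line; the data are the made-up pairs from the earlier sketch, so the numbers are illustrative only.

```python
import numpy as np

# R-squared and SER from the residuals of the fitted line (made-up data again)
x = np.array([19.5, 21.0, 18.7, 22.3, 20.1, 17.9])
y = np.array([660.0, 648.0, 672.0, 641.0, 655.0, 668.0])
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)

n = len(y)
ssr = (u_hat ** 2).sum()             # 'sum of squares' left after fitting the line
tss = ((y - y.mean()) ** 2).sum()    # total sum of squares (constant-only model)
r2 = (tss - ssr) / tss               # proportional reduction in the sum of squares
ser = np.sqrt(ssr / (n - 2))         # standard error of the regression

print(r2, ser)
```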

How the Computer Did It (SW Key Concept 4.2)

42

The OLS Line has a Small Negative

Slope

Estimated slope = $\hat{\beta}_1$ = –2.28

Estimated intercept = $\hat{\beta}_0$ = 698.9

Estimated regression line: TestScore = 698.9 – 2.28 STR

43

Interpretation of the estimated slope and intercept

Test score= 698.9 – 2.28 STR

Districts with one more student per teacher on average have

test scores that are 2.28 points lower.

That is, $\Delta TestScore / \Delta STR = -2.28$

The intercept (taken literally) means that, according to this

estimated line, districts with zero students per teacher would

have a (predicted) test score of 698.9.

This interpretation of the intercept makes no sense – it

extrapolates the line outside the range of the data – here, the

intercept is not economically meaningful.

44

Remember Calculus? Test score = 698.9 – 2.28 STR

Differentiation gives $\frac{d\,TestScore}{d\,STR} = -2.28$

'd' means 'infinitely small change', but for a 'very small

change', called '$\Delta$', it will still be pretty close to the truth. So,

an approximation is:

$\Delta TestScore / \Delta STR = -2.28$

How to interpret this? Take the denominator over the other side:

$\Delta TestScore = -2.28\,\Delta STR$

So, if STR goes up by one, Test score falls by 2.28.

If STR goes up by, say, 20, Test score falls by 2.28(20) = 45.6

45

Predicted values & residuals:

One of the districts in the data set is Antelope, CA, for which

STR = 19.33 and Test Score = 657.8

predicted value: $\hat{Y}_{Antelope}$ = 698.9 – 2.28 × 19.33 = 654.8

residual: $\hat{u}_{Antelope}$ = 657.8 – 654.8 = 3.0

46
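The Antelope calculation can be reproduced directly from the reported coefficients, as in this small sketch.

```python
# Predicted value and residual for Antelope, CA, from the estimated line
# TestScore-hat = 698.9 - 2.28 * STR
b0, b1 = 698.9, -2.28
str_antelope, score_antelope = 19.33, 657.8

y_hat = b0 + b1 * str_antelope    # about 654.8
u_hat = score_antelope - y_hat    # about 3.0
print(round(y_hat, 1), round(u_hat, 1))
```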

R2 and SER evaluate the Model

TestScore = 698.9 – 2.28 STR, R2 = .05, SER = 18.6

By using STR, you only reduce the sum of squares by 5%

compared with just ‘modeling’ Test score by its average. That is,

STR only explains a small fraction of the variation in test scores.

The standard residual size is about 19, which looks large.

47

48

Seeing R2 and the SER in EVIEWs

Dependent Variable: TESTSCR

Method: Least Squares

Sample: 1 420

Included observations: 420

TESTSCR=C(1)+C(2)*STR

Coefficient Std. Error t-Statistic Prob.

C(1) 698.9330 9.467491 73.82451 0.0000

C(2) -2.279808 0.479826 -4.751327 0.0000

R-squared 0.051240 Mean dependent var 654.1565

Adjusted R-squared 0.048970 S.D. dependent var 19.05335

S.E. of regression 18.58097 Akaike info criterion 8.686903

Sum squared resid 144315.5 Schwarz criterion 8.706143

Log likelihood -1822.250 Durbin-Watson stat 0.129062

Estimated line: TestScore = 698.9 – 2.28 STR

Recap

1. Scatterplots are Pictures of 1→1 association

2. Correlation gives number for 1→1 association

3. Simple Regression is Better than Correlation

But…

How much does Y change when X changes?

What is a good guess of Y if X =25?

What does correlation = -.2264 mean anyway?

49

Recap

1. Scatterplots are Pictures of 1→1 association

2. Correlation gives number for 1→1 association

3. Simple Regression is Better than Correlation

But…

How much does Y change when X changes? b1 x

What is a good guess of Y if X =25? b0+b1(25)

What does correlation = -.2264 mean anyway?

50

Outline

1. Scatterplots are Pictures of 1→1 association

2. Correlation gives number for 1→1 association

3. Simple Regression is Better than Correlation

But…

How much does Y change when X changes? b1 x

What is a good guess of Y if X =25? b0+b1(25)

What does correlation = -.2264 mean anyway?

Surprise: $R^2 = r_{XY}^2$

51

Outline

1. Scatterplots are Pictures of 1→1 association

2. Correlation gives number for 1→1 association

3. Simple Regression is Better than Correlation

But…

How much does Y change when X changes? b1 x

What is a good guess of Y if X =25? b0+b1(25)

What does correlation = -.2264 mean anyway?

Surprise: $R^2 = r_{XY}^2$

52

Chapter 5

Regression with a Single

Regressor: Hypothesis Tests

What is Simple Regression?

We‟ve used Simple regression as a means of

describing an apparent relationship between two

variables. This is called descriptive statistics.

Simple regression also allows us to estimate, and

make inferences, under the OLS assumptions,

about the slope coefficients of an underlying

model. We do this, as before, by fitting a straight

line to data on two variables, Y and X. This is

called inferential statistics.

54

The Underlying Model

(or 'Population Regression Function')

$Y_i = \beta_0 + \beta_1 X_i + u_i$,  i = 1, ..., n

X is the independent variable or regressor

Y is the dependent variable

$\beta_0$ = intercept

$\beta_1$ = slope

$u_i$ = the regression error or residual

The regression error consists of omitted factors, or possibly

measurement error in the measurement of Y. In general, these

omitted factors are other factors that influence Y, other than

the variable X.

55

What Does it Look Like in This Case?

$Y_i = \beta_0 + \beta_1 X_i + u_i$,  i = 1, ..., n

X is the STR

Y is the Test score

$\beta_0$ = intercept

$\beta_1 = \Delta TestScore / \Delta STR$

= change in test score for a unit change in STR

If we also guess $\beta_0$ we can also predict Test score when STR

has a particular value.

Clearly, we want good guesses (estimates) of $\beta_0$ and $\beta_1$.

56

A Picture is Worth 1000 Words

57

From Now on we Use 'b' or '^' to Signify our Guesses, or 'Estimates', of the Slope or Intercept, and $\hat{u}$ for guesses of u. We never see the True Line or the u's.

[Figure: scatterplot with the fitted line $b_0 + b_1 x$ and residuals $\hat{u}_1$, $\hat{u}_2$ marked]

59

Our Estimators are Really Random

Least squares estimators have a distribution; they are different every time you take a different sample (like an average of 5 heights, or 7 exam marks).

The estimators are Random Variables. A random variable generates numbers with a central measure called a mean and a volatility called the standard error.

Least squares estimators b0 & b1 have means $\beta_0$ & $\beta_1$.

Hypothesis testing:

e.g. How to test if the slope is zero, or -37?

Confidence intervals:

e.g. What is a reasonable range of guesses for the slope $\beta_1$?

60

Outline

1. OLS Assumptions

2. OLS Sampling Distribution

3. Hypothesis Testing

4. Confidence Intervals

61

Outline

1. OLS Assumptions

2. OLS Sampling Distribution

3. Hypothesis Testing

4. Confidence Intervals

62

Outline

1. OLS Assumptions (Very Technical)

2. OLS Sampling Distribution

3. Hypothesis Testing

4. Confidence Intervals

63

Outline

1. OLS Assumptions (When will OLS be ‘good’?)

2. OLS Sampling Distribution

3. Hypothesis Testing

4. Confidence Intervals

64

Estimator Distributions Depend on

Least Squares Assumptions

A key part of the model is the assumptions made

about the residuals ut for t=1,2….n.

1. E(ut)=0

2. E(ut²) = σ² = SER² (note: not σt²; it is invariant over t)

3. E(utus)=0 t≠s

4. Cov(Xt, ut)=0

5. ut~Normal

65

SW has different assumptions;

Use mine for any Discussions

The conditional distribution of u given X has

mean zero, that is, E(u|X = x) = 0. (a combination

of 1. and 4.)

(Xi,Yi), i =1,…,n, are i.i.d. (unnecessary in many

applications)

Large outliers in X and/or Y are rare. (technical

assumption)

66

How reasonable

are these assumptions?

To answer, we need to understand them.

1. E(ut)=0

2. E(ut²) = σ² = SER² (note: not σt²; it is invariant over t)

3. E(utus)=0 t≠s

4. Cov(Xt, ut)=0

5. ut~Normal

67

It‟s All About u

u is everything left out of the model

E(ut)=0

1. E(ut²) = σ² = SER² (note: not σt²; it is invariant over t)

2. E(utus)=0 t≠s

3. Cov(Xt, ut)=0

4. ut~Normal

68

1. E(ut)=0 is not a big deal

Providing the model has a constant, this is not a

restrictive assumption.

If „all the other influences‟ don‟t have a zero

mean, the estimated constant will just adjust to

the point u does have a zero mean.

Really, B0+u could be thought of as everything

else that affects y apart from x

69

2. E(ut2)=σ2 =SER2 is Controversial

If this assumption holds, the errors are said to be

homoskedastic

If it is violated, the errors are said to be

heteroskedastic (hetero for short)

There are many conceivable forms of hetero, but

perhaps the most common is when the variance

depends upon the value of x

70

Hetero Related to X is very Common

HeteroskedasticHomoskedastic

71

Our Data Looks OK, but Don‟t be Complacent

HeteroskedasticHomoskedastic

72

3. E(utus)=0 t≠s

A violation of this is called autocorrelation

If the underlying model generates data for a time

series, it is highly likely that „left out‟ variables

will be autocorrelated (i.e. z depends on lagged z;

most time series are like this) and so u will be too.

But if the model describes a cross-section

assumption 3 is likely to hold.

73

Aside: Hetero and Auto are not a Disaster

Hetero plagues cross-sectional data, Auto plagues

time series.

Remarkably, Heteroskedasticity and

Autocorrelation do not bias the Least Squares

Estimators.

This is a very strange result!

74

Hetero Doesn‟t Bias

y

Draw a least squares

line through these points

x

75

If You Could See the True Line

You‟d Realize hetero is bad for OLS

. y (a) homoskedasticity (b) heteroskedasticity

x

76

But OLS is Still Unbiased!

In case (b), OLS is still unbiased because the next draw

is just as likely to find the third error above the true

line, pulling up the (negative) slope of the least squares

line. On average, the true line would be revealed

with many samples.

[Figure: panels (a) and (b) repeated]

But we will make an adjustment to our analysis later: OLS is no

longer 'best', which means minimum variance.

77

Conquer Hetero and Auto with Just

One Click

SW recommend you correct standard errors for hetero and auto. In EVIEWs you do this by:

estimate/options/heteroskedasticity consistent coefficient covariance/ input: leave white ticked if only worried about hetero. Tick Newey West to correct for both.

Because OLS is unbiased, the correction only occurs for the standard errors.

Sometimes, we will use standard errors corrected in this way

78

4. Cov(Xt, ut)=0

This will be discussed extensively next lecture

When there is only one variable in a regression, it is highly likely that that variable will be correlated with a variable that is left out of the model, which is implicitly in the error term.

Before proceeding with assumption 5, it is worth stating that 1. – 4. are all that are required to prove the so-called Gauss-Markov Theorem, that OLS is Best Linear Unbiased Estimator (SW Section 5.5)

79

5. ut~Normal

With this assumption, OLS is minimum volatility

estimator among all consistent estimators.

Many variables are Normal

http://rba.gov.au/Statistics/AlphaListing/index.html

The assumption „delivers‟ a known distribution of OLS

estimators (a „t‟ distribution) if n is small. But if n is large

(>30) the OLS estimators become Normal, so it is

unnecessary. This is due to the Central Limit Theorem

http://onlinestatbook.com/stat_sim/sampling_dist/index.html

http://www.rand.org/statistics/applets/clt.html

80

Assessment of Assumptions

1. E(ut) = 0   harmless if the model has a constant

2. E(ut²) = σ² = SER²   not too serious, since

3. E(ut us) = 0, t ≠ s   OLS is still unbiased

4. Cov(Xt, ut) = 0   serious – see next lecture

5. ut ~ Normal   nice property to have, but if

the sample size is big it doesn't matter

We assume 2. and 3. hold, or just adjust the standard errors. We'll also assume n is large and we'll always keep a constant, so 5. and 1. are not relevant. This lecture, we assume 4. holds.

81

Outline

1. OLS Assumptions

2. OLS Sampling Distribution

3. Hypothesis Testing

4. Confidence Intervals

82

83

With OLS Assumptions the CLT

Gives Us the Distribution of $\hat{\beta}_1$

$\hat{\beta}_1 \sim N\!\left(\beta_1,\ SE(\hat{\beta}_1)^2\right)$

83

t-distribution (small n) vs Normal

[Figure: density of the t-distribution (small n) plotted against the standard Normal]

84

85

With OLS Assumptions the CLT

Gives Us the Distribution of $\hat{\beta}_1$

$\hat{\beta}_1 \sim N\!\left(\beta_1,\ SE(\hat{\beta}_1)^2\right)$

85


87

EVIEWs output gives us SE(B1)

Dependent Variable: TESTSCR

Method: Least Squares

Date: 06/04/08 Time: 22:13

Sample: 1 420

Included observations: 420 Coefficient Std. Error t-Statistic Prob.

C 698.9330 9.467491 73.82451 0.0000

STR -2.279808 0.479826 -4.751327 0.0000

R-squared 0.051240 Mean dependent var 654.1565

Adjusted R-squared 0.048970 S.D. dependent var 19.05335

S.E. of regression 18.58097 Akaike info criterion 8.686903

Sum squared resid 144315.5 Schwarz criterion 8.706143

Log likelihood -1822.250 Hannan-Quinn criter. 8.694507

F-statistic 22.57511 Durbin-Watson stat 0.129062

Prob(F-statistic) 0.000003

87

Outline

1. OLS Assumptions

2. OLS Sampling Distribution

3. Hypothesis Testing

4. Confidence Intervals

88

89

EVIEWs Output Can be Summarized

in Two Lines

Put standard errors in parentheses below the estimated

coefficients to which they apply.

TestScore = 698.9 – 2.28 STR, R2 = .05, SER = 18.6

(10.4) (0.52)

This expression gives a lot of information

The estimated regression line is

TestScore = 698.9 – 2.28 STR

The standard error of $\hat{\beta}_0$ is 10.4

The standard error of $\hat{\beta}_1$ is 0.52

The R2 is .05; the standard error of the regression is 18.6


89

90

We Only Need Two Numbers For

Hypothesis Testing

Put standard errors in parentheses below the estimated

coefficients to which they apply.

TestScore = 698.9 – 2.28 STR, R2 = .05, SER = 18.6

(10.4) (0.52)

This expression gives a lot of information

The estimated regression line is

TestScore = 698.9 – 2.28 STR

The standard error of $\hat{\beta}_0$ is 10.4

The standard error of $\hat{\beta}_1$ is 0.52

The R2 is .05; the standard error of the regression is 18.6


90

91

Remember Hypothesis Testing?

1. H0 = null hypothesis = „status quo‟ belief = what you believe without good reason to doubt it.

2. H1 = alternative hypothesis = what you believe if you reject H0

3. Collect evidence and create a calculated Test Statistic

4. Decide on a significance level = test size = Prob(type I error) = α

5. The test size defines a rejection region and a critical value (the changeover point)

6. Reject H0 if Test Statistic lies in Rejection Region

92

Hypothesis Testing and the Standard

Error of $\hat{\beta}_1$ (Section 5.1)

The objective is to test a hypothesis, like $\beta_1 = 0$, using data – to

reach a tentative conclusion whether the (null) hypothesis is

correct or incorrect.

General setup

Null hypothesis and two-sided alternative:

$H_0: \beta_1 = \beta_{1,0}$ vs. $H_1: \beta_1 \neq \beta_{1,0}$

where $\beta_{1,0}$ is the hypothesized value under the null.

Null hypothesis and one-sided alternative:

$H_0: \beta_1 = \beta_{1,0}$ vs. $H_1: \beta_1 < \beta_{1,0}$

92

93

General approach: construct the t-statistic, and compute the p-value (or

compare to the N(0,1) critical value)

In general: t = (estimator – hypothesized value) / (standard error of the estimator)

where the SE of the estimator is the square root of an

estimator of the variance of the estimator.

For testing $\beta_1$:  $t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$,

where SE($\hat{\beta}_1$) is the square root of an estimator of the variance

of the sampling distribution of $\hat{\beta}_1$.

Comparing the distance between the estimate and your hypothesized

value is obvious; doing it in units of volatility is less so.

93

94

Summary: To test $H_0: \beta_1 = \beta_{1,0}$ vs. $H_1: \beta_1 \neq \beta_{1,0}$:

Construct the t-statistic

$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$

Reject at the 5% significance level if |t| > 1.96

This procedure relies on the large-n approximation; typically

n = 30 is large enough for the approximation to be excellent.

94
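A minimal sketch of this test using the lecture's reported estimate and standard error:

```python
# Two-sided t-test of H0: beta1 = 0, using the lecture's estimate and standard error
beta1_hat, se_beta1 = -2.28, 0.52
beta1_null = 0.0

t = (beta1_hat - beta1_null) / se_beta1   # about -4.38
reject_at_5pct = abs(t) > 1.96            # large-n (normal) critical value
print(round(t, 2), reject_at_5pct)
```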

95

p-values are another method

See textbook pp 72-81, and look up p-value in the index

1. What values of the test statistic would make you more

determined to reject the null than you are now?

2. If the null is true, what is the probability of obtaining

those values? This is the p-value.

“the p-value, also called the significance probability [not in

QBA] is the probability of drawing a statistic at least as

adverse to the null hypothesis as the one you actually

computed in your sample, assuming the null hypothesis is

correct” pg. 73

95

96

p-values are another method

See textbook pp 72-81, and look it up in the index

For a two-sided test, the p-value is p = Pr[|t| > |t_act|] = the probability

in the tails of the normal outside |t_act|;

you reject at the 5% significance level if the p-value is < 5%

(or < 1% or < 10%, depending on the test size)

REJECT H0 IF P-VALUE < α

96
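A sketch of the p-value calculation under the large-n normal approximation (scipy is assumed to be available):

```python
from scipy.stats import norm

# Two-sided p-value under the large-n standard normal approximation
t_act = -4.38
p_value = 2 * (1 - norm.cdf(abs(t_act)))   # probability in both tails beyond |t_act|
print(p_value)   # roughly 1e-5, so reject H0 at the 5% (and 1%) level
```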

97

Example: Test Scores and STR,

California data

Estimated regression line: TestScore = 698.9 – 2.28 STR

Regression software reports the standard errors:

SE($\hat{\beta}_0$) = 10.4    SE($\hat{\beta}_1$) = 0.52

t-statistic testing $\beta_{1,0} = 0$:  $t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} = \frac{-2.28 - 0}{0.52} = -4.38$

The 1% two-sided critical value is 2.58, so we reject the null

at the 1% significance level.

Alternatively, we can compute the p-value...

(The standard errors are corrected for heteroskedasticity)

97

98

The p-value based on the large-n standard normal approximation

to the t-statistic is 0.00001 ($10^{-5}$)

98

Hypothesis Testing Can be Tricky

Dependent Variable: TESTSCR

Method: Least Squares

Date: 06/05/08 Time: 22:35

Sample: 1 420

Included observations: 420 Coefficient Std. Error t-Statistic Prob.

C 655.1223 1.126888 581.3553 0.0000

COMPUTER -0.003183 0.002106 -1.511647 0.1314

„Prob‟ only equals p-value for a two sided test

99

Try These Hypotheses

(a) H0: B1 = 0, H1: B1 > 0, with α = .05 using the critical-values approach

(b) H0: B1 = 0, H1: B1 < 0, with α = .05 using the critical-values approach

(c) H0: B1 = 0, H1: B1 ≠ 0, with α = .05 using the critical-values approach

(d) H0: B1 = 0, H1: B1 > 0, with α = .05 using the p-value approach

(e) H0: B1 = 0, H1: B1 < 0, with α = .05 using the p-value approach

(f) H0: B1 = 0, H1: B1 ≠ 0, with α = .05 using the p-value approach

(g) H0: B1 = -.05, H1: B1 < -.05, with α = .10

100

(a) H0: B1 = 0, H1: B1 > 0, with α = .05 using the critical-values approach

101

(b) H0: B1 = 0, H1: B1 < 0, with α = .05 using the critical-values approach

102

(c) H0: B1 = 0, H1: B1 ≠ 0, with α = .05 using the critical-values approach

103

(d) H0: B1 = 0, H1: B1 > 0, with α = .05 using the p-value approach

This is very hard to do with p-values!

104

(e) H0: B1 = 0, H1: B1 < 0, with α = .05 using the p-value approach

105

(f) H0: B1 = 0, H1: B1 ≠ 0, with α = .05 using the p-value approach

106

(g) H0: B1 = -0.05, H1: B1 < -0.05, with α = .10

107

Outline

1. OLS Assumptions

2. OLS Sampling Distribution

3. Hypothesis Testing

4. Confidence Intervals

108

109

With OLS Assumptions the CLT

Gives Us the Distribution of $\hat{\beta}_1$

$\hat{\beta}_1 \sim N\!\left(\beta_1,\ SE(\hat{\beta}_1)^2\right)$

109

110

Confidence

Intervals

$\hat{\beta}_1 \sim N\!\left(\beta_1,\ SE(\hat{\beta}_1)^2\right)$

$\hat{\beta}_1 - \beta_1 \sim N\!\left(0,\ SE(\hat{\beta}_1)^2\right)$

$\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim N(0,1)$

110

95% Confidence Intervals Catch the

True Parameter 95% of the Time

$\Pr\!\left(-1.96 \le \tfrac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \le 1.96\right) = .95$

$\Pr\!\left(-1.96\,SE(\hat{\beta}_1) \le \hat{\beta}_1 - \beta_1 \le 1.96\,SE(\hat{\beta}_1)\right) = .95$

$\Pr\!\left(\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1)\right) = .95$

So $\beta_1$ will, with probability 0.95, be captured by the random

interval $\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)$.

http://bcs.whfreeman.com/bps4e/content/cat_010/applets/confidenceinterval.html

111

Confidence Intervals are

Reasonable Ranges

If we cannot reject $H_0: \beta_1 = \beta_{1,0}$ in favour of $H_1$ at, say, 5%, it implies

$-1.96 \le \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} \le 1.96$,

which implies $\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1) \le \beta_{1,0} \le \hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1)$.

But this just says that $\beta_{1,0}$ must lie in a 95% CI.

Going the other way, we can define a (1 - α) Confidence Interval as

the range of values that could not be rejected as nulls in a

two-sided test with significance (test size) α.

112

113

Confidence interval example: Test Scores and STR

Estimated regression line: TestScore = 698.9 – 2.28 STR

SE($\hat{\beta}_0$) = 10.4    SE($\hat{\beta}_1$) = 0.52

95% confidence interval for $\beta_1$:

$\{\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)\}$ = {–2.28 ± 1.96 × 0.52}

= (–3.30, –1.26)

113
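The interval can be reproduced from the reported estimate and standard error:

```python
# 95% confidence interval for beta1 from the reported estimate and standard error
beta1_hat, se_beta1 = -2.28, 0.52

lower = beta1_hat - 1.96 * se_beta1   # about -3.30
upper = beta1_hat + 1.96 * se_beta1   # about -1.26
print(round(lower, 2), round(upper, 2))
```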

If You Make 1→1 Associations, Use

Simple Regression, not Correlation

1. OLS Assumptions

2. OLS Sampling Distribution

3. Hypothesis Testing

4. Confidence Intervals

But be careful of the Simple Regression assumption

Cov(Xt, ut)=0

114

Chapter 6

Introduction to

Multiple Regression

115

116

Outline

1. Omitted variable bias

2. Multiple regression and OLS

3. Measures of fit

4. Sampling distribution of the OLS estimator

116

117

It‟s all about u

(SW Section 6.1)

The error u arises because of factors that influence Y but are not

included in the regression function; so, there are always omitted

variables.

Sometimes, the omission of those variables can lead to bias in the

OLS estimator. This occurs because the assumption

4. Cov(Xt, ut)=0

Is violated

117

118

Outline

1. Omitted variable bias

2. Multiple regression and OLS

3. Measures of fit

4. Sampling distribution of the OLS estimator

118

119

Omitted variable bias=OVB

The bias in the OLS estimator that occurs as a result of an

omitted factor is called omitted variable bias.

Let $y = \beta_0 + \beta_1 x + u$ and let u = f(Z)

omitted variable bias is a problem if the omitted factor “Z” is:

1. A determinant of Y (i.e. Z is part of u); and

2. Correlated with the regressor X (i.e. corr(Z,X) ≠ 0)

Both conditions must hold for the omission of Z to result in

omitted variable bias.

119

What Causes Long Life?

Gapminder (http://www.gapminder.org/world) is an online

applet that contains demographic information about each

country in the world.

Suppose that we are interested in predicting life

expectancy, and think that both income per capita and the

number of physicians per 1000 people would make good

indicators.

Our first step would be to graph these predictors against

life expectancy

We find that both are positively correlated with life expectancy

120120

…Doctors or Income or Both?

Simple Linear Regression only allows us to use

one of these predictors to estimate life expectancy.

But income per capita is correlated with the

number of physicians per 1000 people. Suppose

the truth is:

Life=B0+B1Income+B2Doctors+u but you run

Life=B0+B1Income+u* (u*=B2Doctors+u)

121121

OVB=„Double Counting‟

B1 is the impact of Income on Life, holding

everything else constant including the residual

But if correlation exists between the Doctors (in

the residual) and income (rIncDoct≠0 ), and, if the

true impact of Doctors (B2≠0) is non-zero, then B1

counts both effects – it „double counts‟

Life=B0+B1Income+u* (u*=B2Doctors+u)

122122

123

Our Test score Reg has OVB

In the test score example:

1. English language deficiency (whether the student is learning

English) plausibly affects standardized test scores: Z is a

determinant of Y.

2. Immigrant communities tend to be less affluent and thus

have smaller school budgets – and higher STR: Z is

correlated with X.

Accordingly, $\hat{\beta}_1$ is biased.

123

What is the bias? We have a formula

STR is larger for those classes with a higher PctEL (both being a feature of poorer areas), so the correlation between STR and PctEL will be positive.

PctEL appears in u with a negative sign in front of it – higher PctEL leads to lower scores. Therefore the correlation between STR and u [which contains minus PctEL] must be negative (ρXu < 0).

Here is the formula (standard deviations are always positive):

$Bias(\hat{\beta}_1) = \rho_{Xu}\,\frac{\sigma_u}{\sigma_X} < 0$

The volatility of the error and of the included variable matter.

So the coefficient on the student-teacher ratio is negatively biased by the exclusion of the percentage of English learners. It is 'too big' in absolute value.

124

125

Including PctEL Solves Problem

Some ways to overcome omitted variable bias

1. Run a randomized controlled experiment in which treatment

(STR) is randomly assigned: then PctEL is still a determinant

of TestScore, but PctEL is uncorrelated with STR. (But this is

unrealistic in practice.)

2. Adopt the “cross tabulation” approach, with finer gradations

of STR and PctEL – within each group, all classes have the

same PctEL, so we control for PctEL (But soon we will run

out of data, and what about other determinants like family

income and parental education?)

3. Use a regression in which the omitted variable (PctEL) is no

longer omitted: include PctEL as an additional regressor in a

multiple regression.

125

126

Outline

1. Omitted variable bias

2. Multiple regression and OLS

3. Measures of fit

4. Sampling distribution of the OLS estimator

126

127

The Population Multiple Regression

Model (SW Section 6.2)

Consider the case of two regressors:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$,  i = 1, ..., n

Y is the dependent variable

X1, X2 are the two independent variables (regressors)

(Yi, X1i, X2i) denote the i-th observation on Y, X1, and X2.

$\beta_0$ = unknown population intercept

$\beta_1$ = effect on Y of a change in X1, holding X2 constant

$\beta_2$ = effect on Y of a change in X2, holding X1 constant

$u_i$ = the regression error (omitted factors)

127

128

Partial Derivatives in Multiple

Regression = Cet. Par. in Economics

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$,  i = 1, ..., n

We can use calculus to interpret the coefficients:

$\beta_1 = \frac{\Delta Y}{\Delta X_1}$, holding X2 constant = Ceteris Paribus

$\beta_2 = \frac{\Delta Y}{\Delta X_2}$, holding X1 constant = Ceteris Paribus

$\beta_0$ = predicted value of Y when X1 = X2 = 0.

128

129

The OLS Estimator in Multiple

Regression (SW Section 6.3)

With two regressors, the OLS estimator solves:

$\min_{b_0, b_1, b_2} \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_{1i} + b_2 X_{2i}) \right]^2$

The OLS estimator minimizes the average squared difference

between the actual values of Yi and the prediction (predicted

value) based on the estimated line.

This minimization problem is solved using calculus.

This yields the OLS estimators of $\beta_0$, $\beta_1$ and $\beta_2$.

129
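A minimal sketch of two-regressor OLS on simulated data (all numbers and variable names here are made up for illustration); np.linalg.lstsq performs the same sum-of-squares minimization.

```python
import numpy as np

# Two-regressor OLS on simulated data (made-up numbers, for illustration only)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(20, 2, n)                    # an STR-like regressor
x2 = rng.normal(15, 10, n)                   # a PctEL-like regressor
y = 690 - 1.0 * x1 - 0.6 * x2 + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), x1, x2])    # constant plus the two regressors
b, *_ = np.linalg.lstsq(X, y, rcond=None)    # minimises the sum of squared residuals
print(b)                                      # estimates of (b0, b1, b2)
```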

130

Multiple regression in EViews

TestScore = 686.0 – 1.10 STR – 0.65 PctEL

More on this printout later…

Dependent Variable: TESTSCR

Method: Least Squares

Sample: 1 420

Included observations: 420

White Heteroskedasticity-Consistent Standard Errors & Covariance

TESTSCR=C(1)+C(2)*STR+C(3)*EL_PCT

Coefficient Std. Error t-Statistic Prob.

C(1) 686.0322 8.728224 78.59930 0.0000

C(2) -1.101296 0.432847 -2.544307 0.0113

C(3) -0.649777 0.031032 -20.93909 0.0000

R-squared 0.426431 Mean dependent var 654.1565

Adjusted R-squared 0.423680 S.D. dependent var 19.05335

S.E. of regression 14.46448 Akaike info criterion 8.188387

Sum squared resid 87245.29 Schwarz criterion 8.217246

Log likelihood -1716.561 Durbin-Watson stat 0.685575

130

131

Outline

1. Omitted variable bias

2. Multiple regression and OLS

3. Measures of fit

4. Sampling distribution of the OLS estimator

131

132

Measures of Fit for Multiple

Regression (SW Section 6.4)

R2 now becomes the square of the correlation coefficient

between y and predicted y.

It is still the proportional reduction in the residual sum of

squares as we move from modeling y with just a sample

mean, to modeling it with a group of variables.

132

133

R2 and $\bar{R}^2$

The R2 is the fraction of the variance explained – same definition

as in regression with a single regressor:

$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$,

where $ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$, $SSR = \sum_{i=1}^{n} \hat{u}_i^2$, $TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$.

The R2 always increases when you add another regressor

(why?) – a bit of a problem for a measure of "fit"

133

134

R2 and $\bar{R}^2$

The $\bar{R}^2$ (the "adjusted R2") corrects this problem by "penalizing"

you for including another regressor – the $\bar{R}^2$ does not necessarily

increase when you add another regressor.

Adjusted R2:

$\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\,\frac{SSR}{TSS} = 1 - \frac{n-1}{n-k-1}\,(1 - R^2)$

Note that $\bar{R}^2 < R^2$; however, if n is large the two will be very

close.

134
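A small helper showing how the adjustment works; k is the number of regressors excluding the constant (a sketch, not tied to the lecture data):

```python
def r2_and_adjusted(y, y_hat, k):
    """R-squared and adjusted R-squared; k = number of regressors (excluding the constant)."""
    n = len(y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ssr / tss
    r2_adj = 1 - (n - 1) / (n - k - 1) * ssr / tss   # penalises extra regressors
    return r2, r2_adj
```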

135

Measures of fit, ctd.

Test score example:

(1) TestScore = 698.9 – 2.28STR,

R2 = .05, SER = 18.6

(2) TestScore = 686.0 – 1.10STR – 0.65PctEL,

R2 = .426, $\bar{R}^2$ = .424, SER = 14.5

What – precisely – does this tell you about the fit of regression

(2) compared with regression (1)?

Why are the R2 and the $\bar{R}^2$ so close in (2)?

135

136

Outline

1. Omitted variable bias

2. Multiple regression and OLS

3. Measures of fit

4. Sampling distribution of the OLS estimator

136

137

Sampling Distribution Depends on

Least Squares Assumptions (SW Section 6.5)

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + u_i$

• E(ut) = 0

1. E(ut²) = σ² = SER² (note: not σt²; it is invariant over t)

2. E(ut us) = 0, t ≠ s

3. Cov(Xt, ut) = 0

4. ut ~ Normal, plus

5. There is no perfect multicollinearity

137

138

Assumption #4: There is no perfect multicollinearity

Perfect multicollinearity is when one of the regressors is an

exact linear function of the other regressors.

Example: Suppose you accidentally include STR twice:

138

139

Perfect multicollinearity is when one of the regressors is an

exact linear function of the other regressors.

In such a regression (where STR is included twice), 1 is the

effect on TestScore of a unit change in STR, holding STR

constant (???)

The Standard Errors become Infinite when perfect

multicollinearity exists

139

140

OLS Wonder Equation

$SE(\hat{b}_i) \approx \dfrac{S_{\hat{u}}}{S_{x_i}\,\sqrt{n}\,\sqrt{1 - R^2_{x_i \text{ on other } X}}}$

• Multicollinearity increases $R^2_{x_i \text{ on other } X}$ and therefore

increases the variance of bi

Perfect multicollinearity (R2 = 1) makes

regression impossible

Do not expect a low standard error just because you add more

variables to a regression. The more you add, the higher the

R-squared in the denominator becomes, because it always

rises with extra variables.

140

141

Quality of Slope Estimate ($S_{\hat{u}}$ and R2 fixed)

[Three scatterplot panels: high SE(bi) with n = 6; low SE(bi) with n = 20 and the same spread of xi; low SE(bi) with n = 6 but a larger spread of xi (larger $S^2_{x_i}$)]

141

142

The Sampling Distribution of the

OLS Estimator (SW Section 6.6)

Under the Least Squares Assumptions,

The exact (finite-sample) distribution of $\hat{\beta}_1$ has mean $\beta_1$, and

var($\hat{\beta}_1$) is inversely proportional to n; so too for $\hat{\beta}_2$.

Other than its mean and variance, the exact (finite-n)

distribution of $\hat{\beta}_1$ is very complicated; but for large n...

$\hat{\beta}_1$ is consistent: $\hat{\beta}_1 \xrightarrow{p} \beta_1$ (law of large numbers)

$\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)}$ is approximately distributed N(0,1) (CLT)

So too for $\hat{\beta}_2, \dots, \hat{\beta}_k$

Conceptually, there is nothing new here!

142

143

Multicollinearity, Perfect and

Imperfect (SW Section 6.7)

Some more examples of perfect multicollinearity

The example from earlier: you include STR twice.

Second example: regress TestScore on a constant, D, and

Bel, where: Di = 1 if STR ≤ 20, = 0 otherwise; Beli = 1 if STR

>20,

= 0 otherwise, so Beli = 1 – Di and there is perfect

multicollinearity because Bel+D=1 (the 1 „variable‟ for the

constant)

To fix this, drop the constant

143

144

Perfect multicollinearity, ctd.

Perfect multicollinearity usually reflects a mistake in the

definitions of the regressors, or an oddity in the data

If you have perfect multicollinearity, your statistical software

will let you know – either by crashing or giving an error

message or by “dropping” one of the variables arbitrarily

The solution to perfect multicollinearity is to modify your list

of regressors so that you no longer have perfect

multicollinearity.

144

145

Imperfect multicollinearity

Imperfect and perfect multicollinearity are quite different despite

the similarity of the names.

Imperfect multicollinearity occurs when two or more regressors

are very highly correlated.

Why this term? If two regressors are very highly

correlated, then their scatterplot will pretty much look like a

straight line – they are collinear – but unless the correlation

is exactly 1, that collinearity is imperfect.

145

146

Imperfect multicollinearity, ctd.

Imperfect multicollinearity implies that one or more of the

regression coefficients will be imprecisely estimated.

Intuition: the coefficient on X1 is the effect of X1 holding X2

constant; but if X1 and X2 are highly correlated, there is very

little variation in X1 once X2 is held constant – so the data are

pretty much uninformative about what happens when X1

changes but X2 doesn‟t, so the variance of the OLS estimator

of the coefficient on X1 will be large.

Imperfect multicollinearity (correctly) results in large

standard errors for one or more of the OLS coefficients as

described by the OLS wonder equation

Next topic: hypothesis tests and confidence intervals…

146

Portion of X that “explains” Y

Y

X

High R2

For any two circles,

the overlap tells the

size of the R2

147

Portion of X that “explains” Y

Y

X

LowR2

For any two circles,

the overlap tells the

size of the R2

148

Adding Another X Increases R2

Y

X1

X2

Now the R is the overlap of both X1

and X2 with Y

149

Imperfect (but high)

multicollinearity

Y

X1

X2

Since X2 and X1 share a lot of

the same information, adding

X2 allows us to work out

independent effects better, but

we realize we don‟t have

much information (area) to do

this with. Larger n makes all

circles bigger and, as before,

the overlap tells the size of R2

$SE(\hat{b}_1) \approx \dfrac{S_{\hat{u}}}{S_{x_1}\,\sqrt{n}\,\sqrt{1 - R^2_{x_1 \text{ on } x_2}}}$

150

Chapter 7: Multiple Regression:

Multiple Coefficient Testing

151

Multiple Coefficients Tests?

We know how to obtain estimates of the

coefficients, and each one is a ceteris paribus („all

other things equal‟) effect

Why would we want to do hypotheses tests about

groups of coefficients?

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + e_t$

152

Multiple Coefficients Tests

Example 1: Consider the statement that 'this whole model is worthless'.

One way of making that statement mathematically formal is to say

$\beta_1 = \beta_2 = \dots = \beta_k = 0$,

because if this is true then none of the variables x1, x2, ..., xk helps explain y.

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + e_t$

153

Multiple Coefficients Tests

Example 2: Suppose y is the share of the population that votes

for the ruling party and x1 and x2 is the spending on TV and

radio advertising.

The Prime Minister might want to know if TV is more effective than radio, as measured by the impact on the share of the popular vote for of an extra dollar spent on each. The way to write this mathematically is

1 > 2

yt = 0 + 1x1t + 2x2t + . . kxkt + et

154

Multiple Coefficients Tests

Example 3: Suppose y is the growth in GDP, x1 and x2 are the

cash rate one and two quarters ago, and that all the other X‟s are different macroeconomic variables.

Suppose we are interested in testing the effectiveness of monetary policy. One way of doing this is asking if the cash rate at any lag has an impact on GDP growth. Mathematically, this is

$\beta_1 = \beta_2 = 0$

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + e_t$

155

Multiple Coefficients Tests

In each case, we are interested in making statements about groups of coefficients.

What about just looking at the estimates?

Same problem as in t-testing. You ought to care about reliability.

What about sequential testing?

errors compound, even if possible (SW Sect. 7.2)

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + e_t$

156

Multiple Coefficients Tests

The so-called F-test can do all of these

restrictions, except for example 2.

Before turning to the F-test, let‟s do example 2,

which can be done with a t-test

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + e_t$

157

Example 2 Solution

If we define $\theta = \beta_1 - \beta_2$, then $\beta_1 = \theta + \beta_2$. Sub this in:

$y_t = \beta_0 + (\theta + \beta_2)x_{1t} + \beta_2 x_{2t} + \dots + e_t$

$= \beta_0 + \theta x_{1t} + \beta_2 (x_{1t} + x_{2t}) + \dots + e_t$

So, to test $\beta_1 > \beta_2$, just run a new regression including

x1 + x2 instead of x2 (everything else is left the same) and do a

t-test of H0: $\theta = 0$ vs. H1: $\theta > 0$. Naturally, if you accept H1: $\theta > 0$,

this implies $\beta_1 - \beta_2 > 0$, which implies $\beta_1 > \beta_2$.

This technique is called reparameterization.

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + e_t$

158

Restricted Regressions

One more thing before we do the F-test, we

must define a „restricted regression‟. This is

just the model you get when a hypothesis is

assumed true

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + e_t$

159

Restricted Regression: Example 1

Example 1: Consider the statement that 'this whole model is worthless'.

If $\beta_1 = \beta_2 = \dots = \beta_k = 0$, then the model is

$y_t = \beta_0 + e_t$

and the restricted regression would be an OLS regression of y on a constant. The estimate of the constant will just be the sample mean of y.

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + e_t$

160

Restricted Regression: Example 3

If $\beta_1 = \beta_2 = 0$, then the model is

$y_t = \beta_0 + \beta_3 x_{3t} + \dots + \beta_k x_{kt} + e_t$

and the restricted regression is an OLS

regression of y on a constant and x3 to xk.

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + e_t$

161

Properties of Restricted

Regressions

Imposing a restriction always increases the residual sum of squares, since you are forcing the estimates to take the values implied by the restriction, rather than letting OLS choose the values of the estimates to minimize the SSR

If the SSR increases a lot, it implies that the restriction is relatively „unbelievable‟. That is, the model fits a lot worse with the restriction imposed.

This last point is the basic intuition of the F-test –impose the restriction and see if SSR goes up „too much‟. http://hadm.sph.sc.edu/Courses/J716/demos/LeastSquares/LeastSquaresDemo.html

162

The F-test

To test a restriction we need to run the restricted

regression as well as the unrestricted regression (i.e.

the original regression). Let q be the number of

restrictions.

Intuitively, we want to know if the change in SSR is

big enough to suggest the restriction is wrong.

$F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}$, where r is restricted and ur is unrestricted

163

The F statistic

The F statistic is always positive, since the SSR from the restricted model can‟t be less than the SSR from the unrestricted

Essentially the F statistic is measuring the relative increase in SSR when moving from the unrestricted to restricted model

q = number of restrictions

164

The F statistic (cont)

To decide if the increase in SSR when we move to

a restricted model is “big enough” to reject the

restrictions, we need to know about the sampling

distribution of our F stat

Not surprisingly, F ~ Fq,n-k-1, where q is referred to

as the numerator degrees of freedom and n – k-1 as

the denominator degrees of freedom

165

0 c

f(F)

F

The F statistic

rejectfail to reject

Reject H0 at

significance level

if F > c

166

0 c

f(F)

F

Equivalently, using p-values

rejectfail to reject

Reject H0if p-value <

167

The R2 form of the F statistic

Because the SSRs may be large and unwieldy, an

alternative form of the formula is useful

We use the fact that SSR = TSS(1 – R2) for any

regression, so we can substitute in for SSRr and SSRur

$F = \frac{(R^2_{ur} - R^2_r)/q}{(1 - R^2_{ur})/(n-k-1)}$, where again r is restricted and ur is unrestricted

168

Overall Significance (example 1)

A special case of exclusion restrictions is to test H0: $\beta_1 = \beta_2 = \dots = \beta_k = 0$

R2 = 0 for a model with only an intercept

This is because the OLS estimator is just the sample mean,

implying TSS = SSR

The F statistic is then

$F = \frac{R^2/k}{(1 - R^2)/(n-k-1)}$

169

Dependent Variable: TESTSCR

Method: Least Squares

Date: 06/05/08 Time: 15:29

Sample: 1 420

Included observations: 420 Coefficient Std. Error t-Statistic Prob.

C 675.6082 5.308856 127.2606 0.0000

MEAL_PCT -0.396366 0.027408 -14.46148 0.0000

AVGINC 0.674984 0.083331 8.100035 0.0000

STR -0.560389 0.228612 -2.451272 0.0146

EL_PCT -0.194328 0.031380 -6.192818 0.0000

R-squared 0.805298 Mean dependent var 654.1565

Adjusted R-squared 0.803421 S.D. dependent var 19.05335

S.E. of regression 8.447723 Akaike info criterion 7.117504

Sum squared resid 29616.07 Schwarz criterion 7.165602

Log likelihood -1489.676 Hannan-Quinn criter. 7.136515

F-statistic 429.1152 Durbin-Watson stat 1.545766

Prob(F-statistic) 0.000000

$F = \frac{R^2/k}{(1 - R^2)/(n-k-1)}$ = [.8053/4] / [(1 - .8053)/(420 - 5)] = 429

170
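Reproducing that overall-significance F statistic from the printout's R-squared:

```python
# Overall-significance F statistic via the R-squared form (R2 = 0.8053, n = 420, k = 4)
r2, n, k = 0.8053, 420, 4

F = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(F))   # about 429, matching the F-statistic reported in the printout
```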

General Linear Restrictions

The basic form of the F statistic will work for any

set of linear restrictions

First estimate the unrestricted model and then

estimate the restricted model

In each case, make note of the SSR

Imposing the restrictions can be tricky – will likely

have to redefine variables again

171

F Statistic Summary

Just as with t statistics, p-values can be calculated by looking up the percentile in the appropriate Fdistribution

If only one exclusion is being tested, then F = t2, and the p-values will be the same

F-tests are done mechanically – you don‟t have to do the restricted regressions (though you have to understand how to do them for this course).

172

F-tests are Easy in EVIEWs

To test hypotheses like these in EVIEWs, use the Wald test. After you run your regression, type 'View, Coefficient tests, Wald'.

Try testing a single restriction (which you can use a t-test for) and see that t2=F, and, that the p-values are the same.

Try testing all the coefficients except the intercept are zero, and compare it with the F-test automatically calculated in EVIEWs.

SW discusses the shortcomings of F-tests at length. They crucially depend upon the assumption of homoskedasticity.

173

Start Big and Go Small

General to Specific Modeling relies upon the fact that

omitted variable bias is a serious problem.

Start with a very big model to avoid OVB

Do t-tests on individual coefficients. Delete the most

insignificant, run the model again, delete the most

insignificant variable, run the model again, and so

on….until every individual coefficient is significant.

Finally, Do an F-test on the original model excluding all

the coefficients required to get to your final model at once.

If the null is accepted, you have verified the model.

Test for Hetero, and correct for it if need be.

174

Chapter 8

Nonlinear Regression

Functions

175

176

„Linear‟ Regression = Linear in

Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments

2. Polynomials

3. Logs

4. Nonlinear functions of two variables: interactions

176

177

„Linear‟ Regression = Linear in

Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments

2. Polynomials

3. Logs

4. Nonlinear functions of two variables: interactions

177

178

Nonlinear Regression Population Regression

Functions – General Ideas (SW Section 8.1)

If a relation between Y and X is nonlinear:

The effect on Y of a change in X depends on the value of X –

that is, the marginal effect of X is not constant

A linear regression is mis-specified – the functional form is

wrong

The estimator of the effect on Y of X is biased – it needn‟t

even be right on average.

The solution to this is to estimate a regression function that is

nonlinear in X

178

179

Nonlinear Functions of a Single

Independent Variable (SW Section 8.2)

We‟ll look at two complementary approaches:

1. Polynomials in X

The population regression function is approximated by a

quadratic, cubic, or higher-degree polynomial

2. Logarithmic transformations

Y and/or X is transformed by taking its logarithm

this gives a “percentages” interpretation that makes sense

in many applications

179

180

„Linear‟ Regression = Linear in

Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments

2. Polynomials

3. Logs

4. Nonlinear functions of two variables: interactions

180

181

2. Polynomials in X

Approximate the population regression function by a polynomial:

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \dots + \beta_r X_i^r + u_i$

This is just the linear multiple regression model – except that

the regressors are powers of X!

Estimation, hypothesis testing, etc. proceed as in the

multiple regression model using OLS

The coefficients are difficult to interpret, but the regression

function itself is interpretable

181

182

Example: the TestScore – Income relation

Incomei = average district income in the i-th district

(thousands of dollars per capita)

Quadratic specification:

$TestScore_i = \beta_0 + \beta_1 Income_i + \beta_2 (Income_i)^2 + u_i$

Cubic specification:

$TestScore_i = \beta_0 + \beta_1 Income_i + \beta_2 (Income_i)^2 + \beta_3 (Income_i)^3 + u_i$

182

183

Estimation of the quadratic

specification in EViews

Test the null hypothesis of linearity against the alternative that

the regression function is a quadratic….

Dependent Variable: TESTSCR

Method: Least Squares

Sample: 1 420

Included observations: 420

White Heteroskedasticity-Consistent Standard Errors & Covariance

TESTSCR=C(1)+C(2)*AVGINC + C(3)*AVGINC*AVGINC

Coefficient Std. Error t-Statistic Prob.

C(1) 607.3017 2.901754 209.2878 0.0000

C(2) 3.850995 0.268094 14.36434 0.0000

C(3) -0.042308 0.004780 -8.850509 0.0000

R-squared 0.556173 Mean dependent var 654.1565

Adjusted R-squared 0.554045 S.D. dependent var 19.05335

S.E. of regression 12.72381 Akaike info criterion 7.931944

Sum squared resid 67510.32 Schwarz criterion 7.960803

Log likelihood -1662.708 Durbin-Watson stat 0.951439

Create a quadratic regressor

183

184

Interpreting the estimated

regression function:

(a) Plot the predicted values

TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2

(2.9) (0.27) (0.0048)

184

185

Interpreting the estimated

regression function, ctd: (b) Compute “effects” for different values of X

TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2

(2.9) (0.27) (0.0048)

Predicted change in TestScore for a change in income from

$5,000 per capita to $6,000 per capita:

$\Delta TestScore$ = (607.3 + 3.85×6 – 0.0423×6²) – (607.3 + 3.85×5 – 0.0423×5²)

= 3.4

185
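The predicted "effects" can be reproduced from the quadratic coefficients:

```python
# Predicted change in TestScore from the quadratic fit (Income in $1000s per capita)
def predicted_score(income):
    return 607.3 + 3.85 * income - 0.0423 * income ** 2

print(round(predicted_score(6) - predicted_score(5), 1))    # about 3.4
print(round(predicted_score(26) - predicted_score(25), 1))  # about 1.7
print(round(predicted_score(46) - predicted_score(45), 1))  # about 0.0
```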

186

TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2

Predicted “effects” for different values of X:

Change in Income ($1000 per capita) TestScore

from 5 to 6 3.4

from 25 to 26 1.7

from 45 to 46 0.0

The “effect” of a change in income is greater at low than high

income levels (perhaps, a declining marginal benefit of an

increase in school budgets?)

Caution! What is the effect of a change from 65 to 66?

Don’t extrapolate outside the range of the data!

186

187

TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2

Predicted “effects” for different values of X:

Change in Income ($1000 per capita) TestScore

from 5 to 6 3.4

from 25 to 26 1.7

from 45 to 46 0.0

Alternatively, d(TestScore)/d(Income) = 3.85 – 0.0846 × Income

gives (approximately) the same numbers

187

188

Summary: polynomial regression functions

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \dots + \beta_r X_i^r + u_i$

Estimation: by OLS after defining new regressors

Coefficients have complicated interpretations

To interpret the estimated regression function:

plot predicted values as a function of x

compute predicted Y/ X at different values of x

Hypotheses concerning degree r can be tested by t- and F-

tests on the appropriate (blocks of) variable(s).

Choice of degree r

plot the data; t- and F-tests, check sensitivity of estimated

effects; judgment.

188

A Final Warning: Polynomials

Can Fit Too Well

When fitting a polynomial regression function, we

need to be careful not to fit too many terms, despite

the fact that a higher order polynomial will always

fit better.

If we do fit too many terms, then any prediction

may become unrealistic.

The following applet lets us explore fitting

different polynomials to some data.http://www.scottsarra.org/math/courses/na/nc/polyRegression.html

189189

3. Are Polynomials Enough?

We can investigate the appropriateness of a

regression function by graphing the regression

function over the top of the scatterplot.

For some models, we may need to transform the

data

For example, take logs of the response variable

The site below allows us to do this, exploring some

common regression functionshttp://www.ruf.rice.edu/%7Elane/stat_sim/transformations/index.html

190190

191

„Linear‟ Regression = Linear in

Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments

2. Polynomials

3. Logs

4. Nonlinear functions of two variables: interactions

191

192

3. Logarithmic functions of Y and/or X

ln(X) = the natural logarithm of X

Logarithmic transforms permit modeling relations in

"percentage" terms (like elasticities), rather than linearly.

Here's why:

$\ln(x + \Delta x) - \ln(x) = \ln\!\left(1 + \frac{\Delta x}{x}\right) \approx \frac{\Delta x}{x}$ = proportional change in x

Numerically:

ln(1.01) - ln(1) = .00995 - 0 = .00995 (correct proportional change = .01);

ln(40) - ln(45) = 3.6889 - 3.8067 = -.1178 (correct proportional change = -.1111)

192
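A quick numerical check of the approximation:

```python
import numpy as np

# Log differences approximate proportional changes
print(np.log(1.01) - np.log(1.0))   # 0.00995, close to the exact 1% change
print(np.log(40) - np.log(45))      # -0.1178, vs the exact change (40 - 45)/45 = -0.1111
```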

193

Three log regression specifications:

Case            Population regression function

I. linear-log   $Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i$

II. log-linear  $\ln(Y_i) = \beta_0 + \beta_1 X_i + u_i$

III. log-log    $\ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i$

The interpretation of the slope coefficient differs in each case.

The interpretation is found by applying the general “before

and after” rule: “figure out the change in Y for a given change

in X.”

193

194

Summary: Logarithmic

transformations

Three cases, differing in whether Y and/or X is transformed

by taking logarithms.

The regression is linear in the new variable(s) ln(Y) and/or

ln(X), and the coefficients can be estimated by OLS.

Hypothesis tests and confidence intervals are now

implemented and interpreted “as usual.”

The interpretation of 1 differs from case to case.

Choice of specification should be guided by judgment (which

interpretation makes the most sense in your application?),

tests, and plotting predicted values

194

195

„Linear‟ Regression = Linear in

Parameters, Not Nec. Variables

1. Nonlinear regression functions – general comments

2. Polynomials

3. Logs

4. Nonlinear functions of two variables: interactions

195

Regression when X is Binary

(Section 5.3)

Sometimes a regressor is binary:

X = 1 if small class size, = 0 if not

X = 1 if female, = 0 if male

X = 1 if treated (experimental drug), = 0 if not

Binary regressors are sometimes called “dummy” variables.

So far, 1 has been called a “slope,” but that doesn‟t make sense

if X is binary.

How do we interpret regression with a binary regressor?

196

Interpreting regressions with a

binary regressor

$Y_i = \beta_0 + \beta_1 X_i + u_i$, where X is binary ($X_i$ = 0 or 1):

When $X_i$ = 0:  $Y_i = \beta_0 + u_i$

the mean of $Y_i$ is $\beta_0$

that is, $E(Y_i \mid X_i = 0) = \beta_0$

When $X_i$ = 1:  $Y_i = \beta_0 + \beta_1 + u_i$

the mean of $Y_i$ is $\beta_0 + \beta_1$

that is, $E(Y_i \mid X_i = 1) = \beta_0 + \beta_1$

so:

$\beta_1 = E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0)$

= population difference in group means

197

198

Interactions Between Independent

Variables (SW Section 8.3)

Perhaps a class size reduction is more effective in some

circumstances than in others…

Perhaps smaller classes help more if there are many English

learners, who need individual attention

That is, $\Delta TestScore / \Delta STR$ might depend on PctEL

More generally, $\Delta Y / \Delta X_1$ might depend on $X_2$

How to model such "interactions" between X1 and X2?

We first consider binary X's, then continuous X's

198

199

(a) Interactions between two binary

variables

$Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + u_i$

$D_{1i}$, $D_{2i}$ are binary

$\beta_1$ is the effect of changing D1 = 0 to D1 = 1. In this specification,

this effect doesn't depend on the value of D2.

To allow the effect of changing D1 to depend on D2, include the

"interaction term" $D_{1i} \times D_{2i}$ as a regressor:

$Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 (D_{1i} \times D_{2i}) + u_i$

199

200

Interpreting the coefficients:

$Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 (D_{1i} \times D_{2i}) + u_i$

It can be shown that

$\frac{\Delta Y}{\Delta D_1} = \beta_1 + \beta_3 D_2$

The effect of D1 depends on D2 (what we wanted)

$\beta_3$ = increment to the effect of D1 from a unit change in D2

200

201

Example: TestScore, STR, English learners

Let HiSTR = 1 if STR ≥ 20, = 0 if STR < 20, and

HiEL = 1 if PctEL ≥ 10, = 0 if PctEL < 10.

TestScore = 664.1 – 18.2 HiEL – 1.9 HiSTR – 3.5 (HiSTR × HiEL)

(1.4) (2.3) (1.9) (3.1)

"Effect" of HiSTR when HiEL = 0 is –1.9

"Effect" of HiSTR when HiEL = 1 is –1.9 – 3.5 = –5.4

Class size reduction is estimated to have a bigger effect when

the percent of English learners is large

This interaction isn't statistically significant: t = –3.5/3.1 = –1.13

201

202

(b) Interactions between continuous

and binary variables

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i$

$D_i$ is binary, X is continuous

As specified above, the effect on Y of X (holding D constant) is

$\beta_1$, which does not depend on D

To allow the effect of X to depend on D, include the

"interaction term" $D_i \times X_i$ as a regressor:

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (D_i \times X_i) + u_i$

202

203

Binary-continuous interactions: the

two regression lines

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (D_i \times X_i) + u_i$

Observations with $D_i$ = 0 (the "D = 0" group):

$Y_i = \beta_0 + \beta_1 X_i + u_i$   (the D = 0 regression line)

Observations with $D_i$ = 1 (the "D = 1" group):

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 + \beta_3 X_i + u_i$

$= (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_i + u_i$   (the D = 1 regression line)

203

204

Binary-continuous interactions, ctd.

[Figure: three panels of the D = 0 and D = 1 regression lines: same slope when $\beta_3$ = 0; different slopes and intercepts when all $\beta_i$ are non-zero; same intercept when $\beta_2$ = 0]

204

205

Interpreting the coefficients:

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i) + u_i$

Or, using calculus,

$\frac{\Delta Y}{\Delta X} = \beta_1 + \beta_3 D$

The effect of X depends on D (what we wanted)

$\beta_3$ = increment to the effect of X from a change in the level

of D from D = 0 to D = 1

205

206

Example: TestScore, STR, HiEL

(HiEL = 1 if PctEL ≥ 10)

TestScore = 682.2 – 0.97 STR + 5.6 HiEL – 1.28 (STR × HiEL)

(11.9) (0.59) (19.5) (0.97)

When HiEL = 0:

TestScore = 682.2 – 0.97 STR

When HiEL = 1:

TestScore = 682.2 – 0.97 STR + 5.6 – 1.28 STR

= 687.8 – 2.25 STR

Two regression lines: one for each HiEL group.

Class size reduction is estimated to have a larger effect when

the percent of English learners is large.

206
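The two implied slopes can be checked directly from the reported coefficients:

```python
# Group-specific slopes implied by the binary-continuous interaction
# TestScore-hat = 682.2 - 0.97*STR + 5.6*HiEL - 1.28*(STR*HiEL)
def predicted(str_, hi_el):
    return 682.2 - 0.97 * str_ + 5.6 * hi_el - 1.28 * str_ * hi_el

print(predicted(21, 0) - predicted(20, 0))   # -0.97 when HiEL = 0
print(predicted(21, 1) - predicted(20, 1))   # -2.25 when HiEL = 1
```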

207

Example, ctd: Testing hypotheses

TestScore = 682.2 – 0.97 STR + 5.6 HiEL – 1.28 (STR × HiEL)

(11.9) (0.59) (19.5) (0.97)

The two regression lines have the same slope ⇔ the

coefficient on STR × HiEL is zero: t = –1.28/0.97 = –1.32

The two regression lines have the same intercept ⇔ the

coefficient on HiEL is zero: t = 5.6/19.5 = 0.29

The two regression lines are the same ⇔ the population

coefficient on HiEL = 0 and the population coefficient on

STR × HiEL = 0: F = 89.94 (p-value < .001) !!

We reject the joint hypothesis but neither individual

hypothesis (how can this be?)

207

208

Summary: Nonlinear Regression

Functions

Using functions of the independent variables such as ln(X)

or X1 X2, allows recasting a large family of nonlinear

regression functions as multiple regression.

Estimation and inference proceed in the same way as in

the linear multiple regression model.

Interpretation of the coefficients is model-specific, but the

general rule is to compute effects by comparing different

cases (different value of the original X‟s)

Many nonlinear specifications are possible, so you must

use judgment:

What nonlinear effect you want to analyze?

What makes sense in your application?

208

Chapter 9

Misleading Statistics

209

Statistics Means Description and

Inference

Descriptive Statistics is about describing datasets.

Various visual tricks can distort these descriptions

Inferential Statistics is about statistical inference.

You know something about tricks to distort

inference (eg. Putting in lots of variables to raise

R2 or lowering to get in a variable you want).

210

Pitfalls of Analysis

There are several ways that misleading statistics

can occur (which affect both inferential and

descriptive statistics)

Obtaining flawed data

Not understanding the data

Not choosing appropriate displays of data

Fitting an inappropriate model

Drawing incorrect conclusions from analysis.

211

Poor Displays of Data: Chart

Junk

Source: Wainer (1984), How to display data badly 212

Poor Displays of Data: 2D

picture

213

Poor Displays of Data: Axes

A jump in the scale from

800,000 to 1,500,000

Increments of 100,000

214

How to Display Data

• The golden rule for displaying data in a graph is to keep it simple

• Graphs should not have any chart junk.– “minimise the ratio of ink to data” - Tufte

• Axes should be chosen so they do not inflate or deflate the differences between observations– Where possible, start the Y-axis at 0

– If this is not possible then you should consider graphing the change in the observation from one period to the next

• Some general tips on how to properly display data can be found at http://lilt.ilstu.edu/gmklass/pos138/datadisplay/sections/goodcharts.htm

215

How to Display Data

216

Incorrect Conclusions: Causality

Excess money supply (%) Increase in prices two years

later (%)

1965 4.7 1967 2.5

1966 1.9 1968 4.7

1967 7.8 1969 5.4

1968 4.0 1970 6.4

1969 1.3 1971 9.4

1970 7.8 1972 7.1

1971 11.4 1973 9.2

1972 23.4 1974 16.1

1973 22.2 1975 24.2

Correlation: 0.848

Source: Grenville and Macfarlane (1988) 217

Accompanying Letter

Sir,

Professor Lord Kaldor today (March 31) states that

“there is no historical evidence whatever” that the money

supply determines the future movement of prices with

a time lag of two years. May I refer Professor Kaldor to

your article in The Times of July 13, 1976.

Data

If one calculates the correlation between these two sets

of figures the coefficient r=0.848 and since there are seven

degrees of freedom the P value is less than 0.01. If Mr

Rees-Mogg‟s figures are correct, this would appear to a

biologist to be a highly significant correlation, for it means

that the probability of the correlation occurring by chance

is less than one in a hundred. Most betting men would

think that those were impressive odds.

Until Professor Kaldor can show a fallacy in the figures,

I think Mr Rees-Mogg has fully established his point.

Yours faithfully,

IVOR H. MILLS,

University of Cambridge Clinical School,

Department of Medicine,

218

Response

Sir,

Professor Mills today (April 4) uses correlation

analysis in your columns to attempt to resolve the

theoretical dispute over the cause(s) of inflation. He

cites a correlation coefficient of 0.848 between the rate

of inflation and the rate of change of “excess” money

supply two years before.

We were rather puzzled by this for we have always

believed that it was Scottish Dysentery that kept prices

down (with a one-year lag, of course). To reassure

ourselves, we calculated the correlation between the

following sets of figures:

219

Incorrect Conclusions: Causality

Correlation: -0.868Cases of Dysentery in

Scotland („000)

Increase in prices one year

later (%)

1966 4.3 1967 2.5

1967 4.5 1968 4.7

1968 3.7 1969 5.4

1969 5.3 1970 6.4

1970 3.0 1971 9.4

1971 4.1 1972 7.1

1972 3.2 1973 9.2

1973 1.6 1974 16.1

1974 1.5 1975 24.2

Source: Grenville and Macfarlane (1988) 220

A Final Warning

We have to inform you that the correlation coefficient

is -0.868 (which is statistically slightly more significant

than that obtained by Professor Mills). Professor Mills says

that “Until … a fallacy in the figures [can be shown], I

think Mr Rees-Mogg has fully established his point.” By

the same argument, so have we.

Yours faithfully.

G. E. J. LLEWELLYN, R. M. WITCOMB.

Faculty of Economics and Politics,

221

