Research Method

Lecture 9 (Ch9)

More on specification and data issues

Using Proxy Variables for Unobserved Explanatory Variables

Suppose you are interested in estimating the return to education, so you consider the following model.

Log(Wage)=β0+β1Educ+ β2Exp+ (β3Ability+u) …(1)

Ability is unobserved, so it is absorbed into the composite error term (β3Ability+u). If Ability is correlated with years of education, the OLS estimate of β1 will be biased.

Question: if ability is correlated with Educ, what is the direction of the bias?
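One way to reason about this question is the standard omitted-variable formula. As a sketch, write the linear projection of Ability on the included regressors as Ability = γ0 + γ1Educ + γ2Exp + r, where the γ symbols and r are notation introduced only for this illustration. The regression of Log(Wage) on Educ and Exp that leaves Ability out then has the probability limit:

```latex
\operatorname{plim}\,\tilde{\beta}_1 \;=\; \beta_1 + \beta_3\,\gamma_1
```

Here β̃1 denotes the Educ coefficient from the regression without Ability. The sign of the bias is therefore the sign of β3γ1: it is upward when ability raises wages (β3 > 0) and is positively related to education given experience (γ1 > 0).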

One way to eliminate the bias is to use panel data and apply the fixed effects or first-differencing method.

Another method is to use a proxy variable for ability. This is the topic of this section.

Suppose that IQ is a proxy variable for ability, and that IQ is available in your data.


Then, the basic idea is to estimate the following.

Regress Log(Wage) on Educ, Exp, and IQ ……………(2)

This is called the plug-in solution to the omitted variables problem.

The question is under what conditions (2) produces consistent estimates for the original regression (1). I will explain these conditions using the above example (though the arguments can be easily generalized).

It turns out that the following two conditions ensure that the plug-in solution gives consistent estimates.

Condition 1: u is uncorrelated with IQ. In addition, the original equation should satisfy the usual conditions (i.e., u is also uncorrelated with Educ, Exp, and Ability).

Condition 2: E(Ability|Educ, Exp, IQ)=E(Ability|IQ)

Condition 2 means that, once IQ is conditioned on, Educ and Exp do not explain Ability. A simpler way to express condition 2 is that Ability can be written as:

Ability=δ0+δ3IQ+v3 …………(3)

where v3 is a random error that is uncorrelated with IQ, Educ, and Exp. This means that, apart from random noise, Ability is a function of IQ only.

In condition 2, Ability is the omitted variable, Educ and Exp are the initial explanatory variables, and IQ is the proxy variable.

Then, it is clear why these two conditions guarantee that the plug-in solution produces consistent estimates. Just plug (3) into (1). Then you have

Log(Wage)=(β0+δ0)+β1Educ + β2Exp + β3δ3IQ + (u+β3v3 ) …(4)

Since u and v3 are uncorrelated with all of the explanatory variables under condition 1 and condition 2, OLS on (4) is consistent for the coefficients in (4). The intercept has changed to β0+δ0, and the coefficient on IQ is β3δ3 rather than β3, but usually you are not interested in either. Importantly, you get consistent estimates of β1 and β2.

It is also clear that, if condition 2 is violated, the plug-in solution will not work. If condition 2 is violated, Ability is a function not only of IQ, but also of Educ and Exp. So you will have:

Ability=δ0+ δ1Educ+δ2Exp+δ3IQ+v3 …(5)

If you plug (5) into (1), you have

Log(Wage)=(β0+δ0)+(β1+β3δ1)Educ + (β2+β3δ2)Exp + β3δ3IQ + (u+β3v3 ) …(6)

Thus, the coefficient for Educ is no longer β1, but it is β1+β3δ1. Thus, the plug-in solution produces inconsistent estimates when condition 2 is violated.

If condition 2 is violated, then Ability is a function of all the variables, as in (5).
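The two cases above can be checked with a small simulation. The sketch below is an illustration under made-up parameter values, not the lecture's data: it drops Exp for brevity, sets β1 = 0.06 and β3 = 0.10, and generates Ability = delta1·Educ + IQ + v3, so that delta1 = 0 corresponds to condition 2 holding and delta1 ≠ 0 to condition 2 failing. The helper names `ols` and `simulate` are hypothetical.

```python
import random


def ols(y, columns):
    """Return [intercept, slope_1, ...] from OLS of y on the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[best] = a[best], a[p]
        for r in range(k):
            if r != p:
                f = a[r][p] / a[p][p]
                a[r] = [u - f * v for u, v in zip(a[r], a[p])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate(delta1, n=20000, seed=0):
    """Return the Educ coefficient without and with the IQ proxy."""
    rng = random.Random(seed)
    beta1, beta3 = 0.06, 0.10                  # assumed true values
    educ, iq, lwage = [], [], []
    for _ in range(n):
        q = rng.gauss(0, 1)                    # standardized IQ
        e = 12 + q + rng.gauss(0, 2)           # Educ is correlated with IQ
        ability = delta1 * e + q + rng.gauss(0, 0.5)
        w = 1.0 + beta1 * e + beta3 * ability + rng.gauss(0, 0.3)
        educ.append(e)
        iq.append(q)
        lwage.append(w)
    no_proxy = ols(lwage, [educ])[1]           # Ability left in the error term
    with_proxy = ols(lwage, [educ, iq])[1]     # plug-in solution
    return no_proxy, with_proxy


if __name__ == "__main__":
    print("condition 2 holds:   ", simulate(delta1=0.0))
    print("condition 2 violated:", simulate(delta1=0.2))
```

With delta1 = 0, the plug-in coefficient should be close to β1 = 0.06, while the regression without IQ should be close to 0.06 + 0.10·(1/5) = 0.08. With delta1 = 0.2, the plug-in coefficient should be close to β1+β3δ1 = 0.08 instead, in line with the last equation above.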

Exercise

Ex.1: Use Wage2.dta to estimate a log wage equation to examine the return to education. Include in the equation exper, tenure, married, south, urban, black. Do you think that the estimated return to education is unbiased? What do you think is the direction of the bias?

Ex.2: Now, use IQ as a proxy for unobserved ability. Did the result change? Was your prediction of the direction of the bias correct?

Answer: OLS without IQ

. use "D:\My Documents\IUJ_teaching\Research Methodology\Wooldridge Econometrics resources\data\WAGE2.DTA", clear

. reg lwage educ exper tenure married south urban black

      Source |       SS       df       MS              Number of obs =     935
-------------+------------------------------           F(  7,   927) =   44.75
       Model |  41.8377619     7  5.97682312           Prob > F      =  0.0000
    Residual |  123.818521   927  .133569063           R-squared     =  0.2526
-------------+------------------------------           Adj R-squared =  0.2469
       Total |  165.656283   934  .177362188           Root MSE      =  .36547

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0654307   .0062504    10.47   0.000     .0531642    .0776973
       exper |    .014043   .0031852     4.41   0.000      .007792     .020294
      tenure |   .0117473    .002453     4.79   0.000     .0069333    .0165613
     married |   .1994171   .0390502     5.11   0.000     .1227801     .276054
       south |  -.0909036   .0262485    -3.46   0.001     -.142417   -.0393903
       urban |   .1839121   .0269583     6.82   0.000     .1310056    .2368185
       black |  -.1883499   .0376666    -5.00   0.000    -.2622717   -.1144281
       _cons |   5.395497    .113225    47.65   0.000      5.17329    5.617704
------------------------------------------------------------------------------

Answer: OLS with IQ

. reg lwage educ exper tenure married south urban black IQ

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0544106   .0069285     7.85   0.000     .0408133     .068008
       exper |   .0141458   .0031651     4.47   0.000     .0079342    .0203575
      tenure |   .0113951   .0024394     4.67   0.000     .0066077    .0161825
     married |   .1997644   .0388025     5.15   0.000     .1236134    .2759154
       south |  -.0801695   .0262529    -3.05   0.002    -.1316916   -.0286473
       urban |   .1819463   .0267929     6.79   0.000     .1293645    .2345281
       black |  -.1431253   .0394925    -3.62   0.000    -.2206304   -.0656202
          IQ |   .0035591   .0009918     3.59   0.000     .0016127    .0055056
       _cons |   5.176439   .1280006    40.44   0.000     4.925234    5.427644
------------------------------------------------------------------------------

Using a lagged dependent variable as a proxy variable

Often the lag of the dependent variable is used as a proxy for the unobserved variables.

First consider the following model.

(Crime rate) =β0+β1(unemp) + β2(expenditure) +u

If there are omitted factors that directly affect the crime rate and are at the same time correlated with the unemployment rate, the estimate of β1 will be biased. The omitted factors may be pre-existing conditions, such as demographic features (age, race, etc.). Crime rates could also differ among cities for historical reasons.

The idea is that the lag of the dependent variable may summarize such pre-existing conditions.

So, estimate the following equation

(Crime rate)it =β0+β1(unemp)it + β2(expenditure)it + β3(Crime rate)it-1 +uit

The following slides estimate the model using CRIME2.dta

Example

We use CRIME2.dta to estimate the regressions. The results are as follows.

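Without the lag of the dependent variable:

. use "D:\My Documents\IUJ_teaching\Research Methodology\Wooldridge Econometrics resources\data\CRIME2.DTA", clear

. reg lcrmrte unem llawexpc if year==87

      Source |       SS       df       MS              Number of obs =      46
-------------+------------------------------           F(  2,    43) =    1.30
       Model |  .271987199     2    .1359936           Prob > F      =  0.2824
    Residual |  4.48998214    43  .104418189           R-squared     =  0.0571
-------------+------------------------------           Adj R-squared =  0.0133
       Total |  4.76196934    45  .105821541           Root MSE      =  .32314

------------------------------------------------------------------------------
     lcrmrte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        unem |  -.0290032   .0323387    -0.90   0.375    -.0942205    .0362141
    llawexpc |   .2033652   .1726534     1.18   0.245    -.1448236    .5515539
       _cons |   3.342899   1.250527     2.67   0.011     .8209721    5.864826
------------------------------------------------------------------------------

With the lag of the dependent variable:

. reg lcrmrte unem llawexpc lcrmrt_1 if year==87

------------------------------------------------------------------------------
     lcrmrte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        unem |    .008621   .0195166     0.44   0.661    -.0307652    .0480072
    llawexpc |  -.1395764   .1086412    -1.28   0.206    -.3588231    .0796704
    lcrmrt_1 |   1.193923   .1320985     9.04   0.000     .9273371    1.460508
       _cons |   .0764511   .8211433     0.09   0.926    -1.580683    1.733585
------------------------------------------------------------------------------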

Measurement error

The existence of important omitted variables causes an endogeneity problem.

Another source of endogeneity is the measurement error.

This section explains under what circumstance the measurement error causes endogeneity, and under what circumstance it does not.

Measurement error in an explanatory variable

When the explanatory variables are measured with errors, this causes the endogeneity problem.

This is a common problem. For example, in a typical survey, the respondents may report their annual incomes with a lot of errors. Variables such as GPA or IQ may be reported with errors as well.

Now, let us understand the nature of the problem. Suppose that you want to estimate the following simple regression.

y=β0+β1x1* +u …………………….(1)

where x1* is the measurement-error free variable. Suppose that this regression satisfies MLR.1 through MLR.4.

Now, suppose that you only observe the error-ridden variable x1. That is

x1=x1*+e1 where e1 is a random error uncorrelated with x1*.

To be more precise, the measurement error is such that

x1=x1*+e1 …………….(2)

and

Cov(x1*, e1)=0 ………….(3)

(2) and (3) together are called the classical errors-in-variables (CEV) assumption.

Note that the above assumption has nothing to do with u. We maintain the assumption that u is uncorrelated with both x1* and x1. This also means that u is uncorrelated with e1.


Because we only observe the error-ridden variable x1, we can only estimate the following model.

y=β0+β1x1+v…….(4)

Under the CEV assumption, the observed (error-ridden) variable in regression (4) is endogenous.

To see this, plug x1*=x1-e1 into the original regression (1) to get

y=β0+β1x1+(u- β1e1)…….(5)

So, we have v=u- β1e1

Now, notice that

Cov(x1, v)=Cov(x1, u- β1e1)= -β1σ²e1 ≠0 (as long as β1≠0), where σ²e1 is the variance of e1. See the front board for the proof.

Therefore, x1 is correlated with the error term. Therefore, x1 is endogenous. Thus, OLS will be biased.
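The covariance result can be reconstructed in a few lines. This is a sketch of the standard argument rather than a record of what was written on the board; it uses only x1=x1*+e1, Cov(x1*, e1)=0, and the maintained assumption that u is uncorrelated with x1* and e1.

```latex
\begin{aligned}
\operatorname{Cov}(x_1, v)
  &= \operatorname{Cov}(x_1^* + e_1,\; u - \beta_1 e_1) \\
  &= \operatorname{Cov}(x_1^*, u) - \beta_1 \operatorname{Cov}(x_1^*, e_1)
     + \operatorname{Cov}(e_1, u) - \beta_1 \operatorname{Var}(e_1) \\
  &= 0 - 0 + 0 - \beta_1 \sigma^2_{e_1}
   = -\beta_1 \sigma^2_{e_1}.
\end{aligned}
```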

Under the CEV assumption, we can predict the direction of the inconsistency (characterization of the bias is difficult). Let β̂1 be the estimated coefficient from the error-ridden variable regression (4). Then, we have

plim(β̂1) = β1 ( σ²x1* / (σ²x1* + σ²e1) )

where σ²x1* and σ²e1 are the variances of x1* and e1.

Proof: see the front board
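For reference, the standard argument behind this probability limit can be sketched as follows. It assumes the simple regression (4), the CEV assumption, and that u is uncorrelated with x1* and e1.

```latex
\begin{aligned}
\operatorname{plim}\,\hat{\beta}_1
  &= \frac{\operatorname{Cov}(x_1, y)}{\operatorname{Var}(x_1)}
   = \beta_1 + \frac{\operatorname{Cov}(x_1, v)}{\operatorname{Var}(x_1)} \\
  &= \beta_1 - \frac{\beta_1 \sigma^2_{e_1}}{\sigma^2_{x_1^*} + \sigma^2_{e_1}}
   = \beta_1 \left( \frac{\sigma^2_{x_1^*}}{\sigma^2_{x_1^*} + \sigma^2_{e_1}} \right).
\end{aligned}
```

The second line uses Cov(x1, v) = −β1σ²e1 and Var(x1) = Var(x1*) + Var(e1), which holds because x1* and e1 are uncorrelated.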

Since the term inside the parentheses is always smaller than 1 when there is measurement error, the estimate is biased towards zero. This is called the attenuation bias.
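Attenuation bias is easy to see in a simulation. The sketch below uses made-up values (β1 = 2, and x1* and e1 both with variance 1), so the attenuation factor is 1/(1+1) = 0.5 and the slope from the error-ridden regression should be close to 1 rather than 2. The function names are hypothetical.

```python
import random


def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def attenuation_demo(n=50000, beta1=2.0, sd_xstar=1.0, sd_e=1.0, seed=1):
    """Return (slope using x1*, slope using the error-ridden x1)."""
    rng = random.Random(seed)
    xstar = [rng.gauss(0, sd_xstar) for _ in range(n)]
    y = [1.0 + beta1 * xs + rng.gauss(0, 1) for xs in xstar]
    # CEV assumption: e1 is drawn independently of x1* and of u.
    x = [xs + rng.gauss(0, sd_e) for xs in xstar]
    return slope(xstar, y), slope(x, y)


if __name__ == "__main__":
    clean, noisy = attenuation_demo()
    print("slope with x1*:", clean)
    print("slope with x1: ", noisy)
```

Raising `sd_e` shrinks the attenuation factor and pushes the second slope further towards zero; setting it to 0 removes the bias.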

Error in variable (more general case)

Suppose you want to estimate the following model.

y=β0+β1x1*+β2x2+….+βkxk+u

where x1* is the measurement-error-free variable. However, you only observe the error-ridden variable x1. So you can only estimate the above regression by replacing x1* with x1.

Assume that other variables are measurement error free.

Then the probability limit of β̂1 is given by

plim(β̂1) = β1 ( σ²r1* / (σ²r1* + σ²e1) )

where r1* is the population error from the following regression, and σ²r1* is its variance.

x1*=δ0+δ1x2+…+ δk-1xk+ r1*

Measurement error in the dependent variable

When the measurement error is in the dependent variable, but explanatory variables have no measurement-errors, there will be no bias in OLS.

Consider the following model.

y*=β0+β1x1 +u …………………….(1)

where y* is the measurement-error-free variable. But you only observe the error-ridden variable y.

Assume the following

y=y*+e ………………………………………….…….(2)

and

Cov(x1, e)=0 ……………………………………………...(3)

Again, we maintain the assumption that u is uncorrelated with x1.

By plugging y*=y-e into (1), we have the following estimable equation.

y=β0+β1x1 +(u+e) ……………(4)

Since e and u are not correlated with the explanatory variable, OLS estimation of (4) causes no bias.
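The same kind of simulation illustrates this case. In the sketch below (made-up values, hypothetical function names), the error e is added to y* and drawn independently of x1, so the slope should stay close to the true β1 = 2 even when the measurement error is large. The price is a noisier estimate, not a biased one.

```python
import random


def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def dep_var_error_demo(n=50000, beta1=2.0, sd_e=2.0, seed=2):
    """Return (slope using y*, slope using the error-ridden y)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    ystar = [1.0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
    # Measurement error in y, independent of x1 and of u.
    y = [ys + rng.gauss(0, sd_e) for ys in ystar]
    return slope(x, ystar), slope(x, y)


if __name__ == "__main__":
    clean, noisy = dep_var_error_demo()
    print("slope with y*:", clean)
    print("slope with y: ", noisy)
```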

Non-random sampling 1: Exogenous sampling

Consider the following regression.

Saving=β0+β1(income)+β2(age)+u

Suppose that the survey is conducted only for people over 35 years old. This is non-random sampling, but the sampling criterion is based on an independent variable. This is called sample selection based on the independent variables, and is an example of exogenous sample selection.

In this case, OLS regression of the above model has no bias.

Non-random sampling 2: Endogenous sampling

Consider the following regression.

Wealth=β0+β1(Educ)+β2(Exper)+u

However, suppose that only people with wealth below $250,000 are included in the sample. Then the sample selection criterion is based on the dependent variable. This is called sample selection based on the dependent variable, and is an example of endogenous sample selection.

In this case, the OLS estimates of the above regression are always biased.
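The contrast between exogenous and endogenous selection can be illustrated with a simulation. The sketch below is a stylized stand-in for the savings and wealth examples, not a model of them: it generates y = β1·x + u with β1 = 1, then keeps either the observations with x above a cutoff (selection on the explanatory variable) or those with y below a cutoff (selection on the dependent variable). All names and values are assumptions made for illustration.

```python
import random


def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def selection_demo(n=100000, beta1=1.0, cutoff_y=0.5, seed=3):
    """Compare slopes in the full, exogenously and endogenously selected samples."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta1 * xi + rng.gauss(0, 1) for xi in x]
    pairs = list(zip(x, y))
    exo = [(a, b) for a, b in pairs if a > 0]            # select on x
    endo = [(a, b) for a, b in pairs if b < cutoff_y]    # select on y
    return {
        "full": slope(x, y),
        "exogenous": slope(*zip(*exo)),
        "endogenous": slope(*zip(*endo)),
    }


if __name__ == "__main__":
    for name, value in selection_demo().items():
        print(name, value)
```

In this design the slope from the sample selected on x should stay close to 1, while the slope from the sample truncated on y should be noticeably smaller.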

Stratified sampling

This is a common survey method, in which the population is divided into non-overlapping groups, or strata. The sampling is random within each group.

However, some groups are often oversampled in order to increase observations for that group. Whether this causes the bias depends on whether the selection is exogenous or endogenous.

If females are oversampled, and you are interested in gender differences in savings, then this is exogenous sample selection. Thus, it causes no bias.

If people with low wealth are oversampled, and you are interested in the wealth regression, then this is endogenous sample selection. This causes a bias in the regression.

More subtle form of sample selection

Suppose that you are interested in estimating the wage offer regression.

Log(wage offer)= β0+β1(Educ)+β2(Exper)+u

When the wage offer is 'too low' for a particular person, the person may decide not to work. Thus, this person will not be included in the sample. This is a case where sample selection is caused by the person's decision to work or not.

When the decision is based on unobserved factors, OLS on the selected sample is biased. This is called the sample selection bias.

This is typically a problem for the study of the wage offer for women.

This course does not cover the method to correct for this type of bias. In the fall semester, I will cover this type of issue in a new course, 'the Cross Section and Panel Data Analysis'.
