Data Structures

Methods of Economic Investigation

Lecture 2

Why are we doing this?

Thus far: most econometrics teaching has been theory-based
Type of data can drive what you can do
Type of data affects credibility and problems with analysis
Can be hard to translate equations into applications, and even into reading papers
Rest of this course is based on applications: this lecture will help with both lectures and exercises

Choosing your data

Suppose you are interested in the causal effect of X on y: how would you test this?
If you could choose the way in which X is determined in your sample, what would you do?
  May seem fanciful, but field experiments are becoming more common in economics
  Good thought experiment: if you could have any data in the world, is this question answerable? (If not, move on!)
Good reason to choose to do a randomized controlled experiment
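A minimal simulation sketch of the thought experiment above, in Python. Everything in it is assumed for illustration: the true effect of 2, the unobserved factor `u`, the coefficient of 3 on it, and the sample size. It compares a simple difference in means when individuals select into X with the same comparison when X is assigned by a coin flip.

```python
import random


def difference_in_means(y, x):
    """Mean outcome of the treated (x == 1) minus mean outcome of the untreated (x == 0)."""
    treated = [yi for yi, xi in zip(y, x) if xi == 1]
    control = [yi for yi, xi in zip(y, x) if xi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


rng = random.Random(0)   # fixed seed so the sketch is reproducible
n = 20000
true_effect = 2.0        # assumed causal effect of X on y

# Unobserved factor (e.g. ability) that also raises y, plus pure noise.
u = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]

# Case 1: individuals with high u select into X, so X is correlated with u.
x_selected = [1 if ui > 0 else 0 for ui in u]
# Case 2: X is assigned by a coin flip, independently of u.
x_random = [1 if rng.random() < 0.5 else 0 for _ in range(n)]


def outcome(x):
    """Outcome under the assumed model y = 2*X + 3*u + noise."""
    return [true_effect * xi + 3.0 * ui + ei for xi, ui, ei in zip(x, u, noise)]


naive_estimate = difference_in_means(outcome(x_selected), x_selected)
experimental_estimate = difference_in_means(outcome(x_random), x_random)

print(f"self-selected X: {naive_estimate:.2f}")
print(f"randomized X:    {experimental_estimate:.2f}")
```

Under these assumptions the randomized comparison should land close to the assumed effect, while the self-selected comparison also picks up the difference in `u` between the two groups.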

Where does data come from?

Surveys
  Response rate
  Stratification/clusters
  Reporting error/measurement error
Administrative records
  Lots of different places
  Often kept in real time (so addresses “reporting” or “recollection” errors)
  May be missing, and that might not be random
Researchers (and you!)
  Often collected for a specific project, so be careful what it has
  More “unique” with different types of data (e.g. content analysis)

Who Collects Data

Government
  Official statistics: unemployment, GDP, etc.
  Surveys: labor force, consumption, etc.
  Records: justice system, social programs
Service providers
  Often this may be administrative (e.g. hospital records)
  Sometimes internal surveys or evaluations, which can be useful if you can get them
Third parties
  Critical for places with limited capacity (e.g. the World Bank is a big source of this for developing countries)
  University or survey research programs
  Newspapers and media sources compile LOTS of things

Different Types of Data

Cross-Sectional Data

Time Series Data

Panel Data

Repeated Cross-Section

Cross-Sectional Data

Cross-section data covers a cross section of the population; information is collected from this cross section during a given period of time.
What does this look like?
  Rows are units of observation (e.g. individuals)
  Columns may be variables

Cross-Section Data

Simple descriptive statistics across individuals: can get the sample mean and variance of the various X’s
Regressions: the standard formulas

$$y = X\beta + \varepsilon$$

$$\hat{\beta} = (X'X)^{-1} X'y$$
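The descriptive statistics and the OLS formula on this slide can be sketched in plain Python. The five observations are made up for illustration, and the constant column in X is an assumption of this sketch (the slides do not say whether an intercept is included). The small `solve` helper is only there to avoid a third-party linear algebra library.

```python
from statistics import mean, variance


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b for a square matrix a (Gauss-Jordan with partial pivoting)."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(X, y):
    """beta_hat = (X'X)^{-1} X'y, computed by solving (X'X) beta = X'y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * yi for x, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Hypothetical data: income, a 0/1 sex indicator, years of education.
income = [20.0, 26.0, 24.0, 35.0, 40.0]
sex = [0.0, 1.0, 0.0, 1.0, 0.0]
education = [10.0, 12.0, 12.0, 16.0, 18.0]

# Descriptive statistics across individuals.
mean_income = mean(income)
var_income = variance(income)   # sample variance (divides by n - 1)

# One row per individual; the first column is an assumed constant term.
X = [[1.0, s, e] for s, e in zip(sex, education)]
beta_hat = ols(X, income)
residuals = [yi - sum(b * x for b, x in zip(beta_hat, row))
             for row, yi in zip(X, income)]

print("mean, variance of income:", mean_income, var_income)
print("beta_hat:", beta_hat)
```

A useful check on any OLS fit is that the residuals are orthogonal to every column of X, which is exactly the first-order condition behind the formula.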

Algebra → Reality: Outcome Variables

Try to get a sense of the data, to translate the matrix algebra into reality.
What is the effect of education on income?
We have an outcome “y”, for example income:

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

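Algebra → Reality: RHS Variables

There may be several different X’s (indexed 1 to k). So usually we think of this as meaning that:
  X is of dimension n × k (one row per observation, one column per variable)
  We will estimate k coefficients
Our X variables look like:

$$X = \begin{pmatrix} x_{11} & \cdots & x_{k1} \\ \vdots & \ddots & \vdots \\ x_{1n} & \cdots & x_{kn} \end{pmatrix}$$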

Our Data Looks Like

ID   Income   Race   Sex   Education
1    y1       x11    x21   x31
2    y2       x12    x22   x32
3    y3       x13    x23   x33
4    y4       x14    x24   x34
5    y5       x15    x25   x35

Our data example: N = 5, k = 3
We can index our individuals by ID (useful later)
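As a rough sketch of how this table maps to the algebra, the rows can be held as records and split into the outcome vector y and the n × k matrix X. All values and the 0/1 codings below are invented for illustration.

```python
# Hypothetical records: one per individual, mirroring the table's columns.
rows = [
    {"id": 1, "income": 20.0, "race": 0, "sex": 1, "education": 10},
    {"id": 2, "income": 26.0, "race": 1, "sex": 0, "education": 12},
    {"id": 3, "income": 24.0, "race": 0, "sex": 0, "education": 12},
    {"id": 4, "income": 35.0, "race": 1, "sex": 1, "education": 16},
    {"id": 5, "income": 40.0, "race": 0, "sex": 1, "education": 18},
]

regressors = ["race", "sex", "education"]

# Outcome vector y (length n) and regressor matrix X (n rows, k columns).
y = [row["income"] for row in rows]
X = [[row[name] for name in regressors] for row in rows]
n, k = len(X), len(X[0])

# Indexing individuals by ID makes it easy to look someone up later.
by_id = {row["id"]: row for row in rows}

print(f"N = {n}, k = {k}")
print("education of individual 3:", by_id[3]["education"])
```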

What does a regression tell us?

Remember, it minimizes the sum of squared errors and will pick the 3 coefficients (one on race, one on sex, and one on education) to do that
We are interested in the coefficient on education to tell us the “effect of education on earnings”
We might still care about the effects of race and gender as “control” variables

Stata Output

Algebra → Reality: Stata Output

Using our “data”, we regress y on our X’s
To do this in Stata we would tell Stata: regress income race sex education
Output:
  Coefficients
  Standard errors
  R-squared
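To make the three pieces of output concrete, here is a hedged Python sketch for the simplest case: one regressor (education) plus a constant, with the classical standard error formula that assumes homoskedastic, uncorrelated errors. The numbers are invented, and this is not a reproduction of Stata's actual output.

```python
import math

# Hypothetical data: years of education and income for five individuals.
education = [10.0, 12.0, 12.0, 16.0, 18.0]
income = [20.0, 26.0, 24.0, 35.0, 40.0]
n = len(income)

x_bar = sum(education) / n
y_bar = sum(income) / n
s_xx = sum((x - x_bar) ** 2 for x in education)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(education, income))

# Coefficients: slope on education and the intercept.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residuals, then sum of squared residuals and total sum of squares.
residuals = [y - (intercept + slope * x) for x, y in zip(education, income)]
ssr = sum(e ** 2 for e in residuals)
sst = sum((y - y_bar) ** 2 for y in income)

# Standard error of the slope: error variance uses n - 2 degrees of freedom
# because two parameters (intercept and slope) were estimated.
sigma2 = ssr / (n - 2)
se_slope = math.sqrt(sigma2 / s_xx)

# R-squared: share of the variation in income accounted for by the fit.
r_squared = 1 - ssr / sst

print(f"coefficient on education: {slope:.3f} (s.e. {se_slope:.3f})")
print(f"R-squared: {r_squared:.3f}")
```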

Limitations

Lots of things vary over time
  Can’t control for these in cross-section data
Only source of variation is across individuals (or whatever the unit of observation is)
Identification: need observations with similar time characteristics (because we can’t control for this) but that differ on some variable of interest

Now to time series data

Pretty similar to cross-sectional data, except the data is indexed by time instead of by individual

Year   Income   Inflation   Growth   Unempl
2000   y1       x11         x21      x31
2001   y2       x12         x22      x32
2002   y3       x13         x23      x33
2003   y4       x14         x24      x34
2004   y5       x15         x25      x35

Why is time series different?

Correlation between different observations
  Violates the OLS assumption of uncorrelated errors (coefficient estimates may still be OK, but the usual standard errors are not valid for inference)
  More on this later…
Lots of things about individuals are time-invariant, so they don’t make sense in this context
Other things, often in time series data, are common across individuals (e.g. macroeconomic trends)
  Limits what we can do with these variables: we CAN’T “control” for time-invariant characteristics, so all variation comes from time variation
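One simple way to see the correlation between observations is the sample autocorrelation at lag 1. The sketch below uses two made-up series: a steadily trending one and one that alternates in sign. The function name `autocorrelation` is just a label chosen here.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    denominator = sum(d * d for d in deviations)
    # Pair each observation with the one `lag` periods earlier.
    numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
    return numerator / denominator


trending = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]            # neighbours are similar
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]      # neighbours are opposite

rho_trending = autocorrelation(trending)
rho_alternating = autocorrelation(alternating)

print("trending series, lag 1:   ", rho_trending)
print("alternating series, lag 1:", rho_alternating)
```

A value far from zero in either direction says that knowing one period tells you something about the next, which is what independent draws across individuals would not do.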

Estimating with Time Series Data

Two critical issues:
Stationarity: mean and variance not changing over time
  Stronger conditions are sometimes required: that the distribution (e.g. all moments) is the same over time/space
  May need to do something to make your data stationary (e.g. de-mean, detrend, difference, etc.)
Ergodicity: given a sufficiently long set of realizations, can estimate statistical properties
  Worry about unit roots (more on this later)
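Two of the transformations mentioned above, de-meaning and differencing, can be sketched as follows. The income series is invented, and comparing half-sample means is only a crude illustration, not a formal stationarity test; differencing also does not guarantee a stationary series in general.

```python
def demean(series):
    """Subtract the sample mean from every observation."""
    mean = sum(series) / len(series)
    return [value - mean for value in series]


def first_difference(series):
    """Change from one period to the next: y_t - y_{t-1}."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def half_means(series):
    """Mean of the first half and of the second half of the sample."""
    mid = len(series) // 2
    return sum(series[:mid]) / mid, sum(series[mid:]) / (len(series) - mid)


# Hypothetical yearly income series with an upward trend.
income = [100.0, 106.0, 109.0, 116.0, 119.0, 126.0]

income_demeaned = demean(income)
income_change = first_difference(income)

print("levels, half-sample means: ", half_means(income))
print("changes, half-sample means:", half_means(income_change))
```

In levels the two half-sample means are far apart because of the trend; in first differences they are much closer, which is the sense in which differencing can help here.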

Panel Data

Repeated observations on individuals
Common example: Labor Force Surveys
  Take information about individuals
  Usually contains time-invariant variables for any individual (race, sex, education level)
  Usually contains time-varying variables for any given individual (employed last week)
  Can contain or link to variables that are time-varying but the same across groups of individuals (local unemployment rate)

Example of Panel Data

Multi-dimensional, so indexed by time & individual

ID   Year   Income    Employment   Sex        Education
1    2000   Y1,2000   X11,2000     X21,2000   X31,2000
1    2001   Y1,2001   X11,2001     X21,2001   X31,2001
2    2000   Y2,2000   X12,2000     X22,2000   X32,2000
2    2001   Y2,2001   X12,2001     X22,2001   X32,2001
2    2002   Y2,2002   X12,2002     X22,2002   X32,2002

Panel Data Regressions

Regressions need to be indexed by all dimensions (our example is time and individual, but it could be time, state, and individual)
May allow an intercept shift (e.g. add a dummy for each year)
May allow a slope shift (e.g. allow different coefficients for men and women)

$$y_{it} = X_{it}'\beta + \varepsilon_{it}$$
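The intercept shift can be sketched by adding one 0/1 dummy per year to each row of X. In this sketch the first year is left out as the base category so the dummies are not collinear with the constant; that choice, the constant itself, and the employment values are assumptions for illustration.

```python
# Hypothetical panel in long format: one record per individual-year.
observations = [
    {"id": 1, "year": 2000, "employment": 1},
    {"id": 1, "year": 2001, "employment": 1},
    {"id": 2, "year": 2000, "employment": 0},
    {"id": 2, "year": 2001, "employment": 1},
    {"id": 2, "year": 2002, "employment": 1},
]

years = sorted({row["year"] for row in observations})
base_year, other_years = years[0], years[1:]   # base year gets no dummy


def design_row(row):
    """Constant, the regressor, then one 0/1 dummy per non-base year."""
    dummies = [1 if row["year"] == year else 0 for year in other_years]
    return [1, row["employment"]] + dummies


X = [design_row(row) for row in observations]

for row, x in zip(observations, X):
    print(row["id"], row["year"], x)
```

Each year dummy's coefficient is then read as the shift in the intercept for that year relative to the base year.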

What’s so great about Panel Data?

We can control for individual-specific factors (e.g. error component models)
  Error component models may solve some of our omitted variable bias issues (individual controls)
  Can use both “within” variation (for an individual over time) and “between” variation (across individuals at a given time)
Can be rare to have long panels
  Tend to span very short periods of time
  May make it difficult to study trends: can only see “breaks” at big changes
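A small sketch of using only “within” variation: subtract each individual's own mean from x and y, then regress the de-meaned y on the de-meaned x. The data are constructed so that y equals an individual-specific level plus 2 times x, with the levels (10 and 50) and all other numbers invented for illustration.

```python
from collections import defaultdict

# Hypothetical panel rows: (id, year, x, y), with y = a_i + 2 * x.
panel = [
    (1, 2000, 1.0, 12.0),
    (1, 2001, 2.0, 14.0),
    (1, 2002, 3.0, 16.0),
    (2, 2000, 2.0, 54.0),
    (2, 2001, 4.0, 58.0),
    (2, 2002, 3.0, 56.0),
]


def slope(xs, ys):
    """Slope from a simple regression of ys on xs (with a constant)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def within_transform(rows):
    """De-mean x and y separately for each individual."""
    by_id = defaultdict(list)
    for person, _year, x, y in rows:
        by_id[person].append((x, y))
    xs, ys = [], []
    for obs in by_id.values():
        x_mean = sum(x for x, _ in obs) / len(obs)
        y_mean = sum(y for _, y in obs) / len(obs)
        xs.extend(x - x_mean for x, _ in obs)
        ys.extend(y - y_mean for _, y in obs)
    return xs, ys


pooled_slope = slope([r[2] for r in panel], [r[3] for r in panel])
within_slope = slope(*within_transform(panel))

print("pooled slope (ignores individual levels):", pooled_slope)
print("within slope (individual means removed): ", within_slope)
```

Because individual 2 has both a higher level and a higher average x, the pooled slope mixes the two; the within slope recovers the 2 built into the data.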

Repeated Cross-Section Data

More common: annual or frequent surveys, not always the same people
Get repeated cross-sections of different cohorts of individuals
Can do several things:
  Construct a panel at a more aggregate level
  Use time-series aspects to compare cohorts

Example of Repeated Cross-Section Data

Multi-dimensional, so indexed by time & individual (but each individual appears only once)

ID   Year   Income    Employment   Sex        Education
1    2000   Y1,2000   X11,2000     X21,2000   X31,2000
2    2001   Y2,2001   X12,2001     X22,2001   X32,2001
3    2000   Y3,2000   X13,2000     X23,2000   X33,2000
4    2001   Y4,2001   X14,2001     X24,2001   X34,2001
5    2002   Y5,2002   X15,2002     X25,2002   X35,2002

Repeated Cross-Section Regressions

Index by time and whatever “group” you want to use. For example, if group 1 is men and group 2 is women, then you estimate:

$$y_{gt} = X_{gt}'\beta + \varepsilon_{gt}$$

Use similarities between groups, but can’t control for individual-specific issues
  Cohort-specific changes, e.g. selection issues
  Can allow “fixed effects” for time or group, but not as believable to control for unobservables
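Building the group-by-time data for such a regression can be sketched as averaging within each (group, year) cell. The survey records below are invented; different people are assumed to be sampled each year, so only the cell means can be followed over time.

```python
from collections import defaultdict

# Hypothetical repeated cross-section: different respondents each year.
survey = [
    {"year": 2000, "sex": "men", "income": 30.0},
    {"year": 2000, "sex": "men", "income": 34.0},
    {"year": 2000, "sex": "women", "income": 28.0},
    {"year": 2001, "sex": "men", "income": 33.0},
    {"year": 2001, "sex": "women", "income": 29.0},
    {"year": 2001, "sex": "women", "income": 31.0},
]

# Collect incomes by (group, year) cell.
cells = defaultdict(list)
for person in survey:
    cells[(person["sex"], person["year"])].append(person["income"])

# One observation per group and year: the cell mean plays the role of y_gt.
group_panel = {cell: sum(values) / len(values)
               for cell, values in sorted(cells.items())}

for (group, year), mean_income in group_panel.items():
    print(group, year, mean_income)
```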

Next Steps

Using data, can we:
  Describe the data to understand what we’ve got
  Develop some “questions” to answer
  Test our hypotheses
Application-based class: will use Stata examples

Date post:	31-Dec-2015
Upload:	debra-baldwin