Date post: | 31-Dec-2015 |
Upload: | debra-baldwin |
Why are we doing this?
Thus far: Most of econometrics teaching has been theory based
Type of data can drive what you can do
Type of data affects credibility and problems with analysis
Can be hard to translate equations into applications and even into reading papers
Rest of this course based on applications: this lecture will help with both lectures and exercises
Choosing your data
Suppose you are interested in the causal effect of X on y: how would you test this?
If you could choose the way in which X is determined in your sample, what would you do?
This may seem fanciful, but field experiments are becoming more common in economics
Good thought experiment: if you could have any data in the world, is this question answerable? (If not, move on!)
Good reason to choose to do a randomized controlled experiment
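The thought experiment can be made concrete with a small simulation. This is only a sketch on made-up data: the unobserved `ability` variable, the true effect of 2.0, and the selection rule are all assumptions chosen for illustration. It compares the treated-minus-untreated difference in mean y when X is assigned by coin flip with the same difference when units select into X.

```python
import random
import statistics


def simulate(randomized, n=20000, true_effect=2.0, seed=0):
    """Return the treated-minus-untreated difference in mean outcomes."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)  # unobserved confounder (hypothetical)
        if randomized:
            x = rng.random() < 0.5  # coin flip, unrelated to ability
        else:
            x = ability + rng.gauss(0, 1) > 0  # high-ability units select into X
        y = true_effect * x + 3.0 * ability + rng.gauss(0, 1)
        (treated if x else untreated).append(y)
    return statistics.mean(treated) - statistics.mean(untreated)


experimental = simulate(randomized=True)
observational = simulate(randomized=False)
print(f"randomized X:    {experimental:.2f}")
print(f"self-selected X: {observational:.2f}")
```

Under these assumptions the randomized comparison should land near the true effect, while the self-selected comparison also picks up the difference in `ability` between the two groups.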
Where does data come from?
Surveys
  Response rate
  Stratification/clusters
  Reporting error/measurement error
Administrative records
  Lots of different places
  Often kept in real time (so addresses “reporting” or “recollection” errors)
  May be missing, and that might not be random…
Researchers (and you!)
  Often collected for a specific project, so be careful what it has
  More “unique” with different types of data (e.g. content analysis)
Who Collects Data
Government
  Official statistics: unemployment, GDP, etc.
  Surveys: labor force, consumption, etc.
  Records: justice system, social programs
Service providers
  Often this may be administrative (e.g. hospital records)
  Sometimes, internal surveys or evaluations, which can be useful if you can get them
Third parties
  Critical for places with limited capacity (e.g. the World Bank is a big source of this for developing countries)
  University or survey research programs
  Newspapers and media sources compile LOTS of things
Cross-Sectional Data
Cross-section data covers a cross section of the population; information is collected from this cross section during a given period of time.
What does this look like?
Rows are units of observation (e.g. individuals)
Columns may be variables
Cross-Section Data
Simple descriptive statistics across individuals: can get sample mean and variance of various X’s
Regressions: the standard formulas, with one row of X per observation
$$ y = X\beta + \varepsilon $$
$$ \hat{\beta} = (X'X)^{-1} X'y $$
Algebra → Reality: Outcome Variables
Try to get a sense of the data, to translate the matrix algebra into reality.
What is the effect of education on income?
We have an outcome “y”, for example income:
$$ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} $$
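Algebra → Reality: RHS Variables
There may be several (labeled by k) different X’s.
So usually we think of this as meaning that:
  X has dimension n × k (one row per observation, one column per variable)
  We will estimate k coefficients
Our X variables look like:
$$ X = \begin{pmatrix} x_{11} & \cdots & x_{k1} \\ \vdots & \ddots & \vdots \\ x_{1n} & \cdots & x_{kn} \end{pmatrix} $$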
Our Data Looks Like:
ID  Income  Race  Sex  Education
1   y1      x11   x21  x31
2   y2      x12   x22  x32
3   y3      x13   x23  x33
4   y4      x14   x24  x34
5   y5      x15   x25  x35
Our Data Example: N=5, k=3
We can index our individuals by ID (useful later)
What does a regression tell us?
Remember, it’s minimizing the sum of squared errors and will pick the 3 coefficients (one on race, one on sex, and one on education) to do that
We are interested in the coefficient on education to tell us the “effect of education on earnings”
We might still care about the effect of race and gender as “control” variables
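The minimization can be written out in a few lines. This is a sketch assuming ordinary least squares, with x_i denoting the column vector of regressors for individual i (notation introduced here) and X stacking the x_i' as rows:

$$ \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 = \arg\min_{\beta} \, (y - X\beta)'(y - X\beta) $$

Setting the derivative with respect to β to zero gives the first-order condition

$$ -2X'(y - X\hat{\beta}) = 0 \quad\Longrightarrow\quad X'X\hat{\beta} = X'y \quad\Longrightarrow\quad \hat{\beta} = (X'X)^{-1}X'y $$

where the last step requires X'X to be invertible (no regressor is an exact linear combination of the others).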
Algebra → Reality: Stata Output
Using our “data”, suppose we regress y on our X’s
To do this in Stata we would tell Stata: regress income race sex education
Output:
  Coefficients
  Standard errors
  R-squared
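To see where those three pieces of output come from, here is a minimal NumPy sketch of the same calculation. The data are simulated (n = 200 rather than the five-row example, with made-up coefficients), so the numbers mean nothing in themselves. A constant is added because Stata's regress includes one by default, and the standard errors assume homoskedastic errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated "data" (entirely hypothetical values and coefficients).
race = rng.integers(0, 2, n)
sex = rng.integers(0, 2, n)
education = rng.integers(8, 21, n)  # years of schooling, 8 to 20
income = 10 + 1.0 * race - 2.0 * sex + 1.5 * education + rng.normal(0, 3, n)

# Design matrix: one row per individual, one column per regressor, plus a constant.
X = np.column_stack([np.ones(n), race, sex, education])
y = income.astype(float)

# Coefficients: solve the normal equations X'X b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Standard errors under homoskedasticity: sqrt of diag(s^2 (X'X)^{-1}).
residuals = y - X @ beta_hat
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# R-squared: share of the variation in y around its mean that the fit explains.
r_squared = 1 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

for name, b, se in zip(["_cons", "race", "sex", "education"], beta_hat, std_err):
    print(f"{name:>10}: {b:8.3f}  (se {se:.3f})")
print(f"R-squared: {r_squared:.3f}")
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse for the coefficients; the explicit inverse is only used for the covariance matrix.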
Limitations…
Lots of things vary over time
Can’t control for these issues in cross-section data
Only source of variation is across individuals (or whatever the unit of observation is)
Identification: need observations with similar time characteristics (because we can’t control for this) but that differ on some variable of interest
Now to time series data
Pretty similar to cross-section data, except the data are indexed by time instead of individual
Year Income Inflation Growth Unempl
2000 y1 x11 x21 x31
2001 y2 x12 x22 x32
2002 y3 x13 x23 x33
2003 y4 x14 x24 x34
2004 y5 x15 x25 x35
Why is time series different?
Correlation between different observations
  Violates OLS assumptions (estimates OK but can’t do the usual inference)
Lots of things about individuals are time-invariant, so they don’t make sense in this context.
Other things, often in time series data, are common across individuals (e.g. macroeconomic trends)
  Limits what we can do with these variables: we CAN’T “control” for time-invariant characteristics, so all variation comes from time variation…
Estimating with Time Series Data
Two critical issues:
Stationary: mean and variance not changing over time
  Stronger conditions are sometimes required, namely that the distribution (e.g. all moments) is the same over time/space
  May need to do something to make your data stationary (e.g. de-mean, detrend, difference, etc.)
Ergodic: given a sufficiently long set of realizations, can estimate statistical properties
  Worry about unit roots (more on this later)
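As an illustration of "difference" as a fix, the sketch below simulates a random walk (a textbook unit-root series, chosen here as an assumption since the slides do not name a process) and compares it with its first difference. It reports the lag-1 autocorrelation and the means of the first and second halves of each series.

```python
import random
import statistics

rng = random.Random(0)
T = 2000

# Random walk y_t = y_{t-1} + e_t: its variance grows with t, so it is not stationary.
walk = [0.0]
for _ in range(T - 1):
    walk.append(walk[-1] + rng.gauss(0, 1))

# First difference y_t - y_{t-1} recovers the shocks e_t.
diff = [b - a for a, b in zip(walk, walk[1:])]


def lag1_autocorr(series):
    """Sample correlation between a series and itself shifted by one period."""
    mean = statistics.fmean(series)
    dev = [v - mean for v in series]
    return sum(a * b for a, b in zip(dev, dev[1:])) / sum(d * d for d in dev)


def half_means(series):
    """Means of the first and second halves, a crude check for a drifting level."""
    mid = len(series) // 2
    return statistics.fmean(series[:mid]), statistics.fmean(series[mid:])


rho_walk = lag1_autocorr(walk)
rho_diff = lag1_autocorr(diff)
print("levels:      rho =", round(rho_walk, 3), " half means =", half_means(walk))
print("differences: rho =", round(rho_diff, 3), " half means =", half_means(diff))
```

In the levels, neighbouring observations are almost perfectly correlated; after differencing, that correlation should be close to zero.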
Panel Data
Repeated observations on individuals
Common example: Labor Force Surveys
  Take information about individuals
  Usually contains variables that are time-invariant for any individual (race, sex, education level)
  Usually contains variables that are time-varying for any given individual (employed last week)
  Can contain or link to variables that are time-varying but the same across groups of individuals (local unemployment rate)
Example of Panel Data
Multi-dimensional, so indexed by time & individual
ID  Year  Income   Employment  Sex       Education
1   2000  Y1,2000  X11,2000    X21,2000  X31,2000
1   2001  Y1,2001  X11,2001    X21,2001  X31,2001
2   2000  Y2,2000  X12,2000    X22,2000  X32,2000
2   2001  Y2,2001  X12,2001    X22,2001  X32,2001
2   2002  Y2,2002  X12,2002    X22,2002  X32,2002
Panel Data Regressions
Regressions need to be indexed by all dimensions (our example is time and individual but it could be time, state, and individual)
May allow an intercept shift (e.g. add a dummy for each year)
May allow a slope shift (e.g. allow different coefficients for men and women)
$$ y_{it} = X_{it}'\beta + \varepsilon_{it} $$
What’s so great about Panel Data?
We can control for individual-specific factors (e.g. error component models)
  ECM may solve some of our omitted variable bias issues (individual controls)
  Can use both “within” variation (for an individual over time) and “between” variation (across individuals in a given time)
Can be rare to have long panels
  Tend to span very short periods of time
  May make it difficult to study trends: can only see “breaks” at big changes
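One way to see what "within" variation buys is a small simulation. This is only a sketch: the panel is made up, the unobserved individual effect `alpha_i` is assumed to be correlated with the regressor, and the within (demeaning) estimator is used as one common way to control for individual-specific factors, not necessarily the error component model the course has in mind.

```python
import random
from collections import defaultdict

rng = random.Random(0)
n_individuals, n_years, true_beta = 500, 4, 1.0

# Simulated panel (entirely made up): each row is (id, x_it, y_it).
rows = []
for i in range(n_individuals):
    alpha_i = rng.gauss(0, 2)  # unobserved, time-invariant individual effect
    for _ in range(n_years):
        x = alpha_i + rng.gauss(0, 1)  # regressor correlated with alpha_i
        y = true_beta * x + alpha_i + rng.gauss(0, 1)
        rows.append((i, x, y))


def slope(xs, ys):
    """OLS slope of ys on xs (with an intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


# Pooled OLS ignores alpha_i, so it suffers from omitted variable bias.
pooled = slope([x for _, x, _ in rows], [y for _, _, y in rows])

# "Within" estimator: subtract each individual's own means, then run OLS.
sums = defaultdict(lambda: [0.0, 0.0, 0])
for i, x, y in rows:
    sums[i][0] += x
    sums[i][1] += y
    sums[i][2] += 1
x_within = [x - sums[i][0] / sums[i][2] for i, x, _ in rows]
y_within = [y - sums[i][1] / sums[i][2] for i, _, y in rows]
within = slope(x_within, y_within)

print(f"true beta: {true_beta}, pooled: {pooled:.2f}, within: {within:.2f}")
```

Demeaning removes anything that is constant for an individual, which is also why time-invariant variables such as race or sex cannot be estimated this way.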
Repeated Cross-Section Data
More common: annual or frequent surveys, not always of the same people
Get repeated cross-sections of different cohorts of individuals
Can do several things:
  Construct a panel at a more aggregate level
  Use time-series aspects to compare cohorts
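Constructing a panel at a more aggregate level usually means collapsing individuals into group-by-year cells. A minimal sketch follows, using made-up income records and sex as the (hypothetical) grouping variable:

```python
from collections import defaultdict

# Made-up repeated cross-section: different people are sampled each year.
records = [
    (2000, "men", 30.0), (2000, "men", 34.0),
    (2000, "women", 28.0), (2000, "women", 30.0),
    (2001, "men", 33.0), (2001, "men", 35.0),
    (2001, "women", 31.0), (2001, "women", 33.0),
]

# Collect incomes into (group, year) cells.
cells = defaultdict(list)
for year, group, income in records:
    cells[(group, year)].append(income)

# One observation per cell: the cell mean. The groups are now followed over time,
# even though no individual is observed twice.
cell_means = {key: sum(values) / len(values) for key, values in cells.items()}

for (group, year), mean_income in sorted(cell_means.items()):
    print(group, year, mean_income)
```

The resulting cell means are indexed by group and time, which is the data structure a group-by-time regression would use.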
Example of Repeated Cross-Section Data
Multi-dimensional, so indexed by time & individual
ID  Year  Income   Employment  Sex       Education
1   2000  Y1,2000  X11,2000    X21,2000  X31,2000
2   2001  Y2,2001  X12,2001    X22,2001  X32,2001
3   2000  Y3,2000  X13,2000    X23,2000  X33,2000
4   2001  Y4,2001  X14,2001    X24,2001  X34,2001
5   2002  Y5,2002  X15,2002    X25,2002  X35,2002
Repeated Cross-Section Regressions
Index by time and whatever “group” you want to use. For example, if group 1 is men and group 2 is women, then you estimate:
$$ y_{gt} = X_{gt}'\beta + \varepsilon_{gt} $$
Use similarities between groups, but can’t control for individual-specific issues
  Cohort-specific changes (e.g. selection issues)
Can allow ‘fixed effects’ for time or group, but not as believable to control for unobservables