An Introduction to Stata

Part II:

Data Analysis

Kerry L. Papps

1. Overview

• Do-files

• Sorting a dataset

• Combining datasets

• Creating a dataset of means or medians etc.

• Weights

• Panel data capabilities

• Dummy variables

• Lags

2. Overview (cont.)

• Summary statistics

• Basic regression commands

3. Do-files

• Do-files allow commands to be saved and executed in “batch” form.

• We will use the Stata do-file editor to write do-files.

• To open the do-file editor, click Window > Do-File Editor or click the Do-file Editor button on the toolbar.

• Could also use WordPad or Notepad: Save as “Text Document” with extension “.do” (instead of “.txt”).

4. Do-files (cont.)

• To run a do-file from within the do-file editor, either select Tools > Do or click the corresponding toolbar button.

• If you highlight certain lines of code, only those commands will run.

• To run a do-file from the main Stata window, either select File > Do or type:

do dofilename

• Can “comment out” lines by preceding with * or by enclosing text within /* and */.

5. Do-files (cont.)

• Can save the contents of the Review window as a do-file by right-clicking on window and selecting “Save All...”.
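
• A minimal sketch of a do-file showing both comment styles. The file name example.do and the use of the auto dataset (shipped with Stata) are assumptions made purely for illustration:

```stata
* example.do: a short do-file (hypothetical file name)
clear all
sysuse auto

* A line starting with an asterisk is a comment
summarize price   /* text between these delimiters is also a comment */

/* A block of commands can be commented out in the same way:
describe
*/

count
```

• Saved as example.do in the working directory, this could be run with do example.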

6. Sorting a dataset

• sort puts the observations in a specific order.

• To sort by alphabetic or numeric order of varname, use:

sort varname

• You can sort a file on more than one variable:

sort varlist

• Example:

sort country year
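
• A small sketch of sorting on two variables, using a made-up four-row dataset (the country names and years are illustrative only):

```stata
clear
input str10 country year
"France"  2010
"Austria" 2011
"France"  2009
"Austria" 2010
end

* Sort by country first, then by year within each country
sort country year
list, sepby(country)
```

• After sorting, the Austria rows come first (2010, then 2011), followed by the France rows (2009, then 2010).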

7. Appending datasets

• To add another Stata dataset below the end of the dataset in memory, type:

append using filename

• Dataset in memory is called “master dataset”.

• Dataset filename is called “using dataset”.

• Variables with the same name in both datasets will be combined.

• Variables in only one dataset will have missing values for observations from the other dataset.
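
• The behaviour described above can be sketched with two made-up one-row datasets. The variable names gdp and pop, the values, and the temporary file name first are all assumptions for illustration:

```stata
clear
input str10 country year gdp
"France" 2010 100
end
tempfile first
save `first'

clear
input str10 country year pop
"Austria" 2010 8
end

* The Austria data in memory is the master; `first' is the using dataset
append using `first'
list
```

• country and year are combined, while gdp is missing for the Austria row and pop is missing for the France row. Because tempfile creates a local macro, the whole block should be run in one go.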

8. Merging datasets

• To join corresponding observations from a Stata dataset with those in the dataset in memory, type:

merge 1:1 varlist using filename

• Stata will join observations with common values of varlist, which must be present in both datasets.

• If more than one observation has the same value(s) for varlist in the master dataset, use:

merge m:1 varlist using filename

• If more than one observation has the same value(s) for varlist in the using dataset, use:

merge 1:m varlist using filename

9. Merging datasets (cont.)

• The variable _merge is automatically added to the dataset, containing:

_merge==1   Observation from master data only

_merge==2   Observation from using data only

_merge==3   Observation from both master and using data

• Stata reports the number of observations with each value of _merge.
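
• A sketch of a many-to-one merge and the resulting _merge values, using made-up person-level and household-level data. The variable names hhid, person, age and region are hypothetical:

```stata
* Using dataset: one row per household
clear
input hhid str8 region
1 "North"
2 "South"
4 "East"
end
tempfile households
save `households'

* Master dataset: one row per person, so hhid repeats
clear
input hhid person age
1 1 40
1 2 38
2 1 25
3 1 61
end

merge m:1 hhid using `households'
tabulate _merge
list, sepby(hhid)
```

• Households 1 and 2 are matched (_merge==3), household 3 appears only in the master data (_merge==1) and household 4 only in the using data (_merge==2).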

EXERCISE 1

10. Merging

• Open the do-file editor in Stata. Run all your solutions to the exercises from here.

• Use cd to change the working directory to your preferred folder.

• Redo all the commands from Part I and recreate the two Stata files, “Economic data.dta” and “EU data.dta”:

do http://people.bath.ac.uk/klp33/stata_part_one_solutions.do

• Open “Economic data.dta” (master dataset) and merge with “EU data.dta” (using dataset) using country as the match variable.

• Should you use merge 1:1, merge m:1 or merge 1:m?

• Look at the values that _merge takes: what does this indicate?

EXERCISE 1 (cont.)

11. Merging

• Remove those observations that do not contain data from both files:

drop if _merge!=3

• Create a dummy variable called eu for whether a country was a member of the EU in a given year.

13. Collapsing datasets

• To create a dataset of means, sums etc., type:

collapse (stat) varlist1 (stat) … [[weight]], by(varlist2)

• stat can be mean, sd, sum, median or other statistics.

• by(varlist2) specifies the groups over which the means etc. are to be calculated.

14. Collapsing datasets (cont.)

• Be sure to save data before attempting collapse as there is no “undo” facility.

• Example:

collapse (mean) age educ (median) income, by(country)

15. Collapsing datasets (cont.)

• Four types of weight can be used in Stata:

– fweight (frequency weights): weights indicate the number of duplicated observations.

– pweight (sampling weights): weights denote the inverse of the probability that an observation is included in the sample.

16. Collapsing datasets (cont.)

– aweight (analytic weights): weights are inversely proportional to the variance of an observation, to correct for heteroskedasticity. Often, observations represent averages and the weights are the number of elements that gave rise to each average.

– iweight (importance weights): weights have no formal statistical interpretation; each command that accepts them defines how they are used.

17. Collapsing datasets (cont.)

• Example:

collapse (mean) unemplrate [aweight=labforce], by(country)

• Weights may be used in many other Stata commands, e.g. correlate, regress.

• Note that the square brackets around the weight must be typed.
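
• A sketch of a weighted collapse on made-up data, so that the weighted mean can be checked by hand. The values are invented, and wrapping the step in preserve and restore (a way of getting the original data back afterwards) is an addition not covered in these slides:

```stata
clear
input str1 country year unemplrate labforce
"A" 2000 10 100
"A" 2001 20 300
"B" 2000  5  50
"B" 2001  5  50
end

preserve          // keep a copy of the data in memory
collapse (mean) unemplrate [aweight=labforce], by(country)
list
restore           // bring back the original four rows
```

• For country A the weighted mean is (10*100 + 20*300)/400 = 17.5, compared with an unweighted mean of 15.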

EXERCISE 2

18. Collapsing

• Collapse the merged dataset from Exercise 1 by year to produce a dataset containing the sums of pop and area and the means of gdpcap, lfpr, unemplrate and secondary across the entire EU.

• Use aweight=pop so that the variables take into account the changing populations of the countries.

• In what years did the EU have the highest unemployment rate and the highest GDP per capita?

EXERCISE 2 (cont.)

19. Collapsing

• Looking at the data editor, can you spot a problem with the collapsed data?

• (A more appropriate collapse step would use rawsum, which computes the unweighted sum, rather than sum.)

• Do not save the new dataset.

20. Panel data manipulation

• Panel data generally refer to the repeated observation of a set of fixed entities at fixed intervals of time (also known as longitudinal data).

• Stata is particularly good at arranging and analysing panel data.

• Stata refers to two panel display formats:

– Wide form: useful for display purposes and often the form in which data are obtained.

– Long form: needed for regressions etc.

21. Panel data manipulation (cont.)

• Example of wide form:

id  sex  inc2008  inc2009  inc2010
1   0    5000     5500     6000
2   1    2000     2200     3300
3   0    3000     2000     1000

• Here id is the i variable and the inc columns are the xij variables.

• Note the naming convention for inc.

22. Panel data manipulation (cont.)

• Example of long form:

id  year  sex  inc
1   2008  0    5000
1   2009  0    5500
1   2010  0    6000
2   2008  1    2000
2   2009  1    2200
2   2010  1    3300
3   2008  0    3000
3   2009  0    2000
3   2010  0    1000

• Here id is the i variable, year is the j variable and inc is the xij variable.

23. Panel data manipulation (cont.)

• To change from long to wide form, type:

reshape wide varlist, i(ivarname) j(jvarname)

• varlist is the list of variables to be converted from long to wide form.

• i(ivarname) specifies the variable(s) whose unique values denote the spatial unit.

• j(jvarname) specifies the variable whose unique values denote the time period.

24. Panel data manipulation (cont.)

• To change from wide to long form, type:

reshape long stublist, i(ivarname) j(jvarname)

• stublist is the “word” part of the names of variables to be converted from wide to long form, e.g. “inc” above.

• It is important to name variables in this format, i.e. word description followed by year.

25. Panel data manipulation (cont.)

• To move between the above example datasets use:

reshape long inc, i(id) j(year)

reshape wide inc, i(id) j(year)

• These steps “undo” each other.
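
• The round trip between the two example datasets can be sketched by typing the wide-form table in with input and reshaping it in both directions:

```stata
clear
input id sex inc2008 inc2009 inc2010
1 0 5000 5500 6000
2 1 2000 2200 3300
3 0 3000 2000 1000
end

* Wide to long: one row per id-year, with a new variable year
reshape long inc, i(id) j(year)
list, sepby(id)

* Long to wide: back to one row per id
reshape wide inc, i(id) j(year)
list
```

• The long form has nine rows (three people times three years); reshaping back to wide returns the original three rows.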

26. Lags

• You can “declare” the data to be in panel form, with the tsset command:

tsset panelvar timevar

• For example:

tsset countryid year

• After using tsset, a lag can be created with:

gen lagname = L.varname

• Similarly, L2.varname gives the second lag.
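
• A sketch of lags in a small made-up panel. The variable gdp and its values are invented, and country 2 deliberately has no observation for 2009:

```stata
clear
input countryid year gdp
1 2008 100
1 2009 110
1 2010 121
2 2008  50
2 2010  60
end

tsset countryid year

gen lag_gdp  = L.gdp     // value one year earlier, same country
gen lag2_gdp = L2.gdp    // value two years earlier, same country
list, sepby(countryid)
```

• Lags are taken within each panel unit and by the value of the time variable, not by row position: the first year of each country has a missing lag, and for country 2 in 2010 the first lag is missing (no 2009 data) while the second lag is 50.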

EXERCISE 3

27. Manipulating a panel

• Open nlswork.dta from the internet using:

webuse nlswork

• Declare the data to be a panel using tsset, noting that idcode is the panel variable and year is the time variable.

• Generate a new variable lagwage equal to the lag of ln_wage and confirm that this contains the correct values by listing some data (use the break button):

list idcode year ln_wage lagwage

EXERCISE 3 (cont.)

28. Manipulating a panel

• Drop all variables other than idcode, year and ln_wage using the keep command (quicker than using drop).

• Use reshape wide to rearrange the data so that the first column represents each person (idcode) and the other columns contain ln_wage for a particular year.

• Return the data to long form (change wide to long in the command).

29. Univariate summary statistics

• tabstat produces a table of summary statistics:

tabstat varlist [, statistics(statlist)]

• Example:

tabstat age educ, stats(mean sd semean n)

• summarize displays a variety of univariate summary statistics (number of non-missing observations, mean, standard deviation, minimum, maximum):

summarize [varlist]
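
• A short sketch of both commands, assuming the auto dataset that ships with Stata (chosen only for illustration):

```stata
sysuse auto, clear

* Mean, standard deviation, standard error of the mean and count
tabstat price mpg, statistics(mean sd semean n)

* Default summary statistics for the same two variables
summarize price mpg
```

• With no varlist, summarize reports on every variable in the dataset.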

30. Multivariate summary statistics

• table displays table of statistics:

table rowvar [colvar] [, contents(clist varname)]

• clist can be freq, mean, sum etc.

• rowvar and colvar may be numeric or string variables.

• Example:

table sex educ, c(mean age median inc)

31. Multivariate summary statistics (cont.)

• One “super-column” and up to 4 “super-rows” are also allowed.

32. Sets of dummy variables

• Dummy variables take the values 0 and 1 only.

• Large sets of dummy variables can be created with:

tab varname, gen(dummyname)

• When using large numbers of dummies in regressions, it is useful to name them with a pattern, e.g. id1, id2… Then id* can be used to refer to all variables beginning with id.
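
• A sketch of creating a set of dummies and referring to them with a wildcard, assuming the auto dataset that ships with Stata. The stub name repdum is hypothetical and is chosen so that repdum* does not also match the original variable rep78:

```stata
sysuse auto, clear

* One dummy per distinct non-missing value of rep78
tabulate rep78, generate(repdum)

* The wildcard picks up repdum1, repdum2, ... in one go
describe repdum*
summarize repdum*
```

• Observations with a missing value of rep78 get missing values for every dummy in the set.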

33. Linear regression

• To perform a linear regression of depvar on varlist, type:

regress depvar varlist [[weight]] [if exp] [, noconstant robust]

• depvar is the dependent variable.

• varlist is the set of independent variables (regressors).

• By default Stata includes a constant. The noconstant option excludes it.

34. Linear regression (cont.)

• Weights may be used, e.g. when data are group averages, as in:

regress inflation unemplrate year [aweight=pop]

• Note that here year allows for a linear time trend.

• Other estimation commands (e.g. probit) follow the same general syntax.
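
• A sketch of the regress syntax with an if condition and options, assuming the auto dataset that ships with Stata (the choice of variables is illustrative only):

```stata
sysuse auto, clear

* Basic regression; a constant is included by default
regress price mpg weight

* The same regression without a constant
regress price mpg weight, noconstant

* Domestic cars only, with robust standard errors
regress price mpg weight if foreign == 0, robust
```

• Each call replaces the stored results of the previous one, so post-estimation commands refer to the last model only.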

35. Post-estimation commands

• After any estimation command, several predicted values can be computed using predict.

• predict refers to the most recent model estimated.

• predict yhat, xb creates a new variable yhat equal to the predicted values of the dependent variable.

• predict res, residual creates a new variable res equal to the residuals.

36. Post-estimation commands (cont.)

• Linear hypotheses can be tested (e.g. t-test or F-test) after estimating a model by using test.

• test varlist tests that the coefficients corresponding to every element in varlist jointly equal zero.

• test eqlist tests the restrictions in eqlist, e.g.:

test sex==3

• The option accumulate allows a hypothesis to be tested jointly with the previously tested hypotheses.

37. Post-estimation commands (cont.)

• Example:

regress lnw sex race school age

test sex race

test school == age, accum
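
• The same sequence can be sketched on the auto dataset that ships with Stata. The model is chosen only to show the syntax, and storing the predictions as double is an optional addition:

```stata
sysuse auto, clear
regress price mpg weight length foreign

* Fitted values and residuals from the model just estimated
predict double yhat, xb
predict double res, residual

* Joint test that two coefficients are both zero
test mpg foreign

* Add a third restriction and test all three jointly
test weight == length, accumulate
```

• By construction, price equals yhat plus res for every observation used in the regression.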

EXERCISE 4

38. Statistics and regression

• Clear the memory and open nlswork.dta again:

webuse nlswork

• Type summarize to look at the summary statistics for all variables in the dataset.

• Restrict summarize to hours and ln_wage and perform it separately for non-married and married (i.e. msp==0 and 1).

• Use tabstat to report the mean, median, minimum and maximum for hours and ln_wage.

EXERCISE 4 (cont.)

39. Statistics and regression

• Report the mean and median of ln_wage by age (along the rows) and race (across the columns):

table age race, c(mean ln_wage median ln_wage)

• Generate a variable called age2 that is equal to the square of age (the power operator in Stata is ^).

• Create a set of race dummies with:

tab race, gen(race)

EXERCISE 4 (cont.)

40. Statistics and regression

• Regress ln_wage on: age, age2, race2, race3, msp, grade, tenure, c_city.

41. Graphs

• To obtain a basic histogram of varname, type:

histogram varname, discrete freq

• To display a scatterplot of two (or more) variables, type:

scatter varlist [[weight]]

• weight determines the diameter of the markers used in the scatterplot.

42. Graphs (cont.)

• There are options for (among other things):

– Adding a title (title)

– Altering the scale of the axes (xscale, yscale)

– Specifying what axis labels to use (xlabel, ylabel)

– Changing the markers used (msymbol)

– Changing the connecting lines (connect)

43. Graphs (cont.)

• Particularly useful is mlabel(varname), which uses the values of varname as marker labels in the scatterplot.

• Example:

scatter gdp unemplrate, mlabel(country)
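
• A sketch combining several of the graph options, assuming the auto dataset that ships with Stata. The graph names, title text and axis label values are arbitrary choices for illustration:

```stata
sysuse auto, clear

* Histogram of a discrete variable, with frequencies on the y axis
histogram rep78, discrete freq name(hist_rep78, replace)

* Scatterplot with marker size set by weight, hollow circles,
* a title and custom x-axis labels
scatter price mpg [aweight=weight], msymbol(Oh) ///
    title("Price against mileage") xlabel(10(10)40) ///
    name(scatter_price, replace)
```

• The name() option keeps both graphs in memory at once; without it, each new graph replaces the previous one.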

44. Graphs (cont.)

• Graphs are not saved by log files (they appear in separate windows).

• To save a graph, select File > Save Graph.

• To insert a graph in a Word document etc., select Edit > Copy and then paste into the document. The pasted graph can be resized but is not interactive (unlike Excel charts etc.).

