1. Overview
• Do-files
• Sorting a dataset
• Combining datasets
• Creating a dataset of means or medians etc.
• Weights
• Panel data capabilities
• Dummy variables
• Lags
3. Do-files
• Do-files allow commands to be saved and executed in “batch” form.
• We will use the Stata do-file editor to write do-files.
• To open the do-file editor, select Window > Do-File Editor or click the Do-file Editor button on the toolbar.
• Could also use WordPad or Notepad: Save as “Text Document” with extension “.do” (instead of “.txt”).
4. Do-files (cont.)
• To run a do-file from within the do-file editor, either select Tools > Do or click the Do button on the editor toolbar.
• If you highlight certain lines of code, only those commands will run.
• To run a do-file from the main Stata window, either select File > Do or type:
do dofilename
• Can “comment out” lines by preceding with * or by enclosing text within /* and */.
5. Do-files (cont.)
• Can save the contents of the Review window as a do-file by right-clicking on window and selecting “Save All...”.
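A minimal sketch of what a do-file might contain, using both comment styles described above. The file name example.do is hypothetical, and the auto dataset (shipped with Stata and loaded with sysuse) is used purely for illustration.

```stata
* example.do (hypothetical file name): a minimal do-file
clear all
sysuse auto
/* The next command is commented out and will not run:
describe
*/
summarize price mpg
```

Saved as example.do in the working directory, this could be run from the Command window with do example.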
6. Sorting a dataset
• sort puts the observations in a specific order.
• To sort by alphabetic or numeric order of varname, use:
sort varname
• You can sort a file on more than one variable:
sort varlist
• Example:
sort country year
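As a small illustration of sorting on two variables, the sketch below types in four made-up observations (the variable x and its values are invented for the example) and sorts them by country and then by year within country.

```stata
clear
input str7 country year x
"Germany" 2010 4
"France"  2010 2
"Germany" 2009 3
"France"  2009 1
end
* Sort by country first, then by year within each country
sort country year
list, sepby(country)
```

After sorting, the France observations come first (2009, then 2010), followed by the Germany observations.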
7. Appending datasets
• To add another Stata dataset below the end of the dataset in memory, type:
append using filename
• Dataset in memory is called “master dataset”.
• Dataset filename is called “using dataset”.
• Variables with the same name in both datasets will be combined.
• Variables in only one dataset will have missing values for observations from the other dataset.
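The behaviour described above can be checked with a small sketch. The two datasets and their variables (id, x, y, z) are invented for illustration, and a temporary file stands in for a saved using dataset; run the block as a whole from a do-file so that the local macro created by tempfile stays defined.

```stata
* Build the using dataset and save it to a temporary file
clear
input id x
3 30
4 40
end
generate z = 1
tempfile second
save `second'

* Build the master dataset and append the using dataset below it
clear
input id x
1 10
2 20
end
generate y = 5
append using `second'
list
```

The variable x is combined across both datasets, while y is missing for the appended observations and z is missing for the original master observations.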
8. Merging datasets
• To join corresponding observations from a Stata dataset with those in the dataset in memory, type:
merge 1:1 varlist using filename
• Stata will join observations with common values of varlist, which must be present in both datasets.
• If more than one observation has the same value(s) for varlist in the master dataset, use:
merge m:1 varlist using filename
• If more than one observation has the same value(s) for varlist in the using dataset, use:
merge 1:m varlist using filename
9. Merging datasets (cont.)
• The variable _merge is automatically added to the dataset, containing:
_merge==1 Observation from master data only
_merge==2 Observation from using data only
_merge==3 Observation matched in both master and using data
• Stata reports the number of observations with each value of _merge.
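A hedged sketch of a many-to-one merge and the resulting _merge values. The datasets are made up for illustration: a country-year master dataset and a using dataset with one row per country, held in a temporary file. Run the block as a whole from a do-file.

```stata
* Using dataset: one observation per country
clear
input str2 country str10 capital
"FR" "Paris"
"DE" "Berlin"
"IT" "Rome"
end
tempfile capitals
save `capitals'

* Master dataset: several observations per country
clear
input str2 country year
"FR" 2009
"FR" 2010
"DE" 2009
"DE" 2010
"ES" 2010
end
merge m:1 country using `capitals'
tabulate _merge
```

Here the four FR and DE observations are matched (_merge==3), ES appears only in the master data (_merge==1) and IT only in the using data (_merge==2).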
EXERCISE 1
10. Merging
• Open the do-file editor in Stata. Run all your solutions to the exercises from here.
• Use cd to change the working directory to your preferred folder.
• Redo all the commands from Part I and recreate the two Stata files, “Economic data.dta” and “EU data.dta”:
do http://people.bath.ac.uk/klp33/stata_part_one_solutions.do
EXERCISE 1 (cont.)
11. Merging
• Open “Economic data.dta” (master dataset) and merge with “EU data.dta” (using dataset) using country as the match variable.
• Should you use merge 1:1, merge m:1 or merge 1:m?
• Look at the values that _merge takes: what does this indicate?
EXERCISE 1 (cont.)
12. Merging
• Remove those observations that do not contain data from both files:
drop if _merge!=3
• Create a dummy variable called eu for whether a country was a member of the EU in a given year.
13. Collapsing datasets
• To create a dataset of means, sums etc., type:
collapse (stat) varlist1 (stat) … [[weight]], by(varlist2)
• stat can be mean, sd, sum, median or other statistics.
• by(varlist2) specifies the groups over which the means etc. are to be calculated.
14. Collapsing datasets (cont.)
• Be sure to save data before attempting collapse as there is no “undo” facility.
• Example:
collapse (mean) age educ (median) income, by(country)
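A runnable sketch of collapse, assuming the auto dataset shipped with Stata rather than the survey variables in the example above. Because collapse replaces the data in memory, the sketch wraps it in preserve and restore, which is one alternative to saving the file first.

```stata
sysuse auto, clear
* Keep a temporary copy of the data, since collapse overwrites it
preserve
collapse (mean) price mpg (median) weight, by(foreign)
list
* Bring back the original 74 observations
restore
```

Between preserve and restore the dataset holds one row per value of foreign; after restore the original data are back in memory.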
15. Collapsing datasets (cont.)
• Four types of weight can be used in Stata:
– fweight (frequency weights): weights indicate the number of duplicated observations.
– pweight (sampling weights): weights denote the inverse of the probability that an observation is included in the sample.
16. Collapsing datasets (cont.)
– aweight (analytic weights): weights are inversely proportional to the variance of an observation to correct for heteroskedasticity. Often, observations represent averages and weights are the number of elements that gave rise to the average.
– iweight (importance weights): weights indicate the relative “importance” of an observation and have no formal statistical interpretation.
17. Collapsing datasets (cont.)
• Example:
collapse (mean) unemplrate [aweight=labforce], by(country)
• Weights may be used in many other Stata commands, e.g. correlate, regress.
• Note that the square brackets around the weight must be typed.
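To see what the weight does, the sketch below applies the example command to four made-up observations (the values of unemplrate and labforce are invented for illustration).

```stata
clear
input str1 country year unemplrate labforce
"A" 2009  4 100
"A" 2010  8 300
"B" 2009 10  50
"B" 2010 10  50
end
* Weighted mean of unemplrate within each country
collapse (mean) unemplrate [aweight=labforce], by(country)
list
```

For country A the weighted mean is (4×100 + 8×300)/(100 + 300) = 7, compared with an unweighted mean of 6; for country B both are 10.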
EXERCISE 2
18. Collapsing
• Collapse the merged dataset from Exercise 1 by year to produce a dataset containing the sums of pop and area and the means of gdpcap, lfpr, unemplrate and secondary across the entire EU.
• Use aweight=pop so that the variables take into account the changing populations of the countries.
• In what years did the EU have the highest unemployment rate and the highest GDP per capita?
EXERCISE 2 (cont.)
19. Collapsing
• Looking at the data editor, can you spot a problem with the collapsed data?
• (A more appropriate collapse step would use rawsum, which ignores the weights and computes the unweighted sum, rather than sum.)
• Do not save the new dataset.
20. Panel data manipulation
• Panel data generally refer to the repeated observation of a set of fixed entities at fixed intervals of time (also known as longitudinal data).
• Stata is particularly good at arranging and analysing panel data.
• Stata refers to two panel display formats:
– Wide form: useful for display purposes and often the form in which data are obtained.
– Long form: needed for regressions etc.
21. Panel data manipulation (cont.)
• Example of wide form:
id   sex   inc2008   inc2009   inc2010
1    0     5000      5500      6000
2    1     2000      2200      3300
3    0     3000      2000      1000
• Note the naming convention for inc.
• id is the i variable; the inc columns hold the values xij.
22. Panel data manipulation (cont.)
• Example of long form:
id   year   sex   inc
1    2008   0     5000
1    2009   0     5500
1    2010   0     6000
2    2008   1     2000
2    2009   1     2200
2    2010   1     3300
3    2008   0     3000
3    2009   0     2000
3    2010   0     1000
• id is the i variable, year is the j variable and inc holds the values xij.
23. Panel data manipulation (cont.)
• To change from long to wide form, type:
reshape wide varlist, i(ivarname) j(jvarname)
• varlist is the list of variables to be converted from long to wide form.
• i(ivarname) specifies the variable(s) whose unique values identify each panel unit (e.g. a person or country).
• j(jvarname) specifies the variable whose unique values denote the time period.
24. Panel data manipulation (cont.)
• To change from wide to long form, type:
reshape long stublist, i(ivarname) j(jvarname)
• stublist is the “word” part of the names of variables to be converted from wide to long form, e.g. “inc” above.
• It is important to name variables in this format, i.e. word description followed by year.
25. Panel data manipulation (cont.)
• To move between the above example datasets use:
reshape long inc, i(id) j(year)
reshape wide inc, i(id) j(year)
• These steps “undo” each other.
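The round trip can be tried directly by typing in the wide-form example data with input and then applying the two reshape commands in turn.

```stata
clear
input id sex inc2008 inc2009 inc2010
1 0 5000 5500 6000
2 1 2000 2200 3300
3 0 3000 2000 1000
end
* Wide to long: one row per id-year combination
reshape long inc, i(id) j(year)
list, sepby(id)
* Long back to wide: one row per id
reshape wide inc, i(id) j(year)
list
```

After the first reshape the data have nine rows; after the second they return to the original three rows with inc2008, inc2009 and inc2010.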
26. Lags
• You can “declare” the data to be in panel form, with the tsset command:
tsset panelvar timevar
• For example:
tsset countryid year
• After using tsset, a lag can be created with:
gen lagname = L.varname
• Similarly, L2.varname gives the second lag.
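A short sketch of the lag operator on a made-up panel (the variable x and its values are invented for illustration). Country 2 deliberately has no observation for 2009.

```stata
clear
input countryid year x
1 2008 10
1 2009 12
1 2010 15
2 2008 20
2 2010 26
end
tsset countryid year
* Lag of x within each country, based on the value one year earlier
generate lagx = L.x
list, sepby(countryid)
```

The lag is missing for the first year of each country, and also for country 2 in 2010, because L.x looks for the 2009 value rather than simply the previous row.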
EXERCISE 3
27. Manipulating a panel
• Open nlswork.dta from the internet using:
webuse nlswork
• Declare the data to be a panel using tsset, noting that idcode is the panel variable and year is the time variable.
• Generate a new variable lagwage equal to the lag of ln_wage and confirm that this contains the correct values by listing some data (use the break button):
list idcode year ln_wage lagwage
EXERCISE 3 (cont.)
28. Manipulating a panel
• Drop all variables other than idcode, year and ln_wage using the keep command (quicker than using drop).
• Use reshape wide to rearrange the data so that the first column represents each person (idcode) and the other columns contain ln_wage for a particular year.
• Return the data to long form (change wide to long in the command).
29. Univariate summary statistics
• tabstat produces a table of summary statistics:
tabstat varlist [, statistics(statlist)]
• Example:
tabstat age educ, stats(mean sd semean n)
• summarize displays a variety of univariate summary statistics (number of non-missing observations, mean, standard deviation, minimum, maximum):
summarize [varlist]
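Both commands can be tried on the auto dataset shipped with Stata, used here only as a convenient stand-in for the age and educ variables in the example above. The by() option of tabstat, which is not covered above, reports the statistics separately for each group.

```stata
sysuse auto, clear
* Default summary statistics for two variables
summarize price mpg
* Chosen statistics: mean, standard deviation, standard error of the mean, count
tabstat price mpg, statistics(mean sd semean n)
* The same idea, separately for domestic and foreign cars
tabstat price mpg, statistics(mean median) by(foreign)
```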
30. Multivariate summary statistics
• table displays a table of statistics:
table rowvar [colvar] [, contents(clist varname)]
• clist can be freq, mean, sum etc.
• rowvar and colvar may be numeric or string variables.
• Example:
table sex educ, c(mean age median inc)
31. Multivariate summary statistics (cont.)
• One “super-column” and up to 4 “super-rows” are also allowed.
32. Sets of dummy variables
• Dummy variables take the values 0 and 1 only.
• Large sets of dummy variables can be created with:
tab varname, gen(dummyname)
• When using large numbers of dummies in regressions, it is useful to name them with a pattern, e.g. id1, id2… Then id* can be used to refer to all variables beginning with id.
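A sketch of creating a set of dummies with tab and gen(), assuming the auto dataset shipped with Stata. The stub repdum is a hypothetical name, chosen so that the wildcard repdum* does not also pick up the original variable rep78.

```stata
sysuse auto, clear
* One dummy per distinct non-missing value of rep78: repdum1, ..., repdum5
tabulate rep78, generate(repdum)
* The wildcard refers to every variable whose name starts with repdum
summarize repdum*
```

Observations with a missing value of rep78 receive missing values in all of the new dummies.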
33. Linear regression
• To perform a linear regression of depvar on varlist, type:
regress depvar varlist [[weight]] [if exp] [, noconstant robust]
• depvar is the dependent variable.
• varlist is the set of independent variables (regressors).
• By default Stata includes a constant. The noconstant option excludes it. The robust option reports heteroskedasticity-robust standard errors.
34. Linear regression (cont.)
• Weights may be used, e.g. when data are group averages, as in:
regress inflation unemplrate year [aweight=pop]
• Note that here year allows for a linear time trend.
• Other estimation commands (e.g. probit) follow the same general syntax.
35. Post-estimation commands
• After any estimation command, several predicted values can be computed using predict.
• predict refers to the most recent model estimated.
• predict yhat, xb creates a new variable yhat equal to the predicted values of the dependent variable.
• predict res, residual creates a new variable res equal to the residuals.
36. Post-estimation commands (cont.)
• Linear hypotheses can be tested (e.g. t-test or F-test) after estimating a model by using test.
• test varlist tests that the coefficients corresponding to every element in varlist jointly equal zero.
• test eqlist tests the restrictions in eqlist, e.g.:
test sex==3
• The option accumulate allows a hypothesis to be tested jointly with the previously tested hypotheses.
37. Post-estimation commands (cont.)
• Example:
regress lnw sex race school age
test sex race
test school == age, accum
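The same sequence can be run end to end on the auto dataset shipped with Stata, used here in place of the wage variables above. The model and the hypothesised value of 1000 are chosen purely for illustration.

```stata
sysuse auto, clear
regress price mpg weight foreign
* Fitted values and residuals from the model just estimated
predict yhat, xb
predict res, residuals
* Joint test that the coefficients on mpg and weight are both zero
test mpg weight
* Add a further restriction to the hypotheses already tested
test foreign == 1000, accumulate
```

For every observation, yhat and res add up to price (apart from rounding in the stored values).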
EXERCISE 4
38. Statistics and regression
• Clear the memory and open nlswork.dta again:
webuse nlswork
• Type summarize to look at the summary statistics for all variables in the dataset.
• Restrict summarize to hours and ln_wage and perform it separately for non-married and married (i.e. msp==0 and msp==1).
• Use tabstat to report the mean, median, minimum and maximum for hours and ln_wage.
EXERCISE 4 (cont.)
39. Statistics and regression
• Report the mean and median of ln_wage by age (along the rows) and race (across the columns):
table age race, c(mean ln_wage median ln_wage)
• Generate a variable called age2 that is equal to the square of age (the power operator in Stata is ^).
• Create a set of race dummies with:
tab race, gen(race)
EXERCISE 4 (cont.)
40. Statistics and regression
• Regress ln_wage on: age, age2, race2, race3, msp, grade, tenure, c_city.
41. Graphs
• To obtain a basic histogram of varname, type:
histogram varname, discrete freq
• To display a scatterplot of two (or more) variables, type:
scatter varlist [[weight]]
• weight determines the size of the markers used in the scatterplot.
42. Graphs (cont.)
• There are options for (among other things):
– Adding a title (title)
– Altering the scale of the axes (xscale, yscale)
– Specifying what axis labels to use (xlabel, ylabel)
– Changing the markers used (msymbol)
– Changing the connecting lines (connect)
43. Graphs (cont.)
• Particularly useful is mlabel(varname), which labels each marker in the scatterplot with the value of varname.
• Example:
scatter gdp unemplrate, mlabel(country)
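A sketch of the graph commands, assuming the auto dataset shipped with Stata. The choice of variables, the restriction to the first ten cars and the title text are illustrative only.

```stata
sysuse auto, clear
* Histogram of a discrete variable, with frequencies on the vertical axis
histogram rep78, discrete frequency
* Scatterplot of the first ten cars, each marker labelled with the car's name
scatter price mpg in 1/10, mlabel(make) title("Price against mileage")
```

Each new graph command replaces the previous graph in the Graph window.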