Econ 120B Stata Guide Hw1 Claudio Labanca Love Lofstrom
Homework I: Stata Guide
This guide will help you learn Stata, a program used to process data for statistical inference. These instructions will aid you in completing your first homework assignment. If anything, really anything, is unclear, four of your best resources will be:
I. Office Hours (found on TED).
II. The help command in Stata: type help x, or google help x stata, replacing x with the command that you are unsure of.
III. E-mail [email protected]
IV. http://www.ats.ucla.edu/stat/stata/modules/default.htm, a great self-help guide from UCLA.
Commands will be in bold (type the phrase in bold, then hit enter). For example, describe will show the variables contained in the dataset. Stata is case sensitive: if you enter a command and the variable cannot be found, it is possible that you entered happins, not Happins.
Clicking will be in italics. The title is what you click first, and -> indicates what you click next. E.g., File -> open -> documents -> school -> Stata Homework -> dataset.dta.
I. Logistics
A) If you are using your own computer then this first step may be redundant. Once you open Stata, clear will remove any data already in memory. This will ensure that the only variables in Stata are related to the homework assignment.
B) set more off tells Stata not to pause long output with a --more-- prompt, so your commands run without interruption.
C) set mem 15 (only if you run Stata 11 or earlier, which is unlikely if you use it through VCL or a UCSD computer).
D) cap log close will close any log file that is currently open (cap is short for capture, which suppresses the error message if no log is open). A log file records what is done in Stata.
E) Choose a working directory: File -> Change Working Directory -> select a folder
F) Create a log file in which the results of the programming will be saved. E.g.: File -> log -> begin -> select the folder where you want to save it -> pick a name -> Save it
G) Open the dataset (dta file). File -> Open -> find and select the file country_happiness.dta
H) Save your data as a new file. This will make sure that you do not tamper with the original file. File -> Save As -> select the folder where you want to save it -> pick a name -> Save
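The menu clicks in steps A) to H) can also be typed as commands and kept in a do-file. The following is a minimal sketch, assuming country_happiness.dta sits in the current working directory; the names hw1.log and country_happiness_hw1.dta are hypothetical and can be anything you like.

```stata
* Setup sketch for the Logistics steps (run from a do-file or line by line).
* Assumption: country_happiness.dta is in the current working directory.
* hw1.log and country_happiness_hw1.dta are hypothetical file names.
clear
set more off
capture log close
log using hw1.log, replace text
use country_happiness.dta, clear
* Work on a copy so the original file is never overwritten.
save country_happiness_hw1.dta, replace
describe happins gdp2002
```

The log stays open after these lines, so everything you type afterwards is recorded until you close it.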
IA. Analysis: Happiness
A) describe allows you to see what variables are contained in the dataset. The dataset contains socioeconomic indicators and happiness scores for 75 countries.
describe happins gdp2002 (the two variables that we are interested in for this homework assignment).
B) summarize will give you summary statistics on the variables that you enter: number of observations, mean, Std. Dev., and min/max values.
summarize happins
summarize gdp2002
C) sort will re-arrange the observations in ascending order of the variable you name. This will allow us to see which countries are the happiest/saddest.
sort happins
browse will show you the data in cell format (like Excel). Enter the command to see for yourself that the observations are re-arranged.
D) We want to find the least/most happy country in the dataset. In order to do so, we will use list.
* _N is the total number of observations.
* _n is the observation/row number. E.g., after sorting by happins, _n==5 is the fifth unhappiest country in the dataset.
i) To find the unhappiest country: list country_name happins if _n==1
ii) To find the happiest country: list country_name happins if _n==_N
* list can also be used to find the happiness index of particular countries. We want to see how happy people are in the USA and Italy:
iii) list cty happins if country_name == "United States"
iv) list cty happins if country_name == "Italy"
E) We can use count to see how many countries are happier than a specific country. Let's see how many countries are happier than the U.S. by using the happiness index for the U.S. It is also possible to see which countries those are, and which countries have values between those of two countries, Portugal and the USA.
i) count if happins > 3.32452
ii) list country_name if happins > 3.32452
iii) a. list cty happins if country_name == "Portugal"
b. list if happins > 2.9510 & happins < 3.32452
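Typing a rounded score such as 3.32452 by hand can put the comparison country itself on the wrong side of the cutoff. As a hedged alternative, the sketch below reads each country's score from the stored result r(mean) and compares against that. It assumes the happiness data are in memory; us_h and pt_h are hypothetical scalar names. Stata also treats a missing value as larger than any number, so the condition !missing(happins) keeps countries without a score out of the count.

```stata
* Assumption: the happiness data from section I are in memory.
* us_h and pt_h are hypothetical scalar names.
quietly summarize happins if country_name == "United States"
scalar us_h = r(mean)
quietly summarize happins if country_name == "Portugal"
scalar pt_h = r(mean)
* Countries strictly happier than the U.S., ignoring missing scores
count if happins > us_h & !missing(happins)
* Countries strictly between Portugal and the U.S.
list country_name happins if happins > pt_h & happins < us_h
```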
IB. Analysis: Religion
A. Religion is a string (non-numerical) variable, so summarize won't work for it. Instead we will use the tabulate command. It gives us the frequency, percentage, and cumulative percentage for each religion type.
* describe religion (see for yourself)
* tabulate religion
B. We can look at different countries to see which religion has a majority in a particular country. For example, let's see in which countries Shiites are in the majority. It is also possible to see which countries don't practice certain religions. In order to do so we use != , the "does not equal" operator. Don't forget quotation marks for string variables!
i) list country_name happins religion if religion == "Shia Islam"
ii) list country_name happins if religion != "Catholic Heavily"
C. Once again we want to look at the summary statistics for happiness scores and GDP/capita. * summarize happins gdp2002
D. As you saw previously, it is extremely easy to find the standard deviation, mean, etc. Let's test your understanding of statistics by finding Std. Dev. manually in Stata. This will be done in a few steps.
i) We need to create a variable for the deviation. We will subtract the mean from each observation of happins.
generate happins_deviation = happins-3.043835 (we got the mean by using summarize happins).
ii) The deviation must be squared. generate happins_deviation_sq = happins_deviation^2
iii) Now it is time to add up all of the squared deviations. tabstat allows us to produce a table of statistics. tabstat happins_deviation_sq, statistics(sum)
iv) In order to do calculations in Stata we use display: display 1+1, display 5*5, display 1-1, display 5/0 (j/k, you can't divide by 0; Stata returns a missing value, shown as a dot). To get the sample variance we divide the sum of squared deviations by N-1: display 5.6961/74. Alternatively, display 5.6961/(_N-1).
v) To get the standard deviation we take the square root of the sample variance: display sqrt(.07697432).
E. Now try to calculate the standard deviation for the GDP variable.
i) generate gdp_deviation_sq = (gdp2002 - 14099.65)^2
ii) tabstat gdp_deviation_sq, statistics(sum) columns(variables)
iii) display sqrt(1.05e+10/74)
iv) The value won't be exactly the same as the one shown by using summarize. This is due to rounding of the numbers we typed in.
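The rounding in step iv) comes from retyping the mean and the sum by hand. As a sketch of one way to avoid it, the same calculation can read those numbers from the results that summarize stores (r(mean), r(N), r(sum), r(sd)). It assumes the happiness data are in memory; the variables ending in _dsq and the scalars n_v and sd_v are hypothetical helper names.

```stata
* Assumption: the happiness data from section I are in memory.
* Names ending in _dsq, and the scalars n_v and sd_v, are hypothetical.
foreach v of varlist happins gdp2002 {
    quietly summarize `v'
    scalar n_v  = r(N)
    scalar sd_v = r(sd)
    * squared deviation from the mean, using the stored mean
    generate double `v'_dsq = (`v' - r(mean))^2
    quietly summarize `v'_dsq
    * sample variance = sum of squared deviations / (N - 1)
    display "`v': manual SD = " sqrt(r(sum)/(n_v - 1)) "  summarize SD = " sd_v
}
```

With the stored results the two numbers printed for each variable should agree up to floating-point precision.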
F. It is possible to plot the distribution using Stata's graphical tools. We will plot a histogram of happins and overlay a normal density that has the same mean and standard deviation as happins. histogram happins, frequency normal
G. Let’s plot a histogram for GDP as well. histogram gdp2002, normal
H. We can look at the correlation between GDP and Happiness in two ways.
i) corr happins gdp2002, which gives us the correlation between happiness and GDP.
ii) scatter happins gdp2002, which will graph a scatter plot of their relationship.
iii) Save the graph. In the graph window: File -> Save As -> Save as type: Portable document format (*.pdf) -> select the folder where you want to save it -> pick a name -> Save
I. Let's figure out which country has a GDP/capita closest to $60,000. This could be hard to do by eye. Fortunately, we can add labels to the scatter plot.
i) scatter happins gdp2002, mlabel(country_name) mlabsize(small). Luxembourg should be that country. Notice the two axes, which show the dataset labels for our two variables, happins and gdp2002. The variable you type first will be displayed on the y-axis.
ii) Let's make the graph user friendly. We can do so by naming the graph and the axes. scatter happins gdp2002, mlabel(country_name) mlabsize(vsmall) title("Scatterplot: Happiness Score and GDP/capita") ytitle("Happiness Score") xtitle("GDP/capita")
iii) Outliers can be dangerous in Econometrics. If we consider Luxembourg an outlier, we can easily get rid of the observation. The command below drops Luxembourg from the data in memory. Alternatively, adding an if qualifier to the scatter command graphs the scatter diagram without displaying Luxembourg while keeping the observation in the dataset. drop if country_name == "Luxembourg"
J. We are done with the analysis for these variables. However, let's save the dataset and close the log file before moving on.
i) File -> save
ii) File -> log -> close
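For the labelled scatter plot in item I, here is a minimal sketch of the if-qualifier route, followed by saving the graph with a command instead of the menu. It assumes the happiness data are in memory and that Luxembourg has not been dropped. The file name happiness_gdp.png is hypothetical, and PNG export may not be available in console-only installations of Stata. The /// line continuation works in a do-file but not in the Command window.

```stata
* Assumption: the happiness data from section I are in memory.
* happiness_gdp.png is a hypothetical file name.
scatter happins gdp2002 if country_name != "Luxembourg", ///
    mlabel(country_name) mlabsize(vsmall) ///
    title("Happiness Score and GDP/capita") ///
    ytitle("Happiness Score") xtitle("GDP/capita")
* Write the graph currently on screen to a file in the working directory
graph export happiness_gdp.png, replace
```

Unlike drop, the if qualifier leaves Luxembourg in the dataset, so later commands still see all the countries.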
II. Analysis: Money
A) It is now time to use a different dataset. Before getting started we need to use some of the commands from the Logistics section (section I).
i) clear
ii) set more off
iii) cap log close
iv) File -> log -> begin -> select a folder -> pick a name -> Save it
v) Open the dataset. File -> open -> find and open CEOSAL1.dta
vi) Then save the file before getting started. File -> Save As -> select a folder -> pick a name -> Save it
B) It's generally a good idea to look at the variables in the dataset. describe
C) The two variables of interest are CEO salaries and return on equity. list salary roe if _n < 25 (this lists the first 24 observations)
D) sum (short for summarize; with no variable names it gives summary statistics for every variable)
E) The industry that the data is drawn from should give additional information. This is a discrete variable. A better way to describe this type of variable is through the command tab indus
F) It is possible to look at the cross-tab of two discrete variables. With the row option, the cross-tab reports each cell's relative frequency within its row. In our example, it gives the conditional distribution of finance given that indus takes value 0 (first row) or 1 (second row). It is essentially the conditional distribution of the column variable given the row variable. tabulate indus finance, row
G) We can also find the conditional distribution of the row variable given the column variable. tabulate indus finance, column
H) Lastly it is possible to get the joint distribution of industrial firms and financial firms. tabulate indus finance, cell
I) Let us look at the correlation between salary and return on equity while excluding potential outliers. corr salary roe if salary <5000
J) It is time to create another scatter plot. To make the plot easier to read we are going to add reference lines to it. We want to see how many CEOs make more than $5,000,000/year and how many companies have an ROE of 50% or higher. Salary is recorded in thousands of dollars, so the horizontal line goes at 5000. scatter salary roe, yline(5000) xline(50)
K) Let's plot a histogram for salary.
i) hist salary
ii) histogram salary, normal (this compares the histogram of salary to a normal density)
iii) Income variables such as salary are often represented by their natural log. histogram lsalary, normal. This creates a histogram that is easier to read than the previous one.
L) hist roe, normal
M) File -> Save
N) File -> log -> close
Good luck!
Summary Table of the Logical Expressions in Stata
Operator   Short description
<          less than
<=         less than or equal
==         equal
>          greater than
>=         greater than or equal
!=         not equal
&          and
|          or
!          not
Summary Table of the Stata Commands seen in Tutorial 1
Command     Short description                                                   Example
describe    shows characteristics of the variable(s) contained in the dataset   des variable_name
summarize   gives summary statistics on the variables that you enter            sum variable_name
sort        re-arranges the data in ascending order of the variable             sort variable_name
browse      shows the data in cell format (like Excel)
list        can be used to find the value of a particular variable              list country_name happins religion if religion == "Shia Islam"
count       counts observations, e.g. countries happier than a specific one     count if happins > 3.32452
generate    creates a variable                                                  gen variable_name = insert_formula
tabstat     produces a table of statistics                                      tabstat variable_name
tabstat variable_name, statistics(sum)
            adds up all of the values stored for a certain variable             tabstat variable_name, statistics(sum)
display     performs calculations in Stata                                      display sqrt(1.05e+10/74)
histogram   plots a histogram                                                   histogram variable_name, normal
corr        shows the correlation between variables                             corr variable_name1 variable_name2
scatter     graphs a scatter plot of the relationship between two variables     scatter variable_name1 variable_name2
tab         tabulates the frequencies of a discrete variable                    tab indus
STATA Tutorial #2
If you need any additional guidance, or are having other issues with STATA, try the following:
Attend office hours, the exact times of which can be found on TED.
Use the “help” command on STATA or Google (e.g. help scatter if you want clarification
on how the “scatter” command works).
Send questions to [email protected].
KEY
→ Type into Command box □ Left Click
1. → clear
2. → cap log close
a. The “cap log close” command, in this case, tells STATA to close any log files you may
currently have open.
3. □ File > □ Log > □ Begin
a. This allows you to begin a new log (which you will need to do in order to turn in your
homework assignments). Make sure to save your log as a .log to receive full points on
your homework assignment!
4. □ File > □ Open
a. Open your dataset (wine.dta).
b. Alternatively, you could choose to use STATA’s “use” command, which also tells STATA
to load a designated dataset.
5. → save wine_out.dta, replace
a. We don’t want to actually alter the original dataset (wine.dta), so we will save it under a
new name, in this case “wine_out.dta.”
b. The “replace” option here tells STATA to overwrite any existing file that is already
named wine_out.dta.
6. → describe
a. The “describe” command shows us what our dataset contains: the number of
observations, variables, etc. Often, it will also give a brief description of what each
variable represents.
7. → scatter alcohol heart, mlabel(country) mlabsize(vsmall)
a. We are now using the “scatter” command to create a scatterplot representing the
relationship between alcohol consumption and heart disease. Note that alcohol
consumption, listed first here, is on the Y-axis, while heart disease, listed second here, is
on the X-axis.
b. The “mlabel” option allows us to label the points by country, while the “mlabsize”
option allows us to manipulate the appearance of said labels (in this case, “vsmall” tells
STATA to make the label text very small).
c. We can see, based on the scatterplot produced, that the two variables appear to be
negatively correlated: the higher the wine consumption, the lower the deaths
by heart disease.
8. → scatter alcohol liver, mlabel(country) mlabsize(vsmall)
a. We can create a similar scatterplot to observe the relationship between alcohol
consumption and deaths by liver disease (in this case, the variables appear to be
positively correlated).
9. → regress heart alcohol, robust
a. We now want to run a regression of deaths by heart disease on wine
consumption. The “regress” command tells STATA to run a linear regression.
i. Recall that if errors are not homoscedastic, we must use heteroscedasticity-robust
standard errors in order to make valid inferences. We can tag on the robust
option to accommodate this.
b. STATA gives us a lot of information: in the top right corner, we can see the sample size,
the standard error of the regression (Root MSE), and the R-squared. We are also told the
degrees of freedom, estimated coefficients, and standard errors, displayed in other regions
of the command output.
10. → display 46817.5108/ 107044.286
a. We can manually calculate the R-squared using the “display” command.
i. The Explained Sum of Squares (ESS) is given to us by Stata as the Model SS; the
Sum of Squared Residuals (SSR) is given to us as the Residual SS;
and the Total Sum of Squares (TSS) is given to us as the Total SS. Stata prints this
table of sums of squares only when regress is run without the robust option, so if
you do not see it, run regress heart alcohol first.
ii. To calculate the R-squared, divide the ESS value by the TSS value (46817.5108/
107044.286).
11. → display 1-(60226.7749/ 107044.286)
a. Alternatively, we can calculate the R-squared using the formula 1-(SSR/TSS). Again we
can show this on STATA using the “display” command.
12. → display _b[_cons] + _b[alcohol]* 8
a. STATA stores the estimated coefficients under the name “_b.” Thus “_b[_cons]”
gives us the coefficient of the constant term (the intercept). Meanwhile “_b[alcohol]”
gives us the slope of the regression line.
b. To predict the value of deaths by heart disease in a country with a wine-per-capita
consumption of 8 liters per year, use the display command as shown above. We are
essentially plugging “8 [liters]” into the regression line.
13. → twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall))
a. The “twoway” command produces a twoway graph according to our specifications.
i. The “lfit” option generates a line of best fit through our original scatterplot
(initially generated in step 7).
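Steps 10 to 12 retype numbers from the regression output. As a hedged alternative, the same quantities can be read from the results that regress stores: e(mss) is the Model SS, e(rss) the Residual SS, and e(r2) the R-squared. The sketch below assumes the wine data are in memory. lincom is used here as one way to get the prediction at 8 liters together with a standard error.

```stata
* Assumption: the wine data (wine.dta or wine_out.dta) are in memory.
quietly regress heart alcohol
* R-squared two ways, from stored sums of squares (TSS = ESS + SSR)
display "ESS/TSS     = " e(mss)/(e(mss) + e(rss))
display "1 - SSR/TSS = " 1 - e(rss)/(e(mss) + e(rss))
display "e(r2)       = " e(r2)
* Predicted deaths by heart disease at 8 liters of wine per capita
lincom _cons + 8*alcohol
```

The point estimate reported by lincom is the same number as display _b[_cons] + _b[alcohol]*8.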
The next two steps (14-15) are somewhat irrelevant to the tutorial as a whole but will help you in the
completion of your second homework assignment.
14. → twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall))
(function y= 253.78 - 21.733*x, range(alcohol))
a. The “function” option appended to our command from step 13 draws a function in
the above graph, in this case y = 253.78 - 21.733*x.
15. → twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall))
(function y= 253.78 - 21.733*x, range(alcohol)), legend(order(1 2 "Observed" 3 "A function of
interest"))
a. Here we’ll attempt to make the graph legend a little clearer. The legend option allows us
to label our graph more deliberately (to better illustrate this, try also twoway (lfit heart
alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall)) (function y= 253.78 -
21.733*x, range(alcohol)) and see what your key would look like in this case).
16. → predict yhat_h
a. This saves all fitted values.
b. □ Data > □ Variables Manager shows the new variable “yhat_h,” labelled “Fitted
Values.”
17. → predict uhat_h, residuals
a. Let’s also save the residuals from the regression. Again, □ Data > □ Variables Manager
should show you the new variable “uhat_h,” labelled “Residuals.”
18. → generate uhat_alt= heart - yhat_h
a. Experimentally, we can verify that the difference between the actual observed value
and the value predicted by the model equals the residual.
19. → drop uhat_alt
a. Drop the variable uhat_alt.
20. → tabstat uhat_h, statistic(sum)
a. We can check to see that the sum of the residuals equals zero using the tabstat
command, with the statistic(sum) option.
21. → rvpplot alcohol, yline(0) mlabel(country) mlabsize(vsmall)
a. Using the “rvpplot” command, we can plot the residuals. Note that the values of the
residuals are shown on the vertical axis, and that the level of alcohol consumption is
displayed on the horizontal axis.
22. → rvfplot, yline(0) mlabel(country) mlabsize(vsmall)
a. Let’s now instead plot the residuals against the fitted values, using the “rvfplot”
command.
23. → sort uhat_h
a. Use the “sort” command to organize the residuals in ascending order (recall the “sort”
command from the first tutorial and homework assignment).
24. → list country alcohol heart yhat_h uhat_h
a. Using the “list” command, try to observe the typical size of the residuals. By observing
the residual values, we can more readily see the countries that don’t work well with the
OLS regression.
25. → regress heart alcohol if country != "Japan"
a. We can see that Japan doesn’t seem to work well with this regression model (note its
large residual). Let’s try running the regression without Japan.
b. The qualifier if country != "Japan" tells STATA to run the regression only on
observations whose country name is not Japan.
26. → set seed 101040
a. STATA can draw a random sample of size n from the data in memory with the
“bsample” command. Before drawing the sample we set a “seed” value, a number that
makes the random draw reproducible. The seed can be whatever number you like; here let’s use 101040.
27. → bsample 10
a. To take our random sample, we’ll use the “bsample” command, followed by our desired
sample size. We’ll use a sample size of n=10. The sample is drawn with replacement, so
a country can appear more than once.
28. → describe
a. Use the “describe” command to confirm that the dataset now has 10 observations.
29. → regress heart alcohol
a. Let’s run the regression again, on our 10 observations.
30. → save wine_out.dta, replace
a. Save the current data (now the 10-observation sample) as wine_out.dta, replacing the
copy saved in step 5.
31. → clear
a. Let’s begin anew.
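Because bsample replaces the data in memory with the drawn sample, the full dataset is gone afterwards unless you reload it. The following sketch shows one way around this, wrapping the draw in preserve and restore. It assumes wine.dta is in the working directory; the local macro n_full is a hypothetical name used only to show that the full data come back.

```stata
* Assumption: wine.dta is in the current working directory.
use wine.dta, clear
local n_full = _N
* preserve takes a snapshot of the data; restore brings it back
preserve
set seed 101040
bsample 10
regress heart alcohol
restore
display "observations after restore: " _N " (full data: `n_full')"
```

Run these lines together from a do-file, since a local macro typed in the Command window is not visible inside a do-file and vice versa.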
32. □ File > □ Open
a. We will now use the dataset with CEO salaries. Locate and open it in STATA.
33. → save ceosal2_tut2.dta, replace
34. → describe
a. Use the “describe” command to familiarize yourself with the new dataset. Observe the
variables, their descriptions, etc.
35. → regress salary ceoten
a. Let’s run a regression of salary (salary) on the number of years an
individual has been a CEO (ceoten).
36. → twoway (scatter salary ceoten) (lfit salary ceoten), legend(order(1 "Observed" 2 "Fitted by
Linear Model"))
a. Use the “twoway” command to create a twoway graph that illustrates the relationship
between salary and length of CEO tenure. Note the line of best fit that appears
alongside the data points on the scatterplot.
37. → regress lsalary ceoten
a. We’ll use the “regress” command to regress the log of salary on CEO tenure.
38. → twoway (scatter lsalary ceoten) (lfit lsalary ceoten)
a. Again, let’s use the “twoway” command to create a twoway graph that shows us visually
the line of best fit through a scatterplot of the data points.
39. Predicted_salary = exp(b0_hat + b1_hat * ceoten)
a. This line is a formula rather than a command to type. It lets us express the fitted
relationship in terms of salary instead of the log of salary: if Predicted_log(salary) =
b0_hat + b1_hat * ceoten, then the predicted salary is
Predicted_salary = exp(b0_hat + b1_hat * ceoten).
40. → twoway (scatter salary ceoten) (function y = exp(_b[_cons] + _b[ceoten]*x), range(ceoten)),
legend(order(1 "Observed" 2 "Fitted by Log Model"))
a. From here, we can now graph a twoway graph that visually expresses the relationship
between salary and CEO tenure.
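The formula in step 39 can also be turned into a variable in the dataset. The following is a minimal sketch, assuming the CEO salary data (with lsalary and ceoten) are in memory; lsal_hat and sal_hat_log are hypothetical variable names.

```stata
* Assumption: the CEO salary data (lsalary, ceoten) are in memory.
* lsal_hat and sal_hat_log are hypothetical variable names.
quietly regress lsalary ceoten
* fitted values of log(salary): b0_hat + b1_hat*ceoten
predict double lsal_hat, xb
* back to the salary scale, as in the formula of step 39
generate double sal_hat_log = exp(lsal_hat)
list ceoten salary sal_hat_log in 1/5
```

Plotting sal_hat_log against ceoten traces the same curve that the function() plot in step 40 draws.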
41. → regress lsalary lsales
a. Let’s regress the log of salary on the log of sales. We are effectively estimating a
constant elasticity model that relates the CEO’s salary to sales generated by the firm in
millions of dollars. This relationship is modeled by log(salary) = b0 + b1 log(sales) + u.
42. → regress salary ceoten
43. → summarize
a. Recall that the “summarize” command can be used to familiarize ourselves with the
dataset: here we can use it to find values such as the average salary and tenure of a
CEO.
44. → display _b[_cons] + _b[ceoten]*7.954802
a. If we plug the average tenure of the CEO in our estimated regression, we should get
back the average salary of a CEO. We can use STATA to verify this.
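Rather than typing 7.954802 by hand, the check in step 44 can be sketched with stored results, which avoids any rounding. It assumes the CEO salary data are in memory; mean_ten is a hypothetical scalar name. The qualifier if e(sample) restricts summarize to the observations used in the regression.

```stata
* Assumption: the CEO salary data (salary, ceoten) are in memory.
* mean_ten is a hypothetical scalar name.
quietly regress salary ceoten
quietly summarize ceoten if e(sample)
scalar mean_ten = r(mean)
quietly summarize salary if e(sample)
display "prediction at mean tenure = " _b[_cons] + _b[ceoten]*mean_ten
display "average salary            = " r(mean)
```

The two displayed numbers coincide because an OLS line with an intercept passes through the point of sample means.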
45. → regress salary ceoten, robust
a. Recall that if the errors are not homoscedastic, homoscedasticity-only standard errors of
the estimators are not appropriate. If errors are not homoscedastic, then we must use
heteroscedasticity-robust standard errors in order to make valid inferences.
b. To tell STATA that we want heteroscedasticity-robust errors (as opposed to
homoscedasticity-only errors, which STATA gives us by default) we tag on the “robust”
option.
46. → set seed 101040
a. Again, STATA allows us to draw a random sample of size n. As before, we first
set a seed value, here just a number, so that the draw can be reproduced. Let’s use 101040.
47. → bsample 100
a. Let’s set our sample size to 100.
48. → describe
a. The “describe” command should show you that we do in fact have 100 observations in
our dataset now.
49. → regress salary ceoten, robust
a. We can perform our last regression again, but this time with our new, reduced set of
100 observations.
50. → use CEOSAL2_tut2.DTA, clear
a. Let’s return to our old dataset.
51. → describe
a. Note that we are back to our original 177 observations.
52. → set seed 050735
a. Now we’ll take a different random sample and perform the regression again. In this
case, let’s now use a different seed value, 050735.
53. → bsample 100
54. → regress salary ceoten, robust
a. Observe that the estimated coefficients are different from those obtained before, since
we took a different random sample of size 100.
55. → save CEOSAL2_tut2.dta, replace
56. □ File > □ Log > □ Close
a. Close the log and finish!
Summary Table of the Stata Commands seen in Tutorial 2
Command       Short description                                              Example
regress       performs linear regression on variables                        regress depvar indepvar, option
              Note: depvar is on the vertical axis and indepvar on the horizontal axis; the option robust can be used to obtain correct standard errors when errors are heteroskedastic
twoway        plots twoway graphs (scatter, line, etc.)                      twoway scatter variable1 variable2
              Note: when the only type of graph is scatterplot or line, “twoway” may be omitted when inputting the command
twoway lfit   adds a line of best fit to the graph                           twoway (scatter variable1 variable2) (lfit variable1 variable2)
predict       obtains predictions, residuals, etc., after estimation         predict variable, option
              Note: the option residuals generates residuals
rvpplot       plots the residual on the vertical axis and the specified variable on the horizontal axis     rvpplot variable
              Note: variable can be, for example, the x variable of a regression
rvfplot       plots the residual on the vertical axis and the fitted y on the horizontal axis               rvfplot, options
              Note: some examples of options are yline(), mlabel(), mlabsize()
bsample       draws bootstrap samples (random samples with replacement) from the data in memory            bsample sample_size
              Note: before inputting the command, set seed number
set seed      sets the seed value before generating a sample                 set seed number
STATA Tutorial #3
If you need any additional guidance, or are having other issues with STATA, try the following:
Attend office hours, the exact times of which can be found on TED.
Use the “help” command on STATA or Google (e.g. help scatter if you want clarification on how
the “scatter” command works).
Send questions to [email protected].
--------------------------------------------------------------------------------------------------------------------------------
1. clear
2. → cap log close
a. The “cap log close” command, in this case, tells STATA to close any log files you may
currently have open.
3. cd "CURRENT DIRECTORY PATH"
a. The “cd” command will set the current directory in Stata. This is the directory where your data are saved and where you want the log files, graphs, etc. to be saved. In order for Stata to find that folder we need to indicate a CURRENT DIRECTORY PATH.
b. To get this to work, create a folder on your desktop. In that folder create two other folders, one called “logs” and the second called “data”. Save your data (i.e. dta files) in the “data” folder.
c. To find out the CURRENT DIRECTORY PATH, right click on either the logs or data folder, then click on “Properties”. In the window that pops up, copy the path that you find to the right of “Location” and paste it in place of the words CURRENT DIRECTORY PATH after cd. Don’t forget to keep the quotes, and make sure they are straight quotes.
d. Example: cd "C:\Desktop\Stata Tutorial 3\" will set the current directory to be the folder called “Stata Tutorial 3” on the “Desktop” of drive “C”.
4. log using logs\tutorial3.log, replace
a. This allows you to begin a new log (which you will need to do in order to turn in your homework assignments). Make sure to save your log as a .log to receive full points on your homework assignment! The replace option will replace any existing log file.
5. use data\vote1.dta, clear
a. Begin by opening the dataset (vote1.dta). The clear option will clear any existing data from Stata's memory.
6. → save vote1_out.dta, replace
a. We don’t want to actually alter the original dataset (vote1.dta) so we will save it under a new name – in this case, “vote1_out.dta.”
b. The “replace” option here tells STATA to overwrite any existing file that is already named vote1_out.dta.
7. → describe
a. The “describe” command shows us what our dataset contains: the number of observations, variables, etc. Often, it will also give a brief description of what each variable represents.
8. generate id=_n
a. Let’s generate an id for each observation. Using this command, we now have the observations numbered.
9. browse
a. Notice how there's a new variable (last column), the one you just generated (id). Also notice the units in which the variables are measured: for example, voteA and prtystr are in percentage points, so a value of 43 for voteA means that candidate A got 43% of the votes.
10. reg voteA expendA expendB, robust
a. Let’s start by regressing the percentage vote received by the incumbent on the campaign expenditures incurred by each candidate.
b. In the top right corner, you will find, among others, the overall F-statistic (test of the joint hypothesis that all the slope coefficients are zero), the R-squared and what we call SER (standard error of the regression), which STATA calls Root MSE (root mean squared error). In the table that follows, you find the 3 estimates of the coefficients, the robust standard errors and the t-statistics (which test the hypothesis that each individual coefficient is zero).
c. Now, let’s interpret the meaning of the estimated regression coefficients.
i. When expenditures for both parties are 0, the percentage of votes received by
candidate A (the incumbent) is predicted to be 49.6 percentage points, on average.
ii. An increase in expenditures by candidate A of $1000 is predicted to increase, on
average, his/her vote share by 0.038 percentage points, keeping candidate B's (the
challenger) expenditures constant.
iii. For each $1000 increase in expenditures by candidate B, candidate A will lose, on
average, about .036 percentage points, when candidate A's expenditures are held
constant.
11. display _b[_cons] + _b[expendB]*2 + _b[expendA]
a. Use the command to show the predicted percentage of votes for candidate A when
expendA = 1 (that is, $1000) and expendB = 2.
12. test expendA expendB
a. This tests the joint hypothesis that both coefficients are equal to zero.
13. test expendA
a. To test the null hypothesis that the coefficient on expendA is equal to 0 (against the alternative that it is different from 0), we can use the command test as shown above.
b. Since the p-value is smaller than 0.01, we reject the null hypothesis.
14. test (expendA=1) (expendB=0)
a. We use this to test the joint hypothesis that the coefficient on expendA is equal to 1 and
that the coefficient on expendB equals 0.
To comment on the fit of the model, notice that both slope coefficients are highly
significant and the R-squared shows that this model explains about 53% of the
variance of vote share.
i. The SER (Root MSE) indicates that the typical deviation from the predicted value of
each electoral district is about 11.6 percentage points, but this number is hard to
evaluate in isolation.
In short, this is a reasonably good fit for a model.
15. sum expendA expendB
display _b[_cons] + _b[expendA]*310.611 + _b[expendB]*305.0885
a. To predict the percentage of votes for candidate A at the average expenditures of A and B, first find the averages of expendA and expendB using the command sum (above).
b. Then multiply the coefficient of each variable by the average found in point a and add the intercept.
16. sum expendA
a. We can see what happens to the percent vote for the incumbent if incumbent campaign
spending increases by one standard deviation, while the challenger's expenditures remain fixed.
17. display _b[expendA]*280.9854
a. Multiply the coefficient for expendA by its standard deviation.
b. All else equal, a one standard deviation increase in expenditures by the incumbent
would lead to an increase in vote share of about 10.8 percentage points.
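Steps 15 to 17 copy means and standard deviations from the summarize output by hand. As a sketch of an alternative, the same two numbers can be computed from stored results. It assumes the vote data are in memory; mA, mB and sdA are hypothetical scalar names.

```stata
* Assumption: the vote data (voteA, expendA, expendB) are in memory.
* mA, mB and sdA are hypothetical scalar names.
quietly regress voteA expendA expendB, robust
quietly summarize expendA if e(sample)
scalar mA  = r(mean)
scalar sdA = r(sd)
quietly summarize expendB if e(sample)
scalar mB  = r(mean)
* Step 15: predicted vote share at average expenditures
display "predicted voteA at the means = " _b[_cons] + _b[expendA]*mA + _b[expendB]*mB
* Step 17: effect of a one standard deviation increase in expendA
display "effect of +1 SD in expendA   = " _b[expendA]*sdA
```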
18. gen lnvoteA=log(voteA)
gen lnexpendA=log(expendA)
reg lnvoteA lnexpendA expendB, robust
a. Suppose you want to know the percentage change in voteA for a 1% change in expendA. You can directly obtain this result from the regression by running a log regression. Keeping expenditure for candidate B constant, a 1% increase in expenditure for candidate A corresponds to a 0.17% increase in the percentage of votes received by candidate A.
19. generate expendA_sq= expendA^2
reg voteA expendA expendA_sq, robust
a. Imagine you are the adviser for an incumbent candidate. You come across a
theory that there are diminishing marginal returns to campaign expenditures by
incumbent candidates.
b. You want to test this theory, so you decide to model the relationship between
percent vote and expenditures for the incumbents as a quadratic function.
i. What do the regression results show you?
ii. There appear to be diminishing marginal returns to expenditures. Notice
that the coefficient on the squared value of incumbent expenditures is
negative.
iii. This indicates that each additional increase in expenditures yields a smaller
return than the previous one. Eventually, we will reach a point where
increasing expenditures actually costs an incumbent votes. How do you
explain this turning point?
iv. A possible explanation is that airwaves become fully saturated and over-
exposure leads voters in a particular district to turn against the candidate.
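The turning point discussed above can also be computed from the coefficients: the fitted curve b0 + b1*x + b2*x^2 peaks where b1 + 2*b2*x = 0, that is at x = -b1/(2*b2). The sketch below illustrates this on simulated data with made-up coefficients (not the homework dataset); the scalar name turn is only illustrative.

```stata
* Simulated stand-in data (made-up numbers, NOT the homework dataset)
clear
set seed 120
set obs 173
generate expendA    = 1000*runiform()
generate expendA_sq = expendA^2
generate voteA      = 40 + 0.08*expendA - 0.00006*expendA_sq + rnormal(0, 1)

quietly regress voteA expendA expendA_sq, robust

* The marginal effect b1 + 2*b2*x is zero at x = -b1/(2*b2)
scalar turn = -_b[expendA]/(2*_b[expendA_sq])
display "Estimated turning point (in the units of expendA): " turn

* nlcom reports the same quantity with a delta-method standard error
nlcom -_b[expendA]/(2*_b[expendA_sq])
```

The formula only describes a maximum when the coefficient on the squared term is negative, which is the diminishing returns case.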
20. twoway (scatter voteA expendA) (qfit voteA expendA), legend(order(1 2
"Quadratic Fit"))
a. We plot the estimated relation.
b. Scatter shows you the points in your sample, qfit plots the estimated quadratic
relationship
21. twoway (scatter voteA expendA) (qfit voteA expendA) (lfit voteA expendA),
legend(order(1 2 "Quadratic Fit" 3 "Linear Fit"))
a. In this graph, we compare the quadratic fit with the linear fit.
To test the theory, beyond visual comparison of the two fits, we can formally test the
hypothesis that the relationship between voteA and expendA is linear, against the
alternative that it is nonlinear. If the relationship is linear, the coefficient on expendA_sq
is zero. The t-statistic for this test is -6, thus we reject the null hypothesis. There is
evidence that the relationship is nonlinear.
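The linearity test described here can be run either by reading the t-statistic from the regression table or with the test command. For a single coefficient the two agree, because the F-statistic reported by test is the square of the t-statistic. A minimal sketch on simulated data (made-up coefficients, not the homework dataset; the scalar name t_stat is only illustrative):

```stata
* Simulated stand-in data (made-up numbers, NOT the homework dataset)
clear
set seed 120
set obs 173
generate expendA    = 1000*runiform()
generate expendA_sq = expendA^2
generate voteA      = 40 + 0.08*expendA - 0.00006*expendA_sq + rnormal(0, 1)

quietly regress voteA expendA expendA_sq, robust

* t-statistic for H0: coefficient on expendA_sq = 0 (linear relationship)
scalar t_stat = _b[expendA_sq]/_se[expendA_sq]

* Same hypothesis with the test command; it reports F = t^2
test expendA_sq
display "t = " t_stat "   t^2 = " (t_stat^2) "   F = " r(F)
```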
22. display (_b[_cons]+ _b[expendA]*110+_b[expendA_sq]*110^2) -
(_b[_cons]+_b[expendA]*100+_b[expendA_sq]*100^2)
23. display (_b[_cons]+_b[expendA]*510+_b[expendA_sq]*510^2) -
(_b[_cons]+_b[expendA]*500+_b[expendA_sq]*500^2)
a. To show that there are diminishing marginal returns to campaign expenditures, we
compute the effect of increasing campaign expenditure by $10,000, when
spending is $100,000 and when spending is $500,000.
i. Adding an additional $10,000 in spending after having already spent
$100,000 will lead to an additional 0.69 percentage points in voting for
candidate A.
ii. But adding an additional $10,000 in spending after having already spent
$500,000 will only lead to an additional 0.23 percentage points in voting
for candidate A.
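In each of these differences the constant cancels, so the change from 100 to 110 is 10*b1 + (110^2 - 100^2)*b2 = 10*b1 + 2100*b2, and the change from 500 to 510 is 10*b1 + 10100*b2. As a sketch, the lincom command computes such linear combinations and also reports a standard error. The data below are simulated with made-up coefficients and are not the homework dataset.

```stata
* Simulated stand-in data (made-up numbers, NOT the homework dataset)
clear
set seed 120
set obs 173
generate expendA    = 1000*runiform()
generate expendA_sq = expendA^2
generate voteA      = 40 + 0.08*expendA - 0.00006*expendA_sq + rnormal(0, 1)

quietly regress voteA expendA expendA_sq, robust

* Effect of moving expendA from 100 to 110
lincom 10*expendA + 2100*expendA_sq

* Effect of moving expendA from 500 to 510
lincom 10*expendA + 10100*expendA_sq
```

The point estimates match what the longer display commands produce; lincom adds a confidence interval for each effect.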
24. count if expendA > 700
a. The visual analysis of the scatter plot reveals that there is a turning point at
around $700,000 in spending. We want to see if there are a lot of districts with
incumbent expenditures over $700,000.
25. list id state district expendA if expendA > 700
a. To see which districts those are, you can use the list command.
26. gen shareA_dummy=(shareA>50)
gen voteA_dummy=(voteA>50)
tab shareA_dummy
tab voteA_dummy
reg voteA_dummy shareA_dummy, robust
a. Suppose candidate A wants to know the effect of spending more than
candidate B on the probability of getting more than 50% of the votes. You can
find that out by generating the variables above and regressing one dummy on the other.
b. Having the higher expenditure increases the probability of winning the majority of
votes by 84 percentage points (0.84*100).
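When the only regressor is a dummy, the slope equals the difference between the two groups in the mean of the dependent variable, here the difference in the fraction of races won. The sketch below checks this on simulated data (made-up numbers, not the homework dataset); the scalar names win_hi and win_lo are only illustrative.

```stata
* Simulated stand-in data (made-up numbers, NOT the homework dataset)
clear
set seed 120
set obs 173
generate expendA = 1000*runiform()
generate expendB = 1000*runiform()
generate shareA  = 100*expendA/(expendA + expendB)
generate voteA   = 25 + 0.5*shareA + rnormal(0, 8)
generate shareA_dummy = (shareA > 50)
generate voteA_dummy  = (voteA > 50)

quietly regress voteA_dummy shareA_dummy, robust

* Fraction of races won in each spending group
quietly summarize voteA_dummy if shareA_dummy == 1
scalar win_hi = r(mean)
quietly summarize voteA_dummy if shareA_dummy == 0
scalar win_lo = r(mean)

display "Slope: " _b[shareA_dummy] "   Difference in win rates: " (win_hi - win_lo)
```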
27. reg voteA expendA expendA_sq expendB prtystrA, robust
a. There are other factors besides incumbent spending that influence votes. Vote
share of the incumbent is also affected by the opponent's spending (expendB) and
the strength of the incumbent's party (prtystrA). We run a regression controlling for
those factors.
b. All coefficients are significantly different from zero, at the 1% significance level.
There are still diminishing marginal returns to incumbent campaign expenditure.
c. With other variables held constant, an increase of $1000 in the opponent's
spending will cost the incumbent 0.03 percentage points of the vote share.
d. An increase in the strength of the incumbent's party of 1 percentage point,
keeping all other variables constant, will yield a 0.32 percentage point increase in
the incumbent's vote share.
e. With this model we have now explained 65% of the variation in the vote share of
the incumbent. More importantly, we have reduced the SER, which indicates that
we are starting to achieve a relatively good fit.
28. sum expendA expendB prtystrA
29. display _b[_cons]+_b[expendA]*310.611+_b[expendA_sq]*(310.611^2)+_b[expendB]*305.0885+_b[prtystrA]*65
a. You want to predict the incumbent share of the vote, if party strength were 65
percent, and the candidates kept their expenditures at their mean levels.
b. About 58.46% of the vote
30. reg voteA lexpendA, robust
a. In general, when you want to run a regression with a variable in logarithmic form,
you have to generate that variable first, for example by writing generate
ln_expendA=ln(expendA). In this case, the logs of campaign expenditures for each
candidate are already variables in this dataset, so we don't need to generate them.
b. The coefficient is highly significant and indicates that a 1% increase in
expenditure would yield an increase in vote share of (6.51/100)=0.0651
percentage points.
31. twoway (scatter voteA lexpendA) (lfit voteA lexpendA), legend(order(1 "Actual Values" 2 "Fitted Values"))
a. Plot the relationship between voteA and log(expendA) and the fitted line.
32. reg voteA lexpendA lexpendB prtystrA, robust
a. Now, we keep the linear-log specification but, fearing omitted variable bias, we
add control variables log(expendB) and prtystrA.
b. Interpretation of results: In a linear-log model, a 1% increase in a regressor changes
the dependent variable by the coefficient divided by 100. A 1% increase in incumbent
expenditures leads to an increase in incumbent vote share of about (6.08/100)=0.0608
percentage points, keeping all other variables constant.
c. A 1% increase in challenger expenditures leads to a reduction in incumbent vote
share of about (6.62/100)=0.0662 percentage points, keeping all other variables constant.
d. An increase in the incumbent's party strength of 1 percentage point, leads to an
increase in incumbent vote share of 0.15 percentage points, keeping all other
variables constant.
e. We are confident in the results of this model. All variables are highly
significant. We have explained 79% of the variation in incumbent vote share and
the SER has been reduced to only 7.7 percentage points.
33. display _b[_cons]+_b[lexpendA]*(ln(400))+_b[lexpendB]*(ln(500))+_b[prtystrA]*50
a. Compute the predicted vote share for your candidate if his/her expenditures are
$400,000, the opponent's are $500,000, and the incumbent's party strength is
50%.
34. display _b[_cons]+_b[lexpendA]*(ln(600))+_b[lexpendB]*(ln(500))+_b[prtystrA]*50
a. Compute what happens if your candidate increases expenditures to $600,000,
keeping the other variables constant.
35. display _b[lexpendA]*(ln(600)-ln(400))
a. The increase in your candidate's vote share would be 2.47 percentage points, from
48.01 to 50.48 percent. You can compute this increase directly by using the
command above.
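The shortcut in step 35 works because every term that does not involve lexpendA cancels when the two predictions are subtracted, and ln(600) - ln(400) = ln(600/400). A minimal sketch on simulated data (made-up coefficients, no control variables, not the homework dataset; the scalar names at400 and at600 are only illustrative):

```stata
* Simulated stand-in data (made-up numbers, NOT the homework dataset)
clear
set seed 120
set obs 173
generate expendA  = 10 + 990*runiform()
generate lexpendA = ln(expendA)
generate voteA    = 15 + 6*lexpendA + rnormal(0, 5)

quietly regress voteA lexpendA, robust

* Predictions at 400 and 600; the constant cancels in the difference
scalar at400 = _b[_cons] + _b[lexpendA]*ln(400)
scalar at600 = _b[_cons] + _b[lexpendA]*ln(600)
display "Change in predicted vote share: " (at600 - at400)

* Same number in one step
display "Shortcut: " _b[lexpendA]*ln(600/400)
```

With control variables in the regression, their terms drop out of the difference in the same way as the constant, as long as they are held at the same values in both predictions.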
36. save vote1_out.dta, replace
clear
a. Save the dataset under a new name, then clear it from memory to close it.
37. log close
a. Close the log.
Summary Table of the Stata Commands seen in Tutorial 3
Command | Short description | Example
regress | Run a linear regression on multiple variables | reg voteA expendA expendB, robust
regress | Run a log regression on multiple variables | reg lnvoteA lnexpendA expendB, robust
test | Test the null hypothesis that a coefficient equals 0 | test expendA
test | Test the joint hypothesis that the coefficient on variable 1 equals 1 and the coefficient on variable 2 equals 0 | test (expendA=1) (expendB=0)
twoway | Plot the estimated relation between two variables | twoway (scatter voteA expendA) (qfit voteA expendA) (lfit voteA expendA), legend(order(1 2 "Quadratic Fit" 3 "Linear Fit"))
generate | Generate dummy variables | gen shareA_dummy=(shareA>50)
generate | Generate an id for each observation | generate id=_n
count | See how many districts are over a particular value | count if expendA > 700
list | Show the districts that are over the particular value | list id state district expendA if expendA > 700