Disclaimer
We would like to acknowledge that in preparing this tutorial we have used materials from
various sources listed in the bibliography. This tutorial will be used for the lab exercise of the
course “HPRO 805: Applied Research in Public Health”.
Table of Contents
Section 1: Getting Started
1.1 Convention:
1.2 Executing a Command
1.3 Working along with the tutorial:
1.4 The Stata screen
Using an existing dataset
1.5 An example of a short Stata session
1.6 Quick Review
1.7 Exercises
Section 2: Entering data
2.1 Introduction
2.2 Creating a dataset
2.3 An example of questionnaire
2.4 Developing a coding system
2.5 Entering data
2.6 Saving your dataset
2.7 Checking the data
2.8 Quick Review
2.9 Exercises
Section 3: Preparing for data analysis
3.1 Introduction
3.2 Planning your work
3.3 Loading data and inputting commands
3.4 Looking at data
3.5 Creating variable and value labels
3.6 Creating and modifying variables
3.6.1 Generate and Replace
3.6.2 Renaming variables
3.6.3 Recoding variables
3.7 Creating Scales
3.8 Quick Review
3.9 Exercises
Section 4: Working with commands, do-files, and results
4.1 Introduction
4.2 How Stata commands are constructed
4.3 Creating and Saving a do-file
4.4 Copying your results to a word processor
4.5 Quick Review
4.6 Exercises
Section 5: Descriptive Statistics and Graphs
5.1 Introduction
5.2 Where is the center of a distribution?
5.3 How dispersed is the distribution?
5.4 Statistics and graphs: Unordered categories
5.5 Statistics and graphs: Ordered categories and variables
5.6 Statistics and graphs: Quantitative variables
5.7 Cross-tabulation for two categorical variables
5.8 Chi-squared test
5.9 Quick Review
5.10 Exercises
Bibliography
Section 1: Getting Started
1.1 Convention:
Listed below are the conventions that are used throughout this tutorial.
Typewriter font. This tutorial will use this font when something would be meaningful to
Stata as input. It will also be used in the tutorial to indicate Stata’s output.
This tutorial will use a typewriter font to indicate the text to type in the Command window.
Because Stata commands do not have any special characters at the end, any punctuation mark
at the end of a command in this tutorial is not part of the command.
This tutorial also uses the typewriter font for variable names and for names of datasets. In
general, this tutorial uses the typewriter font whenever the text is something that can be typed
in Stata Command window or when the text is something that Stata might print as output.
1.2 Executing a Command
After you type a command you need to execute the command. Press ENTER on your
keyboard. We may not mention “press ENTER” in this tutorial after every command;
however, you have to press ENTER to execute that command.
1.3 Working along with the tutorial:
We cannot say it too often: the only way to learn how to analyze data is to analyze data
yourself. We strongly recommend that you reproduce our examples in Stata as you read this
tutorial. A line that is written in this font and begins with a period represents a Stata
command, and we encourage you to enter that command in Stata. Typing the commands and
seeing the results will help you better understand the text, since we sometimes omit output to
save space.
What does the dot prompt before a command mean?
When we show a listing of Stata commands, we place a dot and a space in front of each
command. When you enter these commands in the Command window, you enter the
command itself without the dot prompt or the space before the command. We include
these because Stata always shows commands this way in the Results window.
1.4 The Stata screen
Figure 1.4.1. The Stata Screen
When you open a file that contains Stata data, which we will call a Stata dataset, a list of
variables will appear in the Variables window. The Variables window reports the name of each
variable (for example, abortion), a label for the variable (for example, Attitude toward
abortion), the type of variable (for example, float), and the format of the variable (for
example, %8.0g).
When Stata executes a command, it prints the results or output in the Results window. First,
it prints the command preceded by a . (dot) prompt, and then it prints the output. The
commands you run are also listed in the Review window. If you click on one of the
commands listed in the Review window, it will appear in the Command window. If you
double-click on one of the commands listed in the Review window, it will be executed. You
will then see the command and its output, if any, in the Results window.
Using an existing dataset
Section 2 discusses how to create your own dataset, save it, and use it again. We will also
upload datasets for the class on Blackboard, and we recommend that you create a new folder
called “Stata” in C:\ and save those datasets in that folder. The location of your
datasets will then be C:\Stata. For now, we will use a simple dataset, cancer.dta, posted on
Blackboard. Download the file cancer.dta and save it in C:\Stata. Click once in the
Command window to put the cursor there, and then type the command use
"C:\Stata\cancer.dta", clear. Type the quotation marks as plain straight double quotes.
The Command window should look like the one in
figure 1.5.1. Then press ENTER on your keyboard.
Figure 1.5.1. Stata command to open cancer.dta
The command use filename, clear reads a previously saved dataset. If filename is
specified without an extension, Stata assumes it to be .dta. If your filename contains
embedded spaces, remember to enclose it in double quotes. Now that we have some data read
into Stata, type describe in the Command window and press ENTER. The command
describe will produce a brief description of the contents of the dataset.
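. describe

Contains data from C:\Stata\cancer.dta
  obs:            48                          Patient Survival in Drug Trial
 vars:             8                          21 Oct 2010 10:23
 size:           768 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
studytime       int    %8.0g                  Months to death or end of exp.
died            int    %8.0g                  1 if patient died
drug            int    %8.0g                  Drug type (1=placebo)
age             int    %8.0g                  Patient's age at start of exp.
-------------------------------------------------------------------------------
Sorted by: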
The description includes a lot of information: the full name of the file, cancer.dta
(including the path used to locate the file); the number of observations (48); the number of
variables (8); the amount of memory the data consume (768 bytes) and how much of Stata’s
memory is still available (99.9% of memory free); a brief description of the dataset
(Patient Survival in Drug Trial); and the date the file was last saved (21 Oct 2010
10:23). The body of the table displayed shows the names of the variables on the far left and
the labels attached to them on the far right. We will discuss the middle columns later.
Now that you have opened cancer.dta, note that the Variables window lists the four
variables studytime, died, drug, and age.
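The steps above can also be written as a short sequence of commands. This is only a minimal sketch: it assumes cancer.dta has already been saved in C:\Stata as recommended, and the cd step is our addition, used so that the file name can be given without its path. The commands are shown without the dot prompt so that they can be typed or pasted as they are.

```stata
* Assumes cancer.dta was saved in C:\Stata, as recommended above
cd "C:\Stata"

* Load the dataset, discarding any data currently in memory
* (the .dta extension is assumed when none is given)
use cancer, clear

* Show the variables with their storage types, formats, and labels
describe
```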
1.5 An example of a short Stata session
If you do not have cancer.dta loaded, type the command use "C:\Stata\cancer.dta",
clear. We will execute a basic Stata analysis command. Type summarize in the Command
window and then press ENTER.
In the Results window, the summarize command will display the number of observations
(Obs, also called cases or N), the mean, the standard deviation, the minimum value, and the
maximum value for each variable. The output from summarize is useful for
continuous variables but not for categorical ones. For example, the mean and standard
deviation of gender do not make sense.
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   studytime |        48        15.5    10.25629          1         39
        died |        48    .6458333    .4833211          0          1
        drug |        48       1.875    .8410986          1          3
         age |        48      55.875    5.659205         47         67
The first line of output displays the dot prompt followed by the command. After that, the
output appears as a table. As you can see, there are 48 observations in this dataset.
Observations is a generic term. These could be called participants, patients, subjects,
organizations, cities, or countries depending on your field of study. In Stata, each row of data
in a dataset is called an observation. The average, or mean, age is 55.875 years with a
standard deviation of 5.659, and the subjects are all between 47 (the minimum) and 67 (the
maximum) years old.
You might want to select specific variables to summarize instead of summarizing them all.
You can do this either by typing the names of the variables after summarize or by typing
summarize and then clicking on the variable names in the Variables window. For example, typing
summarize studytime age will display statistics only for the two variables named
studytime and age.
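As a small sketch of selecting variables, the commands below summarize two variables and then one. The detail option in the last command is not discussed in this section; we include it only as an illustration of how an option follows the comma. The path assumes the C:\Stata folder used earlier.

```stata
* Assumes cancer.dta is in C:\Stata
use "C:\Stata\cancer.dta", clear

* Summary statistics for the two selected variables only
summarize studytime age

* The detail option adds percentiles, variance, skewness, and kurtosis
summarize age, detail
```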
We will do one more thing in this Stata session: we will make the histogram for the age
variable, shown in figure 3.
Copying and Pasting your Output from the Stata Results window to a Word document
You can copy the output from the Results window. To do this, highlight (select)
the output you need to copy, right-click, and select Copy Text. Then paste it into your
Word file by using Ctrl+V. You will probably need to change the font to Courier
New and reduce the font size (for example, to 9 points) after pasting to prevent the
lines from wrapping.
Figure 3. Histogram for age
A histogram is just a graph that shows the distribution of a variable, such as age, that takes on
many values. Simple graphs are simple to create. Just type the command histogram age in
the Command window, and Stata will produce a histogram using reasonable assumptions.
At first glance, you may be happy with this graph. Stata used a formula to determine that six
bars should be displayed, and this is reasonable. However, Stata starts the lowest bar (called a
bin) at 47 years old, and each bin is 3.33 years wide (this information is displayed in the
Results window) even though we are not accustomed to measuring years in thirds of a year.
Also notice that the vertical axis measures density, but we might prefer that it measure
frequency, that is, the number of people represented by each bar.
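The bin width of 3.33 years can be checked by hand: it is the range of age divided by the number of bins. The sketch below uses the minimum (47) and maximum (67) from the summarize output and the six bins mentioned above; it needs no dataset in memory.

```stata
* Range of age (maximum 67 minus minimum 47) divided by the six bins
display (67 - 47) / 6
```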
Below you will find a command to improve the histogram (figure 4). In reading this
command you will want to ignore the opening dot (Stata prints this in front of commands in
the Results window, but the dot is not part of the command and you do not type it). Stata
prints a > sign at the start of the second and third lines, which might be confusing. Stata uses the
ENTER key to submit a command. Because of this, Stata sees the entire command as one
line. To print the entire line within the confines of the Results window, Stata inserts the > at
each line break. If you want to enter this command in the Command window, simply type the entire
thing without the > signs and let Stata do the wrapping as needed in the Command window. Never
press the ENTER key until you have entered the entire command.
. histogram age, width(2.2) start(45) frequency title(Age Distribution of
> Participants in Cancer Study) note(Data: Sample cancer dataset)
> legend(on) scheme(s1mono)
Figure 4. Improved Histogram for age
If you do want to enter a long command in the Command window, remember to type it as
one line. Whenever you press ENTER, Stata assumes that you have finished the command
and are ready to submit it for processing.
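When a long command is stored in a do-file (do-files are covered in Section 4) rather than typed in the Command window, it can be split across lines with the /// continuation marker. The sketch below is one possible layout of the same histogram command; the line breaks are our choice, and the path assumes the C:\Stata folder used earlier.

```stata
* Do-file sketch: /// tells Stata that the command continues on the next line.
* This works in a do-file, not in the Command window.
use "C:\Stata\cancer.dta", clear

histogram age, width(2.2) start(45) frequency ///
    title(Age Distribution of Participants in Cancer Study) ///
    note(Data: Sample cancer dataset) ///
    legend(on) scheme(s1mono)
```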
To finish our Stata session, we need to close Stata. Type exit in the Command window and
press ENTER.
1.6 Quick Review
In this section we covered:
The font and punctuation conventions that will be used throughout this tutorial
The Stata windows
How to open a sample Stata dataset
How to summarize the variables
How to create and modify a simple histogram
1.7 Exercises
1. For this exercise you need the dataset nhanes.dta, which is posted on Blackboard. Save
nhanes.dta in the folder C:\Stata. Open nhanes.dta. Run two commands, describe
and summarize, and answer the following questions. [Hint: To continue with more results in
the Results window, press SPACEBAR for the next page of the results or press any other
key except “q” for the next line of the results. Pressing “q” will quit the results.]
a. How many observations and variables are there in the dataset nhanes.dta?
b. What are the names of the variables whose labels are Marital Status, Annual
household income, and Smoked at least 100 cigarettes in life?
c. What are the variable labels for the variables riagendr, smd630, bmxwt, and
bmxht?
d. What is the mean age of the respondents? (HINT: You can get the mean age of the
respondents using the variable ridageyr.) What is the range of age in
the distribution? (Note: Range = Maximum observation - Minimum observation)
e. Can you get the mean of the variable riagendr? Explain.
f. Show the number of observations, the mean, the standard deviation, the minimum
value, and the maximum value for the two variables bmxwt and bmxht.
Section 2: Entering data
2.1 Introduction
This section shows how to create a dataset and enter data into the dataset.
2.2 Creating a dataset
In this section, you will learn how to create a dataset. Data entry and verification can be
tedious, but these tasks are essential for you to get accurate data.
Stata almost always requires data that are set in a grid, like a table, with rows and columns.
Each row contains data for an observation (which is often a subject or a participant in a
study), and each column contains the measurement of some variable of interest about the
observation, for example, age, race, gender, etc.
There is more to a dataset than just a table of numbers. Datasets usually contain labels that
help the researcher use the data more easily and efficiently. In a dataset, the columns
correspond to variables, and variables must be named. We can also attach to each variable a
descriptive label (e.g. smoking status), which will often appear in the output of statistical
procedures. Since most data will be numbers, we can also attach value labels to the numbers
to clarify what the numbers mean (e.g. “smokers” for 1 and “non-smokers” for 0).
Variables and Items
In working with questionnaire data, an item from the questionnaire almost always
corresponds to a variable in the dataset. So if you ask for a respondent’s age, then you
will have a variable called age. Some questionnaire items are designed to be combined
to make scales or composite measures of some sort, and new variables will be created
to contain those items (e.g. bmi), but there is no single item corresponding to the scale
or construct on the questionnaire. The terms item and variable are often used
interchangeably, but they are not synonyms. An item always refers to one question. A
variable may be the score on an item or a composite score based on several items.
It is extremely helpful to pick descriptive names for each item. For example, you might call a
variable that contains people’s responses to a question about their state’s schools q23, but it
is often better to use a more descriptive name, such as schools. If there were two questions
about schools, you might want to name them as schools1 and schools2 rather than q23a
and q23b. If a set of items is to be combined into scales and the items are not intended to be used
alone, you might want to use names that correspond to the questionnaire items for the
original variables and reserve the descriptive names for the composite scores. This is useful
with complex datasets where several people will be using the same dataset. Each user will
know that q23a refers to question 23a, whereas descriptive names like schools1 may make
sense to one user but not to another. Whatever the logic of your naming, try to keep
names short, because you will type each variable name many times during your analysis.
The name age is preferable to age_of_respondents. In addition, some of Stata’s
tabular output is designed for short names, and longer names will be truncated.
Even with relatively descriptive variable names, it is usually helpful to attach a longer and
more descriptive label to each variable. We call these variable labels to distinguish them
from variable names. For example, you might want to label the variable smokestatus with
the label “Current smoking status”. The variable label gives us a clearer understanding of the
data stored in that variable.
For some variables, the meaning of the values in the data is obvious. If you measure people’s
height in meters, then it is clear what the values mean when you see them. For other variables,
the meaning of the values needs to be specified with value labels. For example, if the responses
“Yes” and “No” are coded as 1 and 0, respectively, you can make tables of data
much easier to understand by creating value labels to be displayed in the output in addition to
(or instead of) the numbers.
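The difference between variable labels and value labels can be sketched with a few commands. The three rows of data are made up for illustration, and yesno is a hypothetical name we chose for the value label; the variable names and label texts follow the examples in this tutorial.

```stata
* Three made-up observations, for illustration only
clear
input smoked100 smokestatus
1 1
0 .
1 2
end

* Variable labels describe what each variable contains
label variable smoked100 "Smoked at least 100 cigarettes in life"
label variable smokestatus "Current smoking status"

* Value labels describe what each numeric code means
* (yesno is a hypothetical label name)
label define yesno 1 "Yes" 0 "No"
label values smoked100 yesno

* The table now shows Yes and No instead of 1 and 0
tabulate smoked100
```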
2.3 An example of questionnaire
We have discussed datasets in general, so now let’s create one. Suppose that we conducted a
telephone survey of 20 people and asked each of them nine questions, which are shown in the
example questionnaire below. Our task is to convert the questionnaire’s answers into a
dataset that we can use with Stata.
Example Questionnaire:
Tobacco Survey
1. What is your gender?
___Female ___Male
2. What is your current age in years?
______
3. How many years of education have you completed?
___ 0-8 ___ 9-10
___11-12 ___13-16
___16-19 ___20+
4. Have you smoked at least 100 cigarettes in your entire lifetime?
___Yes ___No (if “No” thank the respondent and end the interview)
5. How old were you when you first started smoking cigarettes fairly regularly?
____ Years (if never smoked regularly enter “X” and GO TO 6)
(Note: “X” will be treated as missing in data entry process)
6. Do you now smoke cigarettes every day, some days, or not at all?
___ Every day ___ Some days ___ Not at all ___No response
(If “Every day” then GO TO 7, ELSE thank the respondent and end the interview)
7. On the average, about how many cigarettes do you now smoke each day?
____ Number of cigarettes
8. How soon after you wake up do you typically smoke your first cigarette of the day?
___ Hours ___Minutes
9. How many of your friends smoke?
___ None
___1-5 friends
___ 6-10 friends
___>10 friends
___Don’t Know
___Refused to answer
2.4 Developing a coding system
Statistics is most often done with numbers, so we need to have a numeric coding system for
the answers to the questions. Stata can use numbers or words. For example, we could type
Female if a respondent checked it. However, it is usually better to use some sort of numeric
coding, so you might type 1 if the respondent checked the option Male on the questionnaire
and 2 if the respondent checked Female. We will need to assign a number to enter for each
possible response for each of the items on the survey.
You will also need a short variable name for the variable that will contain the data for each
item. Variable names can contain uppercase and lowercase letters, numerals, and the
underscore character, and they can be up to 32 characters long. No blank spaces are allowed
for variable names. The variable name mother age would be interpreted as two variables,
mother and age. Generally you should keep your variable names to 10 characters or fewer,
but 8 or fewer is best. Variable names should start with a letter.
If appropriate you should explain the relationship between any numeric codes and the
responses as they appeared on the questionnaire. For an example, see the example codebook
(not to be confused with the Stata command codebook, which we will use later) for our
questionnaire that appears in table 2.4.1.
Table 2.4.1. Example codebook

Question                       Variable name  Variable label                  Value labels           Code
---------------------------------------------------------------------------------------------------------
Identification number          id             Record in order                 ENTER ID#              1 to 20

What is your gender?           gender         Gender of respondents           Male                   1
                                                                              Female                 2

What is your current age in    age            Age of respondents              ENTER AGE IN YEARS
years?

How many years of education    education      Highest level of educational    0-8                    8
have you completed?                           attainment                      9-19                   9 to 19
                                                                              20+                    20
                                                                              No response            -9

Have you smoked at least 100   smoked100      Smoked at least 100 cigarettes  Yes                    1
cigarettes in your entire                     in life                         No                     0
lifetime?                                                                     No response            -9

How old were you when you      initiation     Age at initiation               ENTER AGE IN YEARS
first started smoking                                                         No response            -9
cigarettes fairly regularly?

Do you now smoke cigarettes    smokestatus    Current smoking status          Every day              1
every day, some days, or not                                                  Some days              2
at all?                                                                       Not at all             3
                                                                              No response            -9

On the average, about how      numcig         Number of cigarettes per day    ENTER NUMBER
many cigarettes do you now                                                    No response            -9
smoke each day?

How soon after you wake up do  firstcig       Time to first cigarettes of a   ENTER TIME in MINUTES
you typically smoke your                      day (in minutes).               No response            -9
first cigarette of the day?

How many of your friends       peersmoke      Number of friends who smoke     None                   0
smoke?                                                                        1-5 friends            1
                                                                              6-10 friends           2
                                                                              >10 friends            3
                                                                              No response            -9
A codebook translates the numeric codes in your dataset back into the questions you asked
your participants and the choices you gave them for answers. Regardless of whether you
gather data with a computer-assisted interviewing system or with paper questionnaires, the
codebook helps to make sense of your data and your analyses. If you do not have a
codebook, you might not realize that everyone with eight or fewer years of education is
coded the same way, with an 8. That may have an impact on how you use that variable in
later analyses.
We have added an id variable to identify each respondent. In simple cases, we just number
the questionnaire sequentially. If we have a sample of 5,000 people, we will number the
questionnaires from 1 to 5,000, write the identification number on the original questionnaire,
and record it in the dataset. If we discover a problem in our dataset (e.g. somebody with a
coded value of 3 for gender), we can refer back to the questionnaire and determine the
correct value.
Some data will be missing: people may refuse to answer certain questions, interviewers
may forget to ask questions, equipment may fail to record a measurement, and so on. We need
a code to indicate that the data are missing. If we know why the response is missing, we will
record the reason too, so we may want to use different codes that correspond to the different
reasons the data are missing.
On surveys, respondents may refuse to answer, may not express an opinion, or may not have
been asked a particular question because of the answer to an earlier question. Here we might
code “refused to answer” as -4, “don’t know” as -3, “valid skip” as -2, and “missing for any
other reason” as -1. We should pick values that can never be a valid response. In this session
we will leave the cell “blank” for the missing values.
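If numeric missing-data codes such as -9 or -3 are entered, they must be turned into Stata missing values before analysis, or they will be treated as real numbers. The sketch below shows one way to do this with the mvdecode command, which is not otherwise used in this section. The three rows of data are made up, and the choice of .a for “don’t know” is only an illustration.

```stata
* Three made-up observations; -9 = no response, -3 = don't know
clear
input id numcig peersmoke
1 15 3
2 -9 -9
3 12 -3
end

* Turn the code -9 into Stata's system missing value (.)
mvdecode numcig peersmoke, mv(-9)

* Map a reason-specific code to an extended missing value (.a)
mvdecode peersmoke, mv(-3=.a)

* Missing values are now excluded from the statistics
summarize numcig peersmoke
```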
We will be entering the data ourselves, so after we administered the questionnaire to our
sample of 20 people, we prepared a coding sheet that will be used to enter the data. The
coding sheet originates from the days when data were entered by professional keypunch
operators, but it can still be useful. When you create a coding sheet, you are converting the
data from the format used on the questionnaire to the format that will actually be stored in the
computer (a table of numbers). The more the format of the questionnaire differs from a table
of numbers, the more likely it is that a coding sheet will help prevent errors.
In general, if you transcribe the data from the questionnaires to the coding sheet, you will
need to decide which responses go in which columns. Deciding this will reduce errors from
those who perform the data entry, who may not have the information needed to make those
decisions properly.
Whether you enter the data directly from the questionnaires or create a coding sheet will
depend largely on the study and on the resources that are available to you. Some experience
with data entry is valuable because it will give you a better sense of the problems you may
encounter, whether you or someone else enters the data; so, for our example questionnaire,
we have created a coding sheet shown in table 2.4.2
Table 2.4.2. Example coding sheet

id  gender  age  education  smoked100  initiation  smokestatus  numcig  firstcig  peersmoke
 1       2   25          8          1          11            1      15        15          3
 2       1   27          9          0
 3       2   29          8          0
 4       1   35         12          1          12            1      12        30          2
 5       2   54          9          1          12            1      15        60          2
 6       2   43          8          1          10            2
 7       2   47         16          1          10            1       7       120          0
 8       1   26          8          0
 9       1   19         10          1           9            3
10       1   20          9          1          10            1      14        10          2
11       2   19         11          1          11            1      13        10          3
12       2   17         12          1          10            2
13       2   26         15          0
14       1   24         15          1           9            1      10        90          2
15       1   23         16          1          11            1      11       180          2
16       1   22         10          1          12            1       3       240          1
17       2   21         13          1          13            1       9        60          1
18       1   20         13          1          12            3
19       2   41         14          1          15            2
20       1   38         19          1          16            1      20         5          3
Because we are not reproducing the 20 questionnaires in this tutorial, it may be helpful to
examine how we entered the data from one of them. We will use the 5th questionnaire. We
have assigned an id of 5, as shown in the first column. Reading from left to right, in the
second column, for gender, we have recorded a 2 to indicate a woman. The woman is 54
years old with 9 years of education. She has smoked at least 100 cigarettes in her lifetime
(smoked100=1). She started smoking fairly regularly at the age of 12 and is a daily smoker
(smokestatus=1). She smokes around 15 cigarettes a day and smokes her first cigarette
1 hour (60 minutes) after she wakes up in the morning. She has 6-10 friends who are
smokers (peersmoke=2). [Note: Refer to table 2.4.1 for the codes of particular value labels.]
2.5 Entering data
We will use Stata’s Data Editor to enter our data from the coding sheet in table 2.4.2. The
Data Editor provides an interface similar to that of a spreadsheet, but it has some features that
are particularly suited to creating Stata datasets. Before opening the Data Editor, you might
want to save any open files and then enter the command clear in the Command window.
This step will give you a fresh Data Editor in which to enter data. Enter the clear command
only if you want to start with a new dataset that has nothing in it. Then type edit in the
Command window and press ENTER. The Data Editor window is shown in figure 2.5.1.
18 | P a g e
Figure 2.5.1 Data Editor Window
Data are entered in the white columns, just as in other spreadsheets. Here we will only enter
the data for the first respondent to the example survey. In the first white cell under the first
column, enter the identification number of the first respondent, which is a 1. Press the Tab
key to move across to the next cell, and enter the value from the second column of the coding
sheet, which is a 2 because the first case is a woman. Keep entering the values and pressing
Tab until we have entered all the values for the first participant. After we have entered the
last number in the first row of the coding sheet, press Enter, which will
move the cursor to the second row of the Data Editor. Press the Home key to move to the
first column of this row.
After we have the data entered for the first respondent, let us modify the current variable
names (var1, var2,…) according to our codebook. Double-click on the gray cell at the top
of the first column, which now contains the variable name var1. Double-clicking on this cell
opens the variable properties dialog box. Figure 2.5.2 shows how we can use this dialog box
to change the variable name to id and the label to Record in order. Use the Tab key to jump
from the ‘Name’ box to the ‘Label’ box.
Figure 2.5.2 Variable name and variable label
Click Apply, then in the Data Editor window click var2, name it gender, and set its
label to Gender of Respondents. Click Apply and continue renaming and labeling the
remaining generic variables as listed in table 2.4.1. Close the
dialog box.
Enter the values of all the variables for the remaining respondents using the coding sheet in
table 2.4.2. Click the Save icon and close the Data Editor window.
All our data here are numeric, none of the data contain decimals, and none are wider than
eight digits, so we can leave the format alone. Once you have defined a variable as numeric,
Stata will warn you if you try to enter data that are not numeric, which can help reduce
errors.
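As a cross-check on the point-and-click steps above, the same kind of data entry can be sketched with commands. The block below is only an illustration, not the method used in this tutorial: it types the first five rows of table 2.4.2 with Stata's input command (missing cells are written as a period) and labels two of the variables.

```stata
* Sketch only: enter the first five rows of table 2.4.2 from the keyboard
clear
input id gender age education smoked100 initiation smokestatus numcig firstcig peersmoke
1 2 25  8 1 11 1 15 15 3
2 1 27  9 0  .  .  .  . .
3 2 29  8 0  .  .  .  . .
4 1 35 12 1 12 1 12 30 2
5 2 54  9 1 12 1 15 60 2
end

* Variable labels as entered in the variable properties dialog box
label variable id "Record in order"
label variable gender "Gender of Respondents"

* Show what was entered
list
```

The remaining rows and labels would follow the same pattern; the Data Editor route described above produces the same dataset.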
Editing Variables
Using the Data Editor’s variable properties dialog box we can modify the variables that
we are interested in. Type edit in the Command window and run it. The Data Editor
window opens. Double-click on the variable we want to modify and
change its name and label. Close the dialog box, click the
Save icon in the Data Editor window, and close it.
Recoding Variables
Suppose, for example, that we want to recode the age variable into 3 groups: 17-29 years, 30-
39 years, and 40+ years. For this we need to assign codes 1, 2, and 3 to these
age groups under the new variable agegroup in our coding sheet. The observations
with id 1, 2, 3, 4, … would have a recoded age under agegroup of 1, 1,
1, 2, … respectively. This will be our modified coding sheet. To create a new variable as a recode
of an existing variable, open the Data Editor window and enter the recoded value of the
first observation in an empty cell where we want our new recoded variable. Then
double-click the new variable and modify its Name and Label. Close the
variable properties dialog box and complete the entries under the new variable for all
the observations from our modified coding sheet. Click the Save icon in the Data Editor
window. Close the Data Editor window. You will learn a more efficient way to do this in
a later section.
2.6 Saving your dataset
If you look at the Results window in Stata, you can see that the dialog box has done a lot of
work for you. The Results window shows the commands that were run to rename
and label the variables. These also appear in the Review window. The Variables window lists
all of your variables. We have now created our first dataset. Let’s save this dataset under the file
name firstsurvey:
. save "C:\Stata\firstsurvey.dta", replace
2.7 Checking the data
We have created a dataset, and now we want to check the work we did defining the dataset.
Checking for the accuracy of our data entry is also our first statistical look at the data. To
open the dataset type in the Command window
. use "C:\Stata\firstsurvey.dta", clear
Let us run a couple of commands that will characterize the dataset and the data stored in it in
slightly different ways. We created a codebook to use when creating our dataset. Stata
provides a convenient way to reproduce much of that information, which is useful if you
want to check that you entered the information correctly:
. codebook gender

--------------------------------------------------------------------------
gender                                               Gender of respondents
--------------------------------------------------------------------------

                  type:  numeric (byte)

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/20

            tabulation:  Freq.  Value
                            10  1
                            10  2
Let us go over the display for gender. The first line lists the variable name, gender, and the
variable label, Gender of respondents. Next the type of the variable, which is numeric
(byte), is shown. The range of this variable (shown as the lowest value, then the highest
value) is from 1 to 2, there are two unique values, and there are 0 missing values in the 20
cases. This information is followed by a table showing the frequencies and values (if you
label the values using the method described in Section 3, you will also see the labels here). We
have 10 cases with a value of 1 and 10 cases with a value of 2.
Using codebook is an excellent way to check your categorical variables, such as gender
and smokestatus. If, in the tabulation, you saw any values other than 1 and 2 for gender, or
any values other than 1, 2, 3, and . for smokestatus, you would know that there are
errors in the data. [A period (.) in the dataset represents a missing value. There are other ways to
represent missing values too.] Looking at these kinds of summaries of your variables will
often help you detect data-entry errors, which is why it is crucial to review your data after
entry.
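The check described above can also be sketched with commands that list only the suspicious rows. The data below are a small hypothetical example (not the firstsurvey dataset) containing one deliberate entry error, a gender code of 3; the use of inlist() and count is one possible approach, offered as an illustration.

```stata
* Hypothetical mini dataset with one deliberate data-entry error (gender = 3)
clear
input id gender smokestatus
1 2 1
2 1 .
3 2 3
4 3 2
end

* Tabulate with the missing option so that . is shown as its own row
tabulate gender, missing
tabulate smokestatus, missing

* List only the rows whose codes fall outside the codebook
list id gender if !inlist(gender, 1, 2)
list id smokestatus if !inlist(smokestatus, 1, 2, 3) & !missing(smokestatus)

* Count the out-of-range gender codes
count if !inlist(gender, 1, 2)
```

Listing by id makes it easy to go back to the original questionnaire and correct the entry.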
2.8 Quick Review
In this section we covered
Creating a dataset
An example of questionnaire
Developing a coding system
Entering data
Saving your dataset
Checking the data
2.9 Exercises
1. For this exercise you need the dataset that you created, firstsurvey.dta.
a. Using the Data Editor window, recode education into a new variable edurec
with three groups: fewer than 13 years of education, 13-16 years of education, and 17 or
more years of education (use table 2.4.2 to recode education into the new
variable).
b. Generate a codebook for edurec and display the output.
c. Save your dataset as a new dataset and call it mysurveytrunc.dta.
Section 3: Preparing for data analysis
3.1 Introduction
This section shows how to prepare a dataset for analysis.
3.2 Planning your work
Most of the time spent on any research project goes into preparing your data for analysis. We
will cover a few basic steps in preparing your dataset. The data we will be using in this
tutorial and class are from different sources: the National Health and Nutrition Examination
Survey (NHANES), the Behavioral Risk Factor Surveillance System (BRFSS), and the Tobacco Use
Supplement to the Current Population Survey (TUS-CPS).
It is useful to create an outline of the steps needed to go from data collection to analysis. The
outline should include what needs to be done to collect or obtain the data, read and label the
data, make any necessary changes to the data, create any composite variables (sometimes
referred to as scales), and finally create an analysis-ready version of the dataset. Our project
outline, which is for a small project, is as follows:
Consult NHANES/BRFSS/TUS-CPS documentation to determine the type of
variables needed.
Download the data and look for codebooks or descriptions of the variables.
Create a basic Stata dataset (e.g. cancer.dta, nhanes.dta): We have uploaded
these datasets in Blackboard.
Create variable and value labels.
Recode missing-value codes to Stata missing values.
Reverse code those variables that need it; verify.
Copy variables not reversed to named variables; verify.
Create the scale variable.
Save the analysis-ready copy of the dataset.
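Two of the outline steps, converting missing-value codes and reverse coding, can be sketched as follows. This is a minimal illustration on a hypothetical variable named item, scored 0 to 3, in which 7 and 8 are assumed to be the "refused" and "don't know" codes; the variable name, the codes, and the choice of mvdecode are assumptions made for the example.

```stata
* Hypothetical item scored 0-3, with 7 = refused and 8 = don't know
clear
input item
0
3
7
8
end

* Turn the numeric codes into Stata extended missing values
mvdecode item, mv(7=.a \ 8=.b)

* Reverse code the 0-3 item into a new variable; missing stays missing
generate item_r = 3 - item

* Verify the result against the original
tabulate item item_r, missing
```

Keeping .a and .b distinct preserves the reason a value is missing, while every analysis command still treats both as missing.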
Note: The datasets are uploaded to Blackboard as *.dta files. All of these datasets
were originally downloaded in different formats from the websites of the respective surveys
and were later processed and saved in .dta format, ready for use in Stata. Here, *
stands for the file name and .dta is the extension that identifies the type of file. Remember that
you should download all your datasets from Blackboard into the folder “Stata” that you
created in C:\.
3.3 Loading data and inputting commands
Before starting to work with your dataset, you must load it into Stata. To do this,
first download nhanes.dta from Blackboard to your Stata folder, then type use
"C:\Stata\nhanes.dta", clear at the Stata command line and press ENTER.
Remember that Stata is case sensitive, so you must preserve uppercase and lowercase letters. For
example, if you type Describe instead of describe in the Command window, it will be
meaningless to Stata and will yield an error message. Also remember that the period shown
before each command in this tutorial, and any punctuation mark that follows a command in
running text, is not part of the command and should not be typed.
Once you load your dataset you can look at its contents by typing describe. You should see
the output displayed below. The output shown here is truncated (only part of it is
displayed):
. describe

Contains data from C:\Documents and Settings\Mohammad\Desktop\Lava_Stata Tutorial\nhanes.dta
  obs:         1,100
 vars:            62                          9 Dec 2010 14:26
 size:       550,000 (94.8% of memory free)
--------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
seqn            double %12.0g                 Respondent sequence number
riagendr        double %12.0g                 Gender
ridageyr        double %12.0g                 Age at Screening Adjudicated - Recode
ridagemn        double %12.0g                 Age in Months at Screening - Recode
ridageex        double %12.0g                 Age in Months at Exam - Recode
ridreth1        double %12.0g                 Race/Ethnicity - Recode
(Rest of the output is omitted)
Note that the datasets uploaded on Blackboard are random subsamples of all
observations. They are not the complete datasets for the respective surveys and will be
used only for the purpose of this class.
Note that every time you see word(s) in typewriter font, you should type the
word(s) in the Command window and press ENTER.
In general, describe reports the number of variables and the number of
observations in your dataset. It also reports the size of the dataset and the percentage of the
working memory (RAM) allocated to Stata that is still free. For each variable it lists the name,
storage type, display format, value label, and variable label.
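describe can also be limited to a few variables or shortened to the header alone. The sketch below uses a tiny hypothetical dataset that borrows three NHANES variable names so that it runs without nhanes.dta; the values are made up for illustration.

```stata
* Hypothetical two-row dataset using NHANES-style variable names
clear
input seqn riagendr ridageyr
45040 2 9
45041 1 34
end
label variable riagendr "Gender"

* Full description of every variable in memory
describe

* Description of selected variables only
describe riagendr ridageyr

* Header only: number of observations and variables
describe, short
```

With 62 variables in nhanes.dta, naming the variables of interest keeps the output to a manageable length.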
3.4 Looking at data
Using the command list, we get a closer look at the data. The command lists all the
contents of the data file. You can look at each observation by typing
. list

     +-----------------------------------------------------------------+
  1. |  seqn | riagendr | ridageyr | ridagemn | ridageex | ridreth1    |
     | 45040 |        2 |        9 |      118 |      118 |        1    |
     |-----------------------------------------------------------------|
     | dmdmartl | dmdhhsiz | indhhin2 | duq200 | duq210 | duq220q      |
     |        . |        4 |        2 |      . |      . |       .      |
     +-----------------------------------------------------------------+
(Rest of the output is omitted)
You will see a period (.) for those variables with no information recorded. Stata calls it a
“missing value” or just “missing”. In Stata, a period, or a period followed by a letter from a
to z, indicates a missing value. Later in this manual we will discuss how to define
and handle missing values in Stata.
Scrolling through a large number of observations is tedious, so the list command is not
very helpful with a large dataset. Even with a small dataset, list can display too much
information to process easily. However, it is sometimes useful to glance at the first few
observations to get a first impression or to check on the data. To scroll down to the next page of
the results when you see --more--, press the SPACEBAR; if you only want to advance one line,
press ENTER. After looking at a few observations, you might want to stop listing and avoid
scrolling to the last observation. You can stop the printout by pressing q, for quit. Anytime
you see --more-- on the screen, pressing q will stop listing results.
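To avoid scrolling altogether, list can be restricted to a few variables and a few observations. The sketch below runs on a small hypothetical dataset with NHANES-style variable names; the values are invented for the example.

```stata
* Hypothetical dataset with NHANES-style variable names
clear
input seqn riagendr ridageyr
45040 2 9
45041 1 34
45042 2 61
45043 1 15
end

* First two observations, selected variables only
list seqn riagendr ridageyr in 1/2

* Only the observations that meet a condition
list seqn ridageyr if ridageyr < 18
```

Both forms keep the output short enough that the --more-- prompt rarely appears.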
3.5 Creating variable and value labels
Now let’s add labels to the variables and their values. There are essentially three steps in
labeling:
i. Create a label for the variable itself
ii. Define a label for values
iii. Attach that label to a specific variable.
We first label our variable riagendr:
. label variable riagendr "Sex of Respondents"
Now the variable riagendr has a new label “Sex of Respondents”. You can see the new
label in the Variables window. If you use the command tab riagendr you will see this new
label in the output (note that tab is the short form of tabulate). The command tab
generates a frequency distribution of a variable.
. tab riagendr

     Sex of |
respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        544       49.45       49.45
          2 |        556       50.55      100.00
------------+-----------------------------------
      Total |      1,100      100.00
However, you will notice that no labels for the values have been defined yet. Now we want
to label the values of the categories of riagendr. To do this use the command label define.
First, we have to define a name for our labels. Let us call it ‘gender’.
. label define gender 1 "male" 2 "female", modify
At this point, you have only defined a label but have not attached it to the variable. If you
type label list you will see all the labels in your dataset. You will find the label for
‘gender’ with all the category values defined.
Now we must attach these category labels to the variable:
. label value riagendr gender
Note that the “gender” in the label define and label value commands is an arbitrary word. It
can be any word as long as it is the same word in the two commands. Now if we tabulate the
variable riagendr we get the categories of the variable that were defined above:
. tab riagendr

     Sex of |
respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |        544       49.45       49.45
     female |        556       50.55      100.00
------------+-----------------------------------
      Total |      1,100      100.00
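The three labeling steps can be seen end to end in one short sketch. It runs on a hypothetical three-row dataset rather than nhanes.dta, and adds label list to confirm what was defined before tabulating.

```stata
* Hypothetical data: three respondents coded 1 or 2
clear
input riagendr
1
2
2
end

* Step i: label the variable itself
label variable riagendr "Sex of Respondents"

* Step ii: define a value label (the name gender is arbitrary)
label define gender 1 "male" 2 "female"

* Step iii: attach the value label to the variable
label values riagendr gender

* Check the definition, then tabulate with labels
label list gender
tabulate riagendr
```

Because the value label is stored separately from the variable, the same label (for example a yes/no label) can be attached to many variables.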
Let us see another example with the variable dmdmartl (marital status):
First label the variable dmdmartl:
. label variable dmdmartl "Marital Status of Respondents"
Now the output for tabulate dmdmartl will be
. tab dmdmartl

    Marital |
  Status of |
Respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        323       51.35       51.35
          2 |         59        9.38       60.73
          3 |         71       11.29       72.02
          4 |         26        4.13       76.15
          5 |        108       17.17       93.32
          6 |         42        6.68      100.00
------------+-----------------------------------
      Total |        629      100.00

Now define labels and attach them to the variable:
. label define mar 1 "Married" 2 "Widowed" 3 "Divorced" 4 "Separated" 5 "Never married" 6 "Living with partner"
. label value dmdmartl mar
The new output for tab dmdmartl is
. tab dmdmartl

  Marital Status of |
        Respondents |      Freq.     Percent        Cum.
--------------------+-----------------------------------
            Married |        323       51.35       51.35
            Widowed |         59        9.38       60.73
           Divorced |         71       11.29       72.02
          Separated |         26        4.13       76.15
      Never married |        108       17.17       93.32
Living with partner |         42        6.68      100.00
--------------------+-----------------------------------
              Total |        629      100.00
Refer to http://www.cdc.gov/nchs/nhanes/nhanes2007-2008/nhanes07_08.htm to get the
correct value labels for the different categories of each variable in the dataset. For
example, click Demographics and then go to Docs under Demographic Variables and
Sample Weights; on the right-hand side you will see the table of contents with the list of
demographic variables. Click riagendr and you will see that 1 is male and 2 is
female. Use these codes in defining the value labels. Similarly, click dmdmartl and
you will see the value descriptions we used earlier in defining the value labels for
this variable.
3.6 Creating and modifying variables
There are several commands that are used in creating, replacing, and modifying variables.
Some useful commands include generate, replace, rename, and recode.
3.6.1 Generate and Replace: generate creates a new variable, whereas replace changes the
contents of an existing variable. To ensure that you do not accidentally lose data, you
cannot overwrite an existing variable with generate and you cannot generate a new
variable with replace. The command syntax for both is the same: you specify the name of the
command, followed by the name of the variable to be created or replaced. Then you place
an equal sign after the variable name and specify the expression whose value is to be stored
in the variable.
You can create a new variable:
. generate agegrp=.
This will generate a new variable called agegrp with missing values for all observations.
When you use the command tab agegrp you will see no observations.
. tab agegrp
no observations
replace changes the content of a variable. Below we change the content of the variable
agegrp to 9:
. replace agegrp=9
This will replace all the missing values of agegrp with 9.
Now when you tab agegrp you will see that the missing values have been replaced by 9
for the variable.
. tab agegrp

     agegrp |      Freq.     Percent        Cum.
------------+-----------------------------------
          9 |      1,100      100.00      100.00
------------+-----------------------------------
      Total |      1,100      100.00
Now let us use ridageyr to fill in the categorical variable agegrp, which we have already
created. We will categorize age into three groups: 0-17 years, 18-49 years, and
50+ years.
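. replace agegrp=1 if (ridageyr>=0 & ridageyr<18)
. replace agegrp=2 if (ridageyr>=18 & ridageyr<=49)
. replace agegrp=3 if (ridageyr>49 & ridageyr<.)
Based on the variable ridageyr (age at screening), the above set of replace commands
will categorize the new variable agegrp into three levels: 0-17 years; 18-49 years; and
50+ years. The condition ridageyr<. in the last command keeps any observation with a
missing age out of the 50+ group, because Stata treats a missing value as larger than any
number.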
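A quick way to confirm that the cut points behave as intended is to try them on a few boundary ages. The sketch below uses a hypothetical list of ages, including one missing value, and starts agegrp as missing rather than 9; the ridageyr < . condition and the tabstat check are additions made for this illustration.

```stata
* Hypothetical ages at the category boundaries, plus one missing age
clear
input ridageyr
9
17
18
49
50
80
.
end

generate agegrp = .
replace agegrp = 1 if ridageyr >= 0 & ridageyr < 18
replace agegrp = 2 if ridageyr >= 18 & ridageyr <= 49
* ridageyr < . keeps missing ages out of the 50+ group
replace agegrp = 3 if ridageyr > 49 & ridageyr < .

* Minimum and maximum age within each group should match the intended ranges
tabstat ridageyr, by(agegrp) statistics(min max n)

* Observations left unclassified (here, the one with a missing age)
count if missing(agegrp)
```

If the minimum or maximum age in a group falls outside its intended range, one of the conditions is wrong.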
You can now generate the frequency distribution of agegrp using the command
. tab agegrp
[Output not shown]
Based on the previous example of labeling riagendr, you can now label the new variable
agegrp, define its value labels, and attach the labels to it. Then regenerate its frequency
distribution table.
[Commands and Output not shown]
You can also generate a new variable from existing variables by combining arithmetic
operators and functions in an expression. Table 3.6.1.1 shows the arithmetic symbols that can
be used in expressions.
Naming variables: Variable names can be up to 32 characters
long. However, it is a good idea to keep the names concise (no more
than 8 characters is recommended) to save time when you have to type them repeatedly and to
make it easier to work with other statistical software packages. A variable name cannot begin
with a number and cannot contain spaces. You can build
your names from letters (A-Z and a-z), numbers (0-9), and underscores ( _ ). The
following names are not allowed:
_all double long _rc _b float
_n _skip byte if _N str#
_coeff in _pi using _cons int
_pred with
Table 3.6.1.1. Arithmetic Symbols
Symbol   Operation                    Example
+        Addition                     mscore+fscore+sibscore
-        Subtraction                  balance-expenses-penalty
*        Multiplication               income*0.75
/        Division                     expenses/income
^        Exponentiation (X squared)   X^2
Let us create a new variable bmi from the existing variables
bmxwt (weight in kg) and bmxht (standing height in cm) using arithmetic operators.
. generate bmi=bmxwt/((bmxht/100)^2)
This generate command creates a new variable called bmi which is defined as weight in
kilograms divided by the square of height in meters. This command can also be written
as:
. generate bmi=bmxwt/((bmxht/100)*(bmxht/100))
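The formula is easy to check by hand on a couple of made-up cases. In the sketch below the two rows are hypothetical: a person weighing 70 kg at 175 cm should have a BMI of 70/1.75^2, about 22.86, and a person weighing 90 kg at 180 cm about 27.78.

```stata
* Hypothetical weights (kg) and heights (cm)
clear
input bmxwt bmxht
70 175
90 180
end

* Height is converted from centimeters to meters before squaring
generate bmi = bmxwt / ((bmxht/100)^2)

list bmxwt bmxht bmi
summarize bmi
```

Forgetting the division by 100 would give values thousands of times too small, which a quick summarize makes obvious.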
3.6.2 Renaming variables: If you want to change the name of an existing variable,
use the rename command followed by the old name and then the new name. For example,
. rename riagendr gender
The rename command is suitable when you only have a few variables to change, as you
can only change one variable per rename command. However, in most circumstances,
using the generate command (e.g. generate gender=riagendr) would be preferred, because
it leaves the original variable in the dataset.
3.6.3 Recoding variables: You will often need to combine several values of one variable into a
single value. When you combine values, we recommend that you always generate a new
variable so that you preserve the original variable. The recode command can be used instead of
the generate and replace commands. Let us take the example of the age group created
above:
. recode ridageyr (min/17=1) (18/49=2) (50/max=3), gen(agegrprec)
With recode you assign new values to certain observations of a new variable according
to a coding rule. Using the generate( ) option stores the results in a new variable
instead of overwriting ridageyr. Here min refers to the lowest value of the variable
ridageyr, and max refers to the highest value. Note that the name of the new variable
agegrprec is arbitrary.
Compare the frequency distribution of agegrp and agegrprec.
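One way to make this comparison is a cross-tabulation of the two variables, in which every observation should fall on the diagonal. The sketch below rebuilds both versions on a hypothetical list of ages; the assert line is an extra check added for illustration and stops with an error if the two codings ever differ.

```stata
* Hypothetical ages at the category boundaries
clear
input ridageyr
5
17
18
49
50
85
end

* Version 1: generate and replace
generate agegrp = 1 if ridageyr < 18
replace agegrp = 2 if ridageyr >= 18 & ridageyr <= 49
replace agegrp = 3 if ridageyr > 49 & ridageyr < .

* Version 2: recode into a new variable
recode ridageyr (min/17=1) (18/49=2) (50/max=3), gen(agegrprec)

* All observations should lie on the diagonal of this table
tabulate agegrp agegrprec

* Stops with an error if any observation is coded differently
assert agegrp == agegrprec
```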
3.7 Creating Scales
Let us develop a scale variable. We will construct one scale variable, Depression Severity,
named depsev, using the 9 items (dpq010, dpq020, dpq030, …, dpq090) of the
depression screener. Each of these items is scored from 0 (not at all) to 3 (nearly every day).
Close inspection of these variables in the codebook shows that other values are also
assigned to each of these items: “7 = refused” and “8 = don’t know”. Therefore, we create a
new variable for each of these items in which any value greater than 3 is recoded to missing
and missing values stay missing:
. recode dpq010-dpq090 (0=0) (1=1) (2=2) (3=3) (else=.), gen(d1 d2 d3 d4 d5 d6 d7 d8 d9)
Then we create a new variable PHQ9Score which is the aggregate score of variables d1
through d9 each scored from 0 to 3. Therefore, the PHQ9Score ranges from 0 to 27 points.
Higher scores indicate a more severe depression.
. generate PHQ9Score=d1+d2+d3+d4+d5+d6+d7+d8+d9
Now let us generate a scale variable depsev by categorizing PHQ9Score such that
a score of 0 denotes “No Depression”, 1-4 “Minimal Depression”, 5-9 “Mild Depression”, 10-
14 “Moderate Depression”, 15-19 “Moderately Severe Depression”, and 20-27 “Severe
Depression”.
. recode PHQ9Score (0=0) (1/4=1) (5/9=2) (10/14=3) (15/19=4) (20/27=5), gen(depsev)
Check the variable depsev using a tab command:
. tab depsev
You may want to save your work as mynhanes.dta. You will need this dataset in later
sections and exercises.
. save "C:\Stata\mynhanes.dta", replace
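The whole scale-building sequence can be rehearsed on a scaled-down example. The sketch below uses three hypothetical items (q1 to q3) instead of the nine screener items, and made-up category names and cut points, so that the effect of the refused and don't know codes on the total is easy to follow.

```stata
* Hypothetical three-item version of the scale; 7 and 8 are non-response codes
clear
input q1 q2 q3
0 0 0
1 2 3
3 3 7
2 8 1
end

* Keep 0-3 and send every other value to missing
recode q1-q3 (0=0) (1=1) (2=2) (3=3) (else=.), gen(s1 s2 s3)

* The sum is missing whenever any item is missing
generate total = s1 + s2 + s3

* Categorize the total and label the categories (names and cut points are illustrative)
recode total (0=0) (1/4=1) (5/9=2), gen(sev)
label define sevlbl 0 "None" 1 "Low" 2 "High"
label values sev sevlbl

tabulate sev, missing
```

Rows three and four end up with a missing total and a missing category because one of their items was a non-response code.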
3.8 Quick Review
In this section we covered
Inputting commands
Loading data
Variables and observations
Looking at data
Labeling variables
Creating and modifying variables
Creating scales
3.9 Exercises
1. Open firstsurvey.dta.
a. With the help of the variable education, generate a new variable educate using
the following criteria.
If education = 8 then educate = 1; and
If education = 9 to 19 then educate = 2, where 1 = 0-8 years, and 2 =
9-19 years.
b. Label the variable and its values appropriately.
c. Using tab command, display the output for the variable educate.
d. Generate a new variable mincigrec where mincigrec is the score for time to the first
cigarette of the day (in minutes) such that
If firstcig = 1-5 then mincigrec = 3;
If firstcig = 6-30 then mincigrec = 2;
If firstcig = 31-60 then mincigrec = 1; and
If firstcig > 60 then mincigrec = 0
e. Generate another new variable avgcig where avgcig is the score for average
number of cigarettes smoked per day such that
If numcig = 1-10 then avgcig = 0;
If numcig = 11-20 then avgcig = 1;
If numcig = 21-30 then avgcig = 2; and
If numcig >30 then avgcig = 3
f. Generate the heaviness of smoking index (hsi), which is the sum of the scores
mincigrec and avgcig created in 1d and 1e. Label the new variable and
display it using the tab command.
g. Save the dataset with a new name “tobsurvey.dta”.
Section 4: Working with commands, do-files, and results
4.1 Introduction
This section shows how to use do-files, which are simple text files that contain a series of Stata
commands that are executed one after the other. Do-files allow you to replicate your work,
something you should always ensure you can do. When you collaborate with colleagues, they
can use your do-file as a way to follow exactly what you did. It is hard enough to remember
all the commands you create in a session, and if there is a delay between work sessions, it is
impossible to remember all those commands.
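As a first impression of what such a file contains, here is a minimal sketch of a do-file. It is hypothetical: it types in three rows of data so that it runs anywhere, whereas a real do-file for this tutorial would more likely begin by opening a saved dataset with use.

```stata
* example.do : a hypothetical do-file, run from top to bottom
* (in practice the data would usually be opened with -use- instead of typed in)
clear
input id gender age
1 2 25
2 1 27
3 2 29
end

* Label the variable and its values
label variable gender "Gender of respondents"
label define gender 1 "male" 2 "female"
label values gender gender

* Describe the data in three ways
summarize
codebook gender
tabulate gender
```

Running the same file again reproduces exactly the same labels and results, which is the point of keeping commands in a do-file.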
4.2 How Stata commands are constructed
Stata has many commands, including but not limited to the following:
list        List values of variables
summarize   Summary statistics
describe    Describe data in memory or in file
codebook    Describe data contents
tabulate    Tabulate frequencies
generate    Create or change contents of variables
egen        Extensions to generate
correlate   Correlations (covariances) of variables or coefficients
ttest       Mean-comparison tests
regress     Linear regression
alpha       Compute interitem correlations (covariances) and Cronbach’s alpha
graph       The graph command
What is a command? What is a do-file?
A command instructs Stata to do something, such as construct a graph, frequency
tabulation, or table of correlations. A do-file is a collection of commands. It tells Stata
what to “do”. A simple do-file might open a dataset, summarize the variables, create a
codebook, and then do a frequency tabulation of the categorical variables. A do-file can
include all the commands you use to label variables and values; it can recode variables,
average variables, and define how you treat missing values.
Stata has a remarkably simple command structure. Stata commands are all lower-case.
Virtually all Stata commands take the following form:
command varlist if/in, options
The command is the name of the command, such as summarize, generate, or tabulate.
The varlist is the list of variables used in the command. For many commands, listing no
variables means that the command will be executed on all variables. If we said summarize,
Stata would summarize all the variables in the dataset. If we said summarize age
education, Stata would summarize just age and education variables. The variable list
could include one variable or many variables.
After the variable list come the if and in qualifiers regarding what will be included in the
particular analysis. Suppose that we have a variable called gender. A code of 1 means that
the participant is a male, and a code of 2 means that the participant is female. We want
summary statistics for age and restrict the analysis to males. To restrict the analysis, we
would say summarize age if gender==1. Here we use two equal signs, which is the Stata
equivalent of the verb “is”. So the command means “Summarize age if gender is coded with
a value of 1”. Why the two equal signs? A single equal sign assigns a value: the statement
gender=1 literally means that the variable called gender is set to a constant value of 1. The
double equal sign only tests whether the value is 1, which is what we want here because
males are coded as 1 and females as 2 on this variable.
Sometimes we want to run a command on a range of observations identified by their position
in the dataset, and so we use the in qualifier. For example, we might have the command
summarize age education in 1/200, which would summarize age and education in the first
200 observations.
Each command has a set of options that control what is done and how the results are
presented. The options vary from command to command. One option for the summarize
command is to obtain detailed results, summarizing the variables in more ways. If we wanted
to do a detailed summary of scores on age and years of education for adult males, the
command would be
. summarize age education if gender==1 & age>17, detail
The command structure is fairly simple, which is helpful for us because it is absolutely rigid.
This example used the ampersand (&), not the word “and”. If we had entered the word “and”,
we would have received an error message. Here are more examples with if qualifiers:
. summarize age education if gender==2
. summarize age education if gender==1 & age>64
. summarize age gender if gender==2 & age>64 & education==12
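The pieces of the command structure (varlist, if, in, and options) can be tried out on a small example. The sketch below uses five hypothetical respondents; count is added to show how many observations a condition selects.

```stata
* Hypothetical respondents: gender (1 = male, 2 = female), age, years of education
clear
input gender age education
1 70 12
1 16 9
2 66 12
2 30 16
1 45 14
end

* varlist + if + option: detailed summary for adult males
summarize age education if gender == 1 & age > 17, detail

* varlist + in: the first three observations only
summarize age education in 1/3

* How many observations does the condition select?
count if gender == 1 & age > 17
```

Counting first is a cheap way to confirm that a condition selects the group you intended before running a longer analysis on it.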
When you have missing values stored as . or .a, .b, etc., you need to be careful about using
the if qualifier. Stata stores missing values internally as huge numbers that are bigger than
any value in your dataset. If you had missing data coded as . or .a and used the condition
age>64, as in summarize age if age>64, the condition would also be true for people whose
age is missing. (summarize age itself skips missing ages, but a count, or a summary of another
variable such as education, would include those people.) The correct
format would be
. summarize age if age>64 & age<.
The <. qualifier at the end of the command is strange to read (less than dot) but necessary.
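The effect of missing values on a greater-than condition is easiest to see by counting. In this sketch the four ages are hypothetical; two of them are missing (. and .a), and missing() is shown as an equivalent way of writing the same safeguard.

```stata
* Hypothetical ages: two real values and two kinds of missing
clear
input age
30
70
.
.a
end

* Missing values count as larger than any number, so this finds 3 observations
count if age > 64

* Excluding missing values leaves only the 70-year-old
count if age > 64 & age < .

* An equivalent way to write the same safeguard
count if age > 64 & !missing(age)
```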
Table 4.2.1 shows the relational operators available in Stata.
Table 4.2.1 Relational operators used by Stata
Symbol      Meaning
==          is or is equal to
!= or ~=    is not or is not equal to
>           is greater than
>=          is greater than or equal to
<           is less than
<=          is less than or equal to
Here are a few Stata commands and the results they produce. Stata must first read
the dataset. Let us use the dataset firstsurvey that you created in Section 2. You can enter
these commands in the Command window to follow along.
. use "c:\Stata\firstsurvey.dta", clear
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |        20        10.5     5.91608          1         20
      gender |        20         1.5    .5129892          1          2
         age |        20        28.8    10.57106         17         54
   education |        20       11.75    3.274704          8         19
   smoked100 |        20         .75    .4442617          0          1
-------------+--------------------------------------------------------
  initiation |        16     11.4375    1.965324          9         16
 smokestatus |        16      1.4375    .7274384          1          3
      numcig |        11    11.72727    4.540725          3         20
    firstcig |        11    74.54545    77.40977          5        240
   peersmoke |        11    1.909091    .9438798          0          3
-------------+--------------------------------------------------------
This summarize command does not include a variable list, so Stata will summarize all
variables in the dataset. It has no if/in restrictions and no options, so Stata summarizes all
the variables, giving us the number of nonmissing observations, the mean, the
standard deviation, and the minimum and maximum values. The statistics for the id variable
are not useful, but it is easier to get these results for all variables than it is to list all the
variables in a variable list, dropping only id.
We can add the detail option to our command to give more detailed information. Do this
for just one variable.
. summarize initiation, detail

                      Age at initiation
-------------------------------------------------------------
      Percentiles      Smallest
 1%            9              9
 5%            9              9
10%            9             10       Obs                  16
25%           10             10       Sum of Wgt.          16

50%           11                      Mean            11.4375
                        Largest       Std. Dev.      1.965324
75%           12             12
90%           15             13       Variance         3.8625
95%           16             15       Skewness       .9398362
99%           16             16       Kurtosis       3.281962
As expected, this method gives us more information. The 50% value is the median age at
initiation, which is 11 years. We also get the values corresponding to other percentiles, the
variance, a measure of skewness, and a measure of kurtosis.
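The statistics printed by summarize are also kept in memory after the command runs, where they can be displayed or reused. The sketch below uses five hypothetical ages at initiation; r(p50) and r(mean) are the stored median and mean, and return list shows everything that was saved.

```stata
* Hypothetical ages at initiation
clear
input initiation
9
10
11
12
16
end

summarize initiation, detail

* Stored results left behind by summarize
display "median = " r(p50)
display "mean   = " r(mean)

* Show every stored result
return list
```

Reading the median from r(p50) avoids retyping a number from the output, which matters once commands are collected in a do-file.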
4.3 Creating and Saving a do-file
Stata has a simple text editor called a Do-file Editor in which you can enter a series of
commands. You can run all the commands in this file or just some of them. You can then
edit, save, and open the commands in a do-file at a later date. Saving these do-files means not
only that you can replicate what you did and make any needed adjustments but also that you
will develop templates you can draw on when you want to do a similar analysis. To open the
Do-file Editor window, select Window>Do-file Editor>New Do-file Editor. You can also
open the Do-file Editor by clicking on the toolbar icon that looks like a notepad; see figure
4.3.1 for the icon in Stata. Figure 4.3.2a shows the Do-file Editor window.
Figure 4.3.1 The Do-file Editor icon on the Stata Menu
Figure 4.3.2a The Do-file Editor
Figure 4.3.2b Icons on the Do-file Editor Toolbar
When you click on another window, the Do-file Editor can be hidden by it. You can bring the
Do-file Editor to the front again by clicking on it in the system toolbar or by using the
Alt+Tab key combination to move through the open windows until the Do-file Editor is
highlighted. You can avoid having the window disappear by arranging your desktop so that
the other windows do not overlap with the Do-file Editor window.
The icons on the Do-file Editor toolbar shown in figure 4.3.2b are fairly standard. In newer
versions of Stata you will find additional icons in the toolbar. The first four icons let
you open a new do-file, open an existing do-file, save your file, and print the file. The fifth
icon lets you search the file, although most people prefer to use Ctrl+f. The next three icons
cut, copy, and paste, although most people prefer to use Ctrl+x, Ctrl+c, and Ctrl+v,
respectively. The next two icons provide undo and redo functions. The third from last icon
shows the content of the file in a Viewer window. The second to last icon runs the do-file
quietly, without sending results to the Results window. Finally, the last icon runs the do-file
and shows the output of each command in the Results window. You can either run your entire
do-file or run only selected lines. You select lines in the same way you would in a word
processor, by highlighting them. If you want to select a section of several lines, you do not
need to highlight all the lines completely; it is enough to highlight some of each line. Here is
an example where we want to run two commands, describe and summarize. In figure 4.3.3, we
have selected only part of the describe and summarize commands.
Figure 4.3.3 Highlighting in the Do-file Editor
[Note: In preparing this tutorial, we have removed the line numbers being displayed later in
the do-file editor.]
We recommend that you include the name of the file as a comment in the first line of the
do-file. Placing an asterisk (*) at the beginning of a line marks the text as a comment that
Stata prints but does not interpret. We include this line so that the output shows the name of
the do-file that created it, and we know where to find the file if we need to change
something later on. We also put the title of the project, the date the do-file was created, and
the duration of the project in comments before the Stata commands. Let’s type *my_first.do
at the top of our blank do-file.
Another way of adding comments, especially long comments, to a file is by typing /* just
before the comment and */ right after the comment. Anything between the /* and */ will be
treated as a comment. Figure 4.3.4 shows what we will use in our example do-file. This type
of long comment is helpful for organizing a long do-file into sections with a new comment
explaining the purpose of each new section.
Notice that the comments appear in the Do-file Editor in a green font. This makes them easy
to find in a long do-file. Stata commands appear in a bold blue font. However, you can
customize the font color and type.
Next we need to open the dataset. In our do-file we type
. use c:/Stata/firstsurvey.dta, clear
The next two commands we need to type are describe and summarize, which will describe
the dataset and then perform a summary of the variables, giving us the number of
observations, mean, standard deviation, minimum value, and maximum value. Unless there
are too many variables in your dataset, these are good commands to include at the beginning
of your do-file.
In general, one line equals one command in a do-file. This is a great way to distinguish one
command from another as long as each command is short enough to fit on one line. What
happens when you have a long command that extends for more than one line? Stata needs a
way to know that when you press the Return key, you are not really done with the command.
The solution is to put /// at the end of a line. This tells Stata that the next line is a
continuation of the previous one. We will illustrate the use of /// in the graph pie command
shown in figure 4.3.4. (We will cover this command in the next section. For now, just
enter it into the Do-file Editor.)
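Putting these pieces together, a do-file along the following lines would match the description so far. This is a sketch assembled from the text, not a copy of the figure: the wording of the comments is our own, and the graph pie command is the one used in section 5.4.

```stata
* my_first.do

/* Section 1: open the data and take a first look.
   A long comment like this one can introduce each
   section of a longer do-file. */
use c:/Stata/firstsurvey.dta, clear
describe
summarize

/* Section 2: pie chart of the number of friends who smoke.
   The /// at the end of a line continues the command
   on the next line. */
graph pie, over(peersmoke) title(Number of friends who smoke) ///
    note(firstsurvey.dta) plabel(_all name)
```

Without the /// at the end of the graph pie line, Stata would treat the note and plabel options as a separate (and invalid) command.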
Saving a do-file: You must remember to save your do-file. You do this much like you would
save a file in any other program; that is, click on File>Save As…. Then you can type the
name of the file; in our example, we are using my_first. You can also browse to find the
project folder where you want to save the file. For this class we are saving all the related files
in the c:/Stata/ folder.
Figure 4.3.4 shows our do-file, my_first.do, as it appears in the Do-file Editor. Notice that
when we saved the do-file, the filename now appears at the top of the Editor.
Run the do-file by clicking on the icon at the far right of the Do-File Editor toolbar.
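The same actions are available from the Command window. The sketch below assumes the do-file was saved as c:/Stata/my_first.do, following the folder and file name used above.

```stata
* Open my_first.do in the Do-file Editor
doedit c:/Stata/my_first.do

* Run the do-file and show the output of each command
do c:/Stata/my_first.do

* Run the do-file quietly, without showing output
run c:/Stata/my_first.do
```

The do and run commands correspond to the last and the second to last toolbar icons described earlier.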
Figure 4.3.4 Commands in the Do-file Editor window
4.4 Copying your results to a word processor
In some lab assignments, you will have to show your output. To save your results, highlight
the text you want in the Results window, right-click on the highlighted text, and select
Copy Text from the menu (or press Ctrl+c after you have highlighted your text). You can then
paste the text into your word processor. It is a good idea to include the commands that
appear in the Results window, because these give you a record of what you did. Unless there
was no data manipulation, however, these copied commands are no substitute for a do-file
that includes everything you did in preparing the data.
When you copy results from Stata’s Results window to your word processor, the output may
not look right; for example, the columns may not line up properly. The simplest solution to
this alignment problem is to change to a fixed-width font and probably to change the font
size, depending on your margins. We usually use the Courier or Courier New font at 9 point.
Sometimes you might also have to adjust the margins in the word processor.
4.5 Quick Review
In this section we covered
How Stata commands are constructed
Creating and Saving a do-file
Copying your results to a word processor
4.6 Exercises
1. The exercise for this section will be incorporated into the exercises of Section 5.
Saving tabular output
For tabular output (say, the result of a summarize command), you may want to select
just that portion of the text that appears as a table, right-click on it, and select Copy
Table, Copy Table as HTML, or Copy as Picture from the menu. You can then paste it
into your word processor. The Copy Table as HTML option pastes it as a table like one
you would see on a web page. Copy as Picture works nicely as long as you do not need
to format the output to a particular style.
Section 5: Descriptive Statistics and Graphs
5.1 Introduction
This section shows how to produce descriptive statistics and graphs using Stata. These
techniques are commonly used to explore the data before making decisions about further
analyses. Descriptive statistics are statistical procedures used to summarize, organize, and
simplify data.
Deciding which statistics to use to describe a variable largely depends upon the level of
measurement of the variable: nominal, ordinal, and interval/ratio levels of measurements.
Levels of Measurement
Nominal: At the nominal level of measurement, numbers are assigned to a set of categories
for the purpose of naming, labeling, or classifying the observations; the categories
have no specific ordering. Gender is an example of a nominal level variable. Using the
numbers 1 and 2, for instance, we can classify our observations into the categories ‘female’
and ‘male’, with 1 representing female and 2 representing male. The numbers 1 and 2 are
used to represent the different categories; we do not imply anything about the magnitude or
quantitative difference between the categories. Other examples of nominal level variables are
ethnicity, nationality, and race.
Ordinal: Ordinal level variables assign numbers to rank-ordered categories ranging from
lowest to highest. The classic ordinal level measure is a Likert scale that has categories
strongly disagree, disagree, neither agree nor disagree, agree, and strongly agree. We can
say that a person in the category ‘strongly agree’ agrees with the statement more than a
person in the ‘agree’ category, but we do not know the magnitude of the differences between
the categories. Another example of an ordinal variable is age groups with categories young,
middle-age, and old or 16-40, 41-64, and 65 and over.
Interval/ratio (or Quantitative): For interval/ratio level of measurement the categories
(values) of a variable can be rank-ordered and the differences between these categories
(values) are constant. Examples of variables measured at the interval/ratio level are age,
income, height, and weight. With these variables we can say which value is greater or smaller
and by how much.
Discrete or continuous: Nominal and ordinal measures are categorical variables and are
always discrete. Interval/ratio measures can be either discrete or continuous. For example,
number of children is interval level but discrete, as we say 0, 1, 2, 3, … children and not
2.5 children. Height is interval level but continuous, as it can take values such as
66 inches, 66.2 inches, 66.7 inches, ….
5.2 Where is the center of a distribution?
Descriptive statistics are used to describe a distribution. Three measures of central tendency
describe the center of a distribution: mode, median, and mean. These are commonly called
averages. Refer to Table 5.2.1 as a general guideline to make a decision about which measure
of central tendency is appropriate for each level of measurement.
Table 5.2.1
Level of measurement                                   Mode   Median   Mean
Categorical, no order (nominal, e.g., gender)          Yes    No       No
Categorical, ordered (ordinal, e.g., social support)   Yes    Yes      Yes*
Quantitative (interval/ratio, e.g., age)               Yes    Yes      Yes
*Many researchers use the mean when there are several ordered categories
Averages
Mode: The value or category that occurs most often in a distribution. It can be applied to
unordered categorical (nominal) variables such as gender, marital status, or race/ethnicity.
Median: The positional average that divides a distribution into two halves. Half the
observations will have a higher value and half will have a lower value than the median. It
can be applied to categories that are ordered (e.g., religiosity, job satisfaction, and level of
agreement) or to quantitative variables (e.g., age, education, and income).
Mean: Commonly known as the arithmetic average, the mean is computed by adding all the
scores in the distribution and dividing by the number of scores. While very high or low
values affect the magnitude of the mean, they have little impact on the median. Although
Warren Buffett would scarcely change the median income of a community, his moving to a
small town would raise the mean a lot.
5.3 How dispersed is the distribution?
Besides describing the central tendency or average in a distribution, descriptive statistics
describe the variability or dispersion of observations. Are they concentrated in the middle?
Do they trail off in one direction? Are they widely dispersed?
When there are only a few values or categories, we can use a frequency distribution
(tabulation) to describe the variable. This distribution shows each value or category of a
variable and tells how many observations have that value or fall into that category. We can
also use graphs to describe the dispersion of a distribution. When there are only a few
categories being shown, the most common graphs to use are pie charts and bar charts.
For a quantitative variable, we will usually want one number to represent the dispersion. The
standard deviation (SD) is used, especially with variables that have many possible values.
The SD is a measure of variability that tells us how far the observations are spread out from
the mean of the distribution. The smaller the SD, the closer the observations tend to be to
the mean (Figure 5.3.1). As the SD increases, the distribution becomes more dispersed
(Figure 5.3.2). For a normal distribution, about 68.3% of the distribution lies within
±1*SD of the mean, 95.4% lies within ±2*SD, and 99.7% lies within ±3*SD.
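You can check the percentages for a normal distribution yourself. The sketch below uses normal(), Stata's standard normal cumulative distribution function; no dataset is needed.

```stata
* Percentage of a normal distribution within 1 SD of the mean (roughly 68.3)
display 100 * (normal(1) - normal(-1))

* Within 2 SDs of the mean (roughly 95.4)
display 100 * (normal(2) - normal(-2))

* Within 3 SDs of the mean (roughly 99.7)
display 100 * (normal(3) - normal(-3))
```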
A graph that can show the spread of an ordinal or a quantitative variable is called a
histogram.
Figure 5.3.3 shows that the tail of the distribution is towards the left and hence is called
skewed towards the left. In this type of distribution the mean will be less than the median.
Figure 5.3.1 Distribution with SD=1
Figure 5.3.2 Distribution with SD=2
Figure 5.3.4 shows that the tail of the distribution is towards the right and hence is called
skewed towards the right. In this type of distribution the mean will be greater than the
median.
5.4 Statistics and graphs – Unordered categories
About all we can do to summarize a categorical variable that is unordered is to report the
mode and show a frequency distribution or a graph (pie chart or bar chart). In this section we
will use the dataset firstsurvey.dta, which we labeled in our previous exercise. This dataset
has categorical (unordered and ordered) and continuous variables. gender, smoked100,
and smokestatus are unordered (nominal) categorical variables.
. use c:/Stata/firstsurvey.dta
Now to get one-way tables of frequency distributions for a variable or a list of variables we
use the tabulate or tab1 command, respectively.
. tabulate gender

  Gender of |
respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |         10       50.00       50.00
     Female |         10       50.00      100.00
------------+-----------------------------------
      Total |         20      100.00

or

. tab1 gender smoked100 smokestatus

-> tabulation of gender

  Gender of |
respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |         10       50.00       50.00
     Female |         10       50.00      100.00
------------+-----------------------------------
      Total |         20      100.00

-> tabulation of smoked100

 Smoked at least |
  100 cigarettes |
         in life |      Freq.     Percent        Cum.
-----------------+-----------------------------------
              No |          4       20.00       20.00
             Yes |         16       80.00      100.00
-----------------+-----------------------------------
           Total |         20      100.00

-> tabulation of smokestatus

    Current |
    smoking |
     status |      Freq.     Percent        Cum.
------------+-----------------------------------
   Everyday |         11       68.75       68.75
   Somedays |          3       18.75       87.50
 Not at all |          2       12.50      100.00
------------+-----------------------------------
      Total |         16      100.00
The tabulation for gender tells us that 50% of the sample of 20 respondents are male and
the remaining 50% are female. Note that the tabulate command produces a one-way
frequency distribution for only one variable at a time, whereas tab1 generates a frequency
distribution table for each of the listed variables, here gender, smoked100, and
smokestatus. gender has no predominant category, because both categories are equally
frequent. For smoked100, 80% of the respondents smoked at least 100 cigarettes. This is a
clear mode, as it is so much more frequent than the other category. The mode for
smokestatus is everyday smoker, as it is the most frequently occurring (68.75%) category.
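When a variable has many categories, the mode is easier to spot if the table is ordered by frequency. A minimal sketch, assuming the dataset path used in this tutorial:

```stata
use c:/Stata/firstsurvey.dta, clear

* The sort option lists categories in descending order of frequency,
* so the mode appears in the first row
tabulate smokestatus, sort
```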
Please note that the total number of observations for the variable smokestatus is 16,
although our sample size is 20. This is because not all respondents in this survey smoked at
least 100 cigarettes in life. Therefore, there are missing observations for the variable
smokestatus, and the tabulate command yields the frequency distribution for nonmissing
observations only. However, if you want to calculate the frequency distribution from the
whole sample (including missing observations, represented by .), use the following command:
. tabulate smokestatus, miss

    Current |
    smoking |
     status |      Freq.     Percent        Cum.
------------+-----------------------------------
   Everyday |         11       55.00       55.00
   Somedays |          3       15.00       70.00
 Not at all |          2       10.00       80.00
          . |          4       20.00      100.00
------------+-----------------------------------
      Total |         20      100.00
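If you only need to know how many values are missing, you do not have to tabulate. The sketch below assumes the dataset path used in this tutorial; given the table above, the count for smokestatus should be 4.

```stata
use c:/Stata/firstsurvey.dta, clear

* Count observations with a missing value for smokestatus
count if missing(smokestatus)

* Report missing-value counts for every variable that has any
misstable summarize
```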
In section 4, we wrote a pie chart command in the do-file. Here we will create the pie chart.
Open your my_first.do file by selecting File>Open. Browse to your folder, select the file
my_first.do, and click Open. From the my_first.do file, select
graph pie, over(peersmoke) title(Number of friends who smoke) ///
    note(firstsurvey.dta) plabel(_all name)
Then run the selected command by clicking on the icon “Do Selected Line” at the far right of
the Do-file Editor toolbar. This will provide a visual display of the distribution of
peersmoke in our first survey. The size of each piece of the pie is proportional to the
percentage of the people in that category. This pie chart shows that the largest group of
respondents had 6-10 friends who smoke cigarettes.
Figure 5.4.1 Pie chart of number of friends who smoke
[Pie chart titled “Number of friends who smoke”, with slices labeled None, 1-5 friends,
6-10 friends, and >10 friends; note: firstsurvey.dta]
Obtaining both numbers and value labels
Before doing the tabulations, you might want to type the command numlabel _all,
add. After you enter this command, whenever you type the tabulate command, Stata
reports both the numbers you use for coding the data (1, 2, 3,..) and the value labels
(male, female). Later, if you do not want to include both of these, you can drop the
numerical values by using the command numlabel _all, remove. Practice using
these commands as an exercise on your own.
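A short sequence for trying this out might look as follows (a sketch, assuming the dataset path used in this tutorial):

```stata
use c:/Stata/firstsurvey.dta, clear

* Prefix every value label with its numeric code
numlabel _all, add
tabulate gender

* Remove the numeric prefixes again
numlabel _all, remove
tabulate gender
```

Comparing the two tables shows which numeric code lies behind each value label.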
It is possible to improve the default pie chart. The default pie chart is a bit hard to read
because it assumes you want each slice a different color, and this will not work well when
printing in black and white. Because many publications require black and white printing, we
should edit the pie chart.
We can open the Graph Editor by right-clicking on the pie chart and selecting Start Graph
Editor. This expands the window that has the graph and adds a panel on the side of the chart
with things we might want to change (Figure 5.4.2). On the left side is the pie chart we will
edit, and on the right are the names of the parts of the pie chart.
Figure 5.4.2 The Graph Editor
You may want to edit some or all of the labels. For example, let us change the label
“None” to “0 friends”. Click on the plus sign by legend and the plus sign by key region.
Double-click on label [1]; a “Textbox properties” window will pop up, where we will change
the Text from None to 0 friends. Click Apply and OK. You can also do this by
double-clicking on the label None under the pie chart and renaming it 0 friends. You might
have noticed that the change appears only in the legend; the label “None” inside the plot
region remains unchanged. Click on the plus sign by plotregion1. Double-click on
pielabel[1] and change the text from None to 0 friends. Select the color White in the Text
Styles box. Click Apply and then OK. You will also notice that the font color inside the pie
diagram changes to white.
Next click on the plus (+) sign by plotregion1, and then double-click on pieslices[1]. Here we
will pick Black as the Color and 100% as the Fill intensity. For pieslices[2], pick Black and
70%. For pieslices[3], pick Black and 50%. For pieslices[4], pick Black and 30%, check
Explode slice, and make sure the distance is Medium. You can check Explode slice for
whichever category you want to emphasize. Try experimenting with other options.
Figure 5.4.3 shows the edited pie chart.
Figure 5.4.3 Edited Pie Chart
We have only scratched the surface of what you can do with the powerful Graph Editor. For
example, you can click on the T (Add Text Tool) to the left of the graph and then click
somewhere on the figure. A dialog box opens so you can add text. Remember to click Apply
and then click OK. You could then click on the \ (slash or Add Line Tool) just below the T,
and draw a line from the text to the piece of pie it describes.
A bar chart is more attractive than a pie chart for many applications. Type the following
command in your do-file and run it:
. histogram peersmoke, discrete percent gap(10) addlabel ///
    xtitle(Number of peer smokers) ///
    xlabel(, angle(forty_five) valuelabel) ///
    title(Number of friends who smoke)
You will get the output shown in Figure 5.4.4.
Figure 5.4.4 Bar chart of peer smokers
[Bar chart titled “Number of friends who smoke”; x-axis: Number of peer smokers; y-axis:
Percent; bar labels: None 9.091, 1-5 friends 18.18, 6-10 friends 45.45, >10 friends 27.27]
Preparing a graph becomes easier when you use the dialog box. Here we will create the
previous pie chart and the bar chart using the dialog box. In the Stata window, select
Graphics>Pie chart and look at the Main tab. If this dialog still has stored information from a
previous chart, you should click on the Reset icon (Figure 5.4.5) to clear the dialog box.
Figure 5.4.5 Reset icon
Type peersmoke in the category variable box. This uses the categories we want to show as
pieces of the pie. Under the Titles tab, enter the title “Number of friends who smoke” and,
as a Note, the name of the dataset we used, firstsurvey.dta. Under the Options tab, click on
Order by this variable and type peersmoke. Also check Exclude observations with missing
values (casewise deletion) because we do not want these, if any, to appear as a separate piece
of the pie. Under the Slices option, select Label properties (all) from the Label section. Select
Name in the Label type box. Click Accept and then click OK. You will obtain a pie chart
similar to the one in Figure 5.4.1.
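When you click OK, Stata echoes the command that the dialog built in the Results window, so you can copy it into your do-file. The command below is our guess at a command-line counterpart of the choices described above, not output copied from Stata; the mapping of each dialog choice to an option is an assumption noted in the comments.

```stata
use c:/Stata/firstsurvey.dta, clear

* Assumed counterparts of the dialog choices:
*   sort(peersmoke)    order the slices by this variable
*   cw                 casewise deletion of observations with missing values
*   plabel(_all name)  label every slice with its category name
graph pie, over(peersmoke) sort(peersmoke) cw ///
    title(Number of friends who smoke) ///
    note(firstsurvey.dta) plabel(_all name)
```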
Now let us obtain the previous bar chart using the dialog box. From the Graphics menu,
select Histogram. Here we are creating a bar chart rather than a histogram, but this is the best
way to produce a high-quality bar chart using Stata.
Bar Charts and Histograms
Bar Chart: A bar chart is a pictorial representation of a frequency distribution of either
nominal or ordinal data. The various categories into which the observations fall are
presented along a horizontal axis of a bar diagram. A vertical bar is drawn above each
category such that the height of the bar represents either the frequency or the relative
frequency of observations within that category. The bars should be of equal width and
separated from one another so as not to imply continuity.
Histogram: A histogram depicts a frequency distribution for discrete or continuous data. The
horizontal axis in a histogram displays the true limits of the various intervals. The true limits
of an interval are the points that separate it from the intervals on either side. The vertical axis
of a histogram depicts either the frequency or the relative frequency of observations within
each interval. The frequency associated with each interval in a histogram is represented by
the bar’s area and not its height. The area of the entire histogram sums up to 100%, or 1.
On the Main tab, type peersmoke in the Variable box. Click on the button next to Data are
discrete. In the section labeled Y axis, click on the button next to Percent. The trick to
making this histogram a bar chart is to click on Bar properties in the lower left of the Main
tab. This opens another dialog box where the default is to have no gap (bar gap=0) between
the bars. Change this to a gap of 10, which sets the gap between bars to 10 percent of the
width of a bar. Click Accept. If you switch to Titles tab, you can enter a title, Number of
friends who smoke. Next switch to the X axis tab and click on Major tick/label properties.
This opens another dialog box where you select the Labels tab and check the box for Use
value labels. Sometimes the value labels are too wide to fit under each bar. You may need to
create new value labels that are shorter. However, if they are just a little bit too wide, you can
change the angle. Click on Angle and select 45 degrees from the drop-down menu. Click on
Accept. Finally switch back to the Main tab. In the lower right corner of the dialog box, click
on Add height labels to bars. Because we are reporting percentages, this option will show the
percentage in each peersmoke category at the top of each bar. Click OK and you will get the bar chart
as shown in Figure 5.4.4.
Sometimes you may have a larger number of categories. When this happens, Stata’s default
will show only a limited number of value labels, so some of the bars will be unlabeled. If you
want to label all of them, you need to go to the X axis tab and click on Major tick/label
properties to open a dialog box we opened before. On the Rule tab, click on Suggest # of
ticks and enter the number of bars in the box by Ticks.
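From a do-file, one way to label every bar is to list the tick values explicitly in xlabel(). The sketch below is an alternative to the dialog route rather than the command the dialog produces; it assumes peersmoke is coded 0 through 3, as the summarize output in section 4 suggests.

```stata
use c:/Stata/firstsurvey.dta, clear

* Ask for a labeled tick at every value from 0 to 3 in steps of 1,
* showing the value labels at a 45-degree angle
histogram peersmoke, discrete percent gap(10) addlabel ///
    xlabel(0(1)3, valuelabel angle(forty_five)) ///
    xtitle(Number of peer smokers)
```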
5.5 Statistics and graphs – Ordered categories and variables
When our categories are ordered, we can use the median to measure the central tendency.
When there are only a couple of categories, however, the median does not work well. Here is
an example where there are several categories for the education variable. We might want to
report the median or mean and the SD. Let us run commands to get a frequency distribution
and summary details of the variable education.
. tab1 education

-> tabulation of education

    Highest |
   level of |
educational |
 attainment |      Freq.     Percent        Cum.
------------+-----------------------------------
          8 |          4       20.00       20.00
          9 |          3       15.00       35.00
         10 |          2       10.00       45.00
         11 |          1        5.00       50.00
         12 |          2       10.00       60.00
         13 |          2       10.00       70.00
         14 |          1        5.00       75.00
         15 |          2       10.00       85.00
         16 |          2       10.00       95.00
         19 |          1        5.00      100.00
------------+-----------------------------------
      Total |         20      100.00

. summarize education, detail

           Highest level of educational attainment
-------------------------------------------------------------
      Percentiles      Smallest
 1%            8              8
 5%            8              8
10%            8              8       Obs                  20
25%            9              8       Sum of Wgt.          20

50%         11.5                      Mean              11.75
                        Largest       Std. Dev.      3.274704
75%         14.5             15
90%           16             16       Variance       10.72368
95%         17.5             16       Skewness       .5137804
99%           19             19       Kurtosis       2.240513
The frequency distribution produced by tab1 is probably the most useful way to describe the
distribution of an ordered categorical variable. We can see that the distribution is skewed to
the right with higher frequencies at lower level of educational attainment.
Although the tabulation gives us a good description of the distribution, we often will not have
the space to report this level of detail. The median (50th percentile) is provided by the
summarize command. The median for this distribution is 11.5, corresponding to less than 12
years of educational attainment. Even though these are ordered categories, many researchers
treat variables like this as if they were interval and report the mean. The mean assumes that
the quantitative values 8-20+ are interval-level measures. Though the mean (Mean = 11.75)
is usually a good measure of central tendency, in this example the median is preferred
because the distribution of educational attainment is skewed (Figure 5.5.1). If we are in
doubt about which measure of central tendency to report, it may be a good idea to report
both the median and the mean.
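If you decide to report both, tabstat can place them side by side in one compact table. A minimal sketch, assuming the dataset path used in this tutorial; the values should agree with the summarize output above.

```stata
use c:/Stata/firstsurvey.dta, clear

* Number of observations, mean, median (shown as p50), and SD in one row
tabstat education, statistics(n mean median sd)
```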
Here is how we can create a histogram showing the distribution of educational level.
. histogram education, discrete percent ///
    title(Highest Level of educational attainment) ///
    note(firstsurvey.dta) xtitle(Educational level) scheme(s1mono)
Figure 5.5.1. Histogram of educational level attained
Remember, you can try creating a histogram using the dialog box as described in section 5.4.
The histogram in Figure 5.5.1 allows the reader to quickly get a good sense of the
distribution. The bars on the left are a bit taller than the bars on the right, which indicates
that there is a disproportionate number of people with lower levels of education
(Mean=11.75 years, Median=11.5 years, Mode=0-8 years of education, Standard deviation
(SD)=3.27).
5.6 Statistics and graphs – Quantitative variables
We will study two variables: age and numcig (number of cigarettes per day). Histograms and
box plots are common graphs used to display quantitative variables. We will usually
use the mean or median to measure the central tendency for quantitative variables. The SD is
the most widely used measure of dispersion, but a statistic called the interquartile range is
used by the box plots presented in this section.
Let’s start with age, the age of the respondents. Computing descriptive statistics for
quantitative variables is easy. For example,
. summarize age, detail

                     Age of respondents
-------------------------------------------------------------
      Percentiles      Smallest
 1%           17             17
 5%           18             19
10%           19             19       Obs                  20
25%         20.5             20       Sum of Wgt.          20

50%         25.5                      Mean               28.8
                        Largest       Std. Dev.      10.57106
75%         36.5             41
90%           45             43       Variance       111.7474
95%         50.5             47       Skewness       .9892077
99%           54             54       Kurtosis       2.848827
This output shows that the average age of the respondents is 28.8 years. The median is 25.5
years. Because the mean is greater than the median and the skewness statistic is positive
(0.99), we can conclude that the distribution is positively skewed (trails off on the right side).
The SD is 10.57 years, which tells us that the age of the respondents varies widely.
Let us use some graphs to describe the distribution of age. First we will create a simple
histogram by using frequency instead of percentage. Use the following command:
. histogram age, frequency title(Age of the respondent) ///
    note(firstsurvey.dta) xtitle(Age in years) scheme(s1mono)
Figure 5.6.1. Histogram of age of the respondents
[Histogram titled “Age of the respondent”; x-axis: Age in years; y-axis: Frequency; note:
firstsurvey.dta]
This simple command gives us a quick view of the distribution. This graph shows that it is
skewed to the right. Now let us see the distribution for males and females separately.
. histogram age, frequency title(Age of the respondent) ///
    note(firstsurvey.dta) xtitle(Age in years) scheme(s1mono) by(gender)
Figure 5.6.2. Histogram of age of the respondents by gender
The skewness still persists for the distribution of age by gender. The graphs show that
compared to females, there are more males in the younger age groups.
Now to get descriptive statistics for males and females separately, we need a new
command:
. by gender, sort: summarize age

--------------------------------------------------------------------------
-> gender = Male

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |        10        25.4     6.432556        19         38

--------------------------------------------------------------------------
-> gender = Female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |        10        32.2     12.99402        17         54
This command sorts the data by gender and then summarizes the variable age for each
gender.
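Stata also offers a shorthand, bysort, that does the same thing in fewer keystrokes. A minimal sketch, assuming the dataset path used in this tutorial:

```stata
use c:/Stata/firstsurvey.dta, clear

* bysort gender: is shorthand for by gender, sort:
bysort gender: summarize age
```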
We can customize the output for getting the descriptive statistics by using the following
command:
. tabstat age, statistics(n mean median sd min max) ///
    by(gender) columns(statistics)

Summary for variables: age
     by categories of: gender (Gender of respondents)

gender |         N      mean       p50        sd       min       max
-------+------------------------------------------------------------
  Male |        10      25.4      23.5  6.432556        19        38
Female |        10      32.2      27.5  12.99402        17        54
-------+------------------------------------------------------------
 Total |        20      28.8      25.5  10.57106        17        54
--------------------------------------------------------------------
Note that the tabstat command produces the descriptive statistics for age for the whole
sample (the Total row) together with those by gender. Stata refers to the median as p50, or the 50th
percentile. We can change the name of p50 to median in a word file after we copy the output
as text or table (and not as a picture) from the output window and paste it into the word file.
On average, females were older than males in the sample (mean 32.2 versus 25.4 years). Because
the mean is larger than the median in each group, the distributions appear to be positively
skewed (skewed to the right), as was evident from the histogram in figure 5.6.2.
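Comparing the mean with the median is only an informal check for skewness. As a sketch, and again assuming firstsurvey.dta is open, the skewness coefficient can be requested directly by adding skewness to the tabstat statistics list. A positive value points to a right-skewed distribution.

```stata
* Assumes firstsurvey.dta is already in memory.
* skewness is one of the statistics that tabstat can report;
* values above 0 indicate a longer right tail.
tabstat age, statistics(n mean median skewness) by(gender) columns(statistics)
```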
A horizontal (hbox) or vertical (box) box plot is an alternative way to represent the
distribution of a quantitative variable like age. The command for the horizontal box plot is:
. graph hbox age, over (gender) title (Age of the respondents) ///
      subtitle (by gender) note (firstsurvey.dta)
Figure 5.6.3. Horizontal box plot of age by gender
[Figure: horizontal box plot titled "Age of the respondents", subtitle "by gender"; categories: Female, Male; axis: Age of respondents (20 to 60); note: firstsurvey.dta]
We see that the distributions of age for both males and females are skewed to the right, as
both have longer whiskers towards the right. The length of the box (shaded region) in a
horizontal box plot is the interquartile range. The interquartile range for females is larger
than that for males, i.e. the ages of females are more dispersed than those of males. The
vertical line inside the box is the median: around 24 years for males and close to 28 years
for females, in line with the medians of 23.5 and 27.5 reported by tabstat. We can also see
a small dot beyond the end of a whisker of the box plot for males. That observation is
regarded as an outlier, i.e. an observation with an extreme value.
Try generating a vertical box plot by using box in place of hbox in the above command.
5.7 Cross-tabulation for two categorical variables
Cross-tabulation is a technical term for a table that has rows representing one categorical
variable and columns representing another. These tables are sometimes called contingency
tables because the category a person is in on one of the variables is contingent on the category
the person is in on the other variable. For example, whether people are current smokers
may be contingent on their gender. In this tutorial, when one variable depends on the
other, we put the dependent variable as the column variable and the independent
variable as the row variable.
Let us start with a basic cross-tabulation of whether a person has smoked at least 100
cigarettes in life and their gender. Suppose we expect that women are less prone to health
risk behaviors, like smoking. Then whether a person has smoked at least 100 cigarettes
in life depends on gender, i.e. the variable smoked100 is the dependent variable and
gender is the independent variable in our firstsurvey.dta. Let us use the command:
. tabulate gender smoked100

 Gender of | Smoked at least 100
respondent |  cigarettes in life
         s |        No        Yes |     Total
-----------+----------------------+----------
      Male |         3          7 |        10
    Female |         2          8 |        10
-----------+----------------------+----------
     Total |         5         15 |        20
Next, we compute the percentages so that each row adds up to 100%.
. tabulate gender smoked100, row

+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
+-------------------+

 Gender of | Smoked at least 100
respondent |  cigarettes in life
         s |        No        Yes |     Total
-----------+----------------------+----------
      Male |         3          7 |        10
           |     30.00      70.00 |    100.00
-----------+----------------------+----------
    Female |         2          8 |        10
           |     20.00      80.00 |    100.00
-----------+----------------------+----------
     Total |         5         15 |        20
           |     25.00      75.00 |    100.00
Raw frequencies can be hard to compare across groups, particularly when the rows or
columns contain different numbers of observations. One way to interpret a table is to use
percentages, which take into account the number of observations within each category of the
independent variable (predictor). The percentages appear just below the frequencies in each
cell, and the row option in the Stata command is what produces them. Note that the
percentages add up to 100% for each row. Overall, 75% of the respondents said “yes”, they had
smoked at least 100 cigarettes in life, and 25% said “no”. However, 80% of the women
compared with 70% of the men reported having smoked at least 100 cigarettes in their life,
which is the opposite of what we expected.
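The row option is one of several percentage options of tabulate. As a brief sketch, assuming firstsurvey.dta is still open, the commands below show row percentages without the raw counts, and then column and cell (overall) percentages. Row percentages remain the natural choice here because gender, the independent variable, is the row variable.

```stata
* Assumes firstsurvey.dta is already in memory.
* Row percentages only; nofreq suppresses the frequencies.
tabulate gender smoked100, row nofreq

* Column percentages: each column adds up to 100%.
tabulate gender smoked100, column

* Cell percentages: all cells together add up to 100%.
tabulate gender smoked100, cell
```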
5.8 Chi-squared test
The difference between women and men seems small (10 percentage points), but is this
difference merely due to chance? Or is the difference statistically significant, so that
gender is actually associated with having ever smoked at least 100 cigarettes? The sample
size matters in coming to a conclusion.
We use the chi-squared (χ²) statistic to assess how likely it is that a result like ours
occurred by chance. If it is extremely unlikely to get this much difference between men and
women in a sample of this size by chance, we can be confident that there is a real difference
between women and men. Even then, we still need to look at the percentages to decide whether
the statistically significant difference is large enough to be important.
The chi-squared test compares the frequency in each cell with what you would expect the
frequency to be by chance, if there were no relationship. The expected frequency for a cell
depends on how many people are in the row and how many are in the column. Because of the
small sample size, we have very few people in each cell.
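As a worked illustration of this comparison, the expected frequencies and the test statistic can be computed by hand from the totals in the gender by smoked100 table (row totals 10 and 10, column totals 5 and 15, N = 20). The symbols below are notation introduced here for illustration, not the tutorial's own: $O_{ij}$ is the observed frequency, $E_{ij}$ the expected frequency, $n_{i\cdot}$ a row total and $n_{\cdot j}$ a column total.

```latex
E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{N}, \qquad
E_{\text{Male, No}} = \frac{10 \times 5}{20} = 2.5, \qquad
E_{\text{Male, Yes}} = \frac{10 \times 15}{20} = 7.5

\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
       = \frac{(3 - 2.5)^2}{2.5} + \frac{(7 - 7.5)^2}{7.5}
       + \frac{(2 - 2.5)^2}{2.5} + \frac{(8 - 7.5)^2}{7.5}
       \approx 0.2667
```

The expected frequencies for females are the same as for males because both rows contain 10 people. The result agrees with the Pearson chi2 value in the Stata output that follows.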
In the cross-tabulation command, we add the options chi2 and expected:
. tabulate gender smoked100, chi2 expected row
The option chi2 produces Pearson’s chi-squared statistic with its p-value. The option expected
produces the frequency expected in each cell if there were no relationship between the
two variables. Note that we do not usually request the expected frequencies.
. tabulate gender smoked100, chi2 expected row

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
|   row percentage   |
+--------------------+

 Gender of | Smoked at least 100
respondent |  cigarettes in life
         s |        No        Yes |     Total
-----------+----------------------+----------
      Male |         3          7 |        10
           |       2.5        7.5 |      10.0
           |     30.00      70.00 |    100.00
-----------+----------------------+----------
    Female |         2          8 |        10
           |       2.5        7.5 |      10.0
           |     20.00      80.00 |    100.00
-----------+----------------------+----------
     Total |         5         15 |        20
           |       5.0       15.0 |      20.0
           |     25.00      75.00 |    100.00

          Pearson chi2(1) =   0.2667   Pr = 0.606
At the bottom of the table, Stata reports Pearson chi2(1) = 0.2667 and Pr = 0.606,
which would be written as χ²(1, N = 20) = 0.2667, p = 0.606. Because p > 0.05, the
association is not statistically significant. Here we have 1 degree of freedom. Note that
whenever Stata reports Pr = 0.000, we have to report it as p < 0.001. [As a general rule, a
p-value less than 0.05 is taken to indicate a statistically significant association between
the variables.]
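If only the four cell counts are at hand, the same test can be reproduced without the dataset by using tabi, the immediate form of tabulate. This is a small sketch that uses the counts from the table above. Note that tabi labels the rows and columns with numbers rather than with the value labels of gender and smoked100.

```stata
* Cell counts are typed directly; a backslash separates the rows.
* Row 1 = Male, row 2 = Female; column 1 = No, column 2 = Yes.
tabi 3 7 \ 2 8, chi2 expected row

* tabulate and tabi leave the test results in r().
display "chi2 = " %6.4f r(chi2) "   p = " %5.3f r(p)
```

The stored results r(chi2) and r(p) are convenient when the values are needed later in a do-file.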
To summarize, in this sample of 20 people women were more likely than men to report that
they had smoked at least 100 cigarettes in their life: 80% of the women compared with 70%
of the men. However, this relationship between gender and having smoked at least 100
cigarettes in life was not statistically significant (p = 0.606), so a difference of this
size could easily have arisen by chance.
5.9 Quick Review
In this section we covered:
• Some basic statistical terms
• One-way frequency distribution
• Summary statistics
• Working with graphs and graph editor
  o Pie chart
  o Histogram
  o Box plot
• Cross-tabulation of two categorical variables
• Chi-squared test
5.10 Exercises
1. Students are required to submit the output of this exercise. The outputs of this exercise
include i) a word file with the requested graphs and charts and ii) a do-file for the whole
exercise.
Degrees of freedom
The degrees of freedom for a chi-squared test equal (R - 1)*(C - 1), where R = number of rows
and C = number of columns in an R by C table. In the table of gender by smoked100 in
Section 5.8, the independent and dependent variables each have 2 possible outcomes, so it
is a 2 by 2 table. Its degrees of freedom are therefore (2-1)*(2-1) = 1.
A. Open a new do-file. Give the do-file an appropriate title as a comment. You can also
include a brief description of the do-file under the title. For each of the following
requests, type the necessary commands in the do-file.
a. Open mynhanes.dta
b. Use the variable ridreth1 and label it as “Race/ethnicity of respondents”.
Assign and attach value labels to the codes of ridreth1. Refer to the website of
CDC/NHANES provided in Section 3 of this tutorial to define values/codes of the
variable ridreth1.
c. Generate a frequency distribution table for the variable ridreth1.
d. Create a pie-chart for ridreth1.
Run your do-file. Copy the pie-chart and paste it in your word file. In the word
file, use the pie-chart to identify the race/ethnicity category with the largest share
of the sample.
B. Continue working on your do-file. Now type a command to show graphically the
distribution of weight in kg (bmxwt) by gender. Save your graph in the same word-file
as before. Briefly explain the graph.
C. What are the mean and median weights for men and women? Type the command in the
do-file to generate these statistics.
D. Generate a variable for depression based on the severity scale of depression. Name the
new variable as deprs such that deprs=0 if depsev=0 and deprs=1 if depsev=1-5.
Label the new variable and its values.
E. Conduct a Chi-squared test to see if depression (deprs) is associated with having smoked
at least 100 cigarettes in life (smq020). Copy and paste the table generated in the word-
file. Summarize your results.
(Note: If you have not run the commands for B through E, select the commands
associated with these questions and run them.)
Save your do-file and the word-file and send them to your TA’s email:
Bibliography
Acock AC. A Gentle Introduction to Stata. 3rd ed. College Station, TX: Stata Press; 2010.
Kohler U, Kreuter F. Data Analysis Using Stata. 2nd ed. College Station, TX: Stata Press; 2009.
Pagano M, Gauvreau K. Principles of Biostatistics. 2nd ed. USA: Duxbury Press; 2000.
Pevalin D, Robson K. The Stata Survival Manual. 1st ed. New York, NY: Open University Press; 2009.