Stata Tutorial

Disclaimer

We would like to acknowledge that in preparing this tutorial we have used materials from

various sources listed in the bibliography. This tutorial will be used for the lab exercise of the

course “HPRO 805: Applied Research in Public Health”.

Table of Contents

Section 1: Getting Started
1.1 Convention
1.2 Executing a Command
1.3 Working along with the tutorial
1.4 The Stata screen
Using an existing dataset
1.5 An example of a short Stata session
1.6 Quick Review
1.7 Exercises

Section 2: Entering data
2.1 Introduction
2.2 Creating a dataset
2.3 An example of questionnaire
2.4 Developing a coding system
2.5 Entering data
2.6 Saving your dataset
2.7 Checking the data
2.8 Quick Review
2.9 Exercises

Section 3: Preparing for data analysis
3.1 Introduction
3.2 Planning your work
3.3 Loading data and inputting commands
3.4 Looking at data
3.5 Creating variable and value labels
3.6 Creating and modifying variables
3.6.1 Generate and Replace
3.6.2 Renaming variables
3.6.3 Recoding variables
3.7 Creating Scales
3.8 Quick Review
3.9 Exercises

Section 4: Working with commands, do-files, and results
4.1 Introduction
4.2 How Stata commands are constructed
4.3 Creating and Saving a do-file
4.4 Copying your results to a word processor
4.5 Quick Review
4.6 Exercises

Section 5: Descriptive Statistics and Graphs
5.1 Introduction
5.2 Where is the center of a distribution?
5.3 How dispersed is the distribution?
5.4 Statistics and graphs: Unordered categories
5.5 Statistics and graphs: Ordered categories and variables
5.6 Statistics and graphs: Quantitative variables
5.7 Cross-tabulation for two categorical variables
5.8 Chi-squared test
5.9 Quick Review
5.10 Exercises

Bibliography

Section 1: Getting Started

1.1 Convention:

Listed below are the conventions that are used throughout this tutorial.

Typewriter font. This tutorial will use this font when something would be meaningful to

Stata as input. It will also be used in the tutorial to indicate Stata’s output.

This tutorial will use a typewriter font to indicate the text to type in the Command window.

Because Stata commands do not have any special characters at the end, any punctuation mark

at the end of a command in this tutorial is not part of the command.

This tutorial also uses the typewriter font for variable names and for names of datasets. In

general, this tutorial uses the typewriter font whenever the text is something that can be typed

in the Stata Command window or when the text is something that Stata might print as output.

1.2 Executing a Command

After you type a command you need to execute the command. Press ENTER on your

keyboard. We may not mention “press ENTER” in this tutorial after every command;

however, you have to press ENTER to execute that command.

1.3 Working along with the tutorial:

We cannot say it too often: the only way to learn how to analyze data is to analyze data

yourself. We strongly recommend that you reproduce our examples in Stata as you read this

tutorial. A line that is written in this font and begins with a period represents a Stata

command, and we encourage you to enter that command in Stata. Typing the commands and

seeing the results will help you better understand the text, since we sometimes omit output to

save space.

What does the dot prompt before a command mean?

When we show a listing of Stata commands, we place a dot and a space in front of each

command. When you enter these commands in the Command window, you enter the

command itself without the dot prompt or the space before the command. We include

these because Stata always shows commands this way in the Results window.

1.4 The Stata screen

Figure 1.4.1. The Stata Screen

When you open a file that contains Stata data, which we will call a Stata dataset, a list of

variables will appear in the Variables window. The Variables window reports the name of each

variable (for example, abortion), a label for the variable (for example, Attitude toward

abortion), the type of variable (for example, float), and the format of the variable (for

example, %8.0g).

When Stata executes a command, it prints the results or output in the Results window. First,

it prints the command preceded by a . (dot) prompt, and then it prints the output. The

commands you run are also listed in the Review window. If you click on one of the

commands listed in the Review window, it will appear in the Command window. If you

double-click on one of the commands listed in the Review window, it will be executed. You

will then see the command and its output, if any, in the Results window.

Using an existing dataset

Section 2 discusses how to create your own dataset, save it, and use it again. We will also

upload datasets for the class on Blackboard, and we recommend that you create a new folder

called “Stata” in C:\ and save those datasets in that folder. The location of your

datasets will therefore be C:\Stata. For now, we will use a simple dataset, cancer.dta, posted on

Blackboard. Download the file cancer.dta and save it in C:\Stata. Click once in the

Command window to put the cursor there, and then type the command use

"C:\Stata\cancer.dta", clear (type straight double quotes; Stata does not accept curly

quotes around a filename). The Command window should look like the one in

figure 1.5.1. Then press ENTER on your keyboard.

Figure 1.5.1. Stata command to open cancer.dta

The command use filename, clear reads a previously saved dataset. If filename is

specified without an extension, Stata assumes it to be .dta. If your filename contains

embedded spaces, remember to enclose it in double quotes. Now that we have some data read

into Stata, type describe in the Command window and press ENTER. The command

describe will produce a brief description of the contents of the dataset.

. describe

Contains data from C:\Stata\cancer.dta
  obs:            48                          Patient Survival in Drug Trial
 vars:             8                          21 Oct 2010 10:23
 size:           768 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
studytime       int    %8.0g                  Months to death or end of exp.
died            int    %8.0g                  1 if patient died
drug            int    %8.0g                  Drug type (1=placebo)
age             int    %8.0g                  Patient's age at start of exp.

Sorted by:

The description includes a lot of information: the full name of the file, cancer.dta

(including the path used to locate the file); the number of observations (48); the number of

variables (8); the amount of memory the data consume (768 bytes) and how much of Stata’s

memory is still available (99.9% of memory free); a brief description of the dataset

(Patient Survival in Drug Trial); and the date the file was last saved (21 Oct 2010

10:23). The body of the table displayed shows the names of the variables on the far left and

the labels attached to them on the far right. We will discuss the middle columns later.

Now that you have opened cancer.dta, note that the Variables window lists the four

variables studytime, died, drug, and age.
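
The steps described above can be sketched as the following commands. This is only a minimal sketch: it assumes cancer.dta has been saved in C:\Stata as recommended, and the folder name containing a space is hypothetical, shown only to illustrate why the double quotes matter.

```stata
* Load the dataset saved in C:\Stata; the clear option drops any data
* that are already in memory
use "C:\Stata\cancer.dta", clear

* The .dta extension may be omitted; Stata assumes it
use "C:\Stata\cancer", clear

* Hypothetical folder name with an embedded space: the quotes are required
* use "C:\Stata data\cancer.dta", clear

* Show a brief description of the dataset now in memory
describe
```

Remember that the lines beginning with * are comments, and that none of these lines is typed with the dot prompt.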

1.5 An example of a short Stata session

If you do not have cancer.dta loaded, type the command use "c:\Stata\cancer.dta",

clear. We will execute a basic Stata analysis command. Type summarize in the Command

window and then press ENTER.

In the Results window, the summarize command will display the number of observations

(Obs, also called cases or N), the mean, the standard deviation, minimum value, and the

maximum value for each variable. The output from summarize is most useful for

continuous variables, not categorical ones. For example, the mean and standard deviation of gender

do not make sense.

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   studytime |        48        15.5    10.25629          1         39
        died |        48    .6458333    .4833211          0          1
        drug |        48       1.875    .8410986          1          3
         age |        48      55.875    5.659205         47         67

The first line of output displays the dot prompt followed by the command. After that, the

output appears as a table. As you can see, there are 48 observations in this dataset.

Observations is a generic term. These could be called participants, patients, subjects,

organizations, cities, or countries depending on your field of study. In Stata, each row of data

in a dataset is called an observation. The average, or mean, age is 55.875 years with a

standard deviation of 5.659, and the subjects are all between 47 (the minimum) and 67 (the

maximum) years old.

You might want to select specific variables to summarize instead of summarizing them all.

You can do this either by typing the names of the variables after summarize or by typing

summarize and then clicking on a variable name in the Variables window. For example, typing

summarize studytime age will display only statistics for the two variables named

studytime and age.
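
As a short sketch of this, the commands below summarize only the two continuous variables. The tabulate command in the last line has not been introduced yet in this section; it is included here only as an assumption about what you might prefer for a categorical variable such as drug, where a mean is hard to interpret.

```stata
* Assumes cancer.dta is saved in C:\Stata, as described earlier
use "C:\Stata\cancer.dta", clear

* Summary statistics for two continuous variables only
summarize studytime age

* For a categorical variable, a frequency table is usually more
* informative than a mean and standard deviation
tabulate drug
```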

We will do one more thing in this Stata session: we will make the histogram for the age

variable, shown in figure 3.

Copying and Pasting your Output from the Stata Results window to a Word document

You can copy the output from the Results window. To do this, highlight (select)

the output you want to copy, right-click, and select Copy Text. Then paste it into your

Word file by using Ctrl+V. You will probably need to change the font to Courier

New and reduce the font size (for example, to 9 points) after pasting to prevent the

lines from wrapping.

Figure 3. Histogram for age

A histogram is just a graph that shows the distribution of a variable, such as age, that takes on

many values. Simple graphs are simple to create. Just type the command histogram age in

the Command window, and Stata will produce a histogram using reasonable assumptions.

At first glance, you may be happy with this graph. Stata used a formula to determine that six

bars should be displayed, and this is reasonable. However, Stata starts the lowest bar (called a

bin) at 47 years old, and each bin is 3.33 years wide (this information is displayed in the

Results window), even though we are not accustomed to measuring years in thirds of a year.

Also notice that the vertical axis measures density, but we might prefer that it measure

frequency, that is, the number of people represented by each bar.

Below you will find a command to improve the histogram (figure 4). In reading this

command you will want to ignore the opening ‘dot’ (Stata prints this in front of commands in

the Results window, but the ‘dot’ is not part of the command and you do not type it). Stata

prints a > sign at the start of the second and third lines, which might be confusing. Stata uses the

ENTER key to submit a command. Because of this, Stata sees the entire command as one

line. To print the entire line within the confines of the Results window, Stata inserts a > at each line

break. If you want to enter this command in the Command window, simply type the entire

thing without the > signs and let Stata do the wrapping as needed in the Command window. Never

press the ENTER key until you have entered the entire command.

. histogram age, width(2.2) start(45) frequency title(Age Distribution of
> Participants in Cancer Study) note(Data: Sample cancer dataset)
> legend(on) scheme(s1mono)

Figure 4. Improved Histogram for age

If you do want to enter a long command in the Command window, remember to type it as

one line. Whenever you press ENTER, Stata assumes that you have finished the command

and are ready to submit it for processing.
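
To make this concrete, here is a sketch of the same histogram command written two ways. The first form is what you would type in the Command window: one line, with no dot prompt and no > signs. The second form uses the /// line-continuation marker, which works only inside a do-file (do-files are covered in Section 4), not in the Command window. Both forms assume cancer.dta is already loaded.

```stata
* Form 1: typed in the Command window as a single line
histogram age, width(2.2) start(45) frequency title(Age Distribution of Participants in Cancer Study) note(Data: Sample cancer dataset) legend(on) scheme(s1mono)

* Form 2: the same command split across lines in a do-file;
* /// tells Stata that the command continues on the next line
histogram age, width(2.2) start(45) frequency ///
    title(Age Distribution of Participants in Cancer Study) ///
    note(Data: Sample cancer dataset) ///
    legend(on) scheme(s1mono)
```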

To finish our Stata session, we need to close Stata. Type exit in the Command window and

press ENTER.

1.6 Quick Review

In this section we covered:

The font and punctuation conventions that will be used throughout this tutorial

The Stata windows

How to open a sample Stata dataset

How to summarize the variables

How to create and modify a simple histogram

1.7 Exercises

1. For this exercise you need the dataset nhanes.dta, which is posted on Blackboard. Save

nhanes.dta in the folder c:\Stata. Open nhanes.dta. Run two commands, describe

and summarize, and answer the following questions. [Hint: To continue with more results in

the Results window, press SPACEBAR for the next page of results or press any other

key except “q” for the next line of results. Pressing “q” will stop the output.]

a. How many observations and variables are there in the dataset nhanes.dta?

b. What are the names of the variables whose labels are Marital Status, Annual

household income, and Smoked at least 100 cigarettes in life?

c. What are the variable labels for the variables riagendr, smd630, bmxwt, and

bmxht?

d. What is the mean age of the respondents? (HINT: You can get the mean age of the

respondents using the variable ridageyr.) What is the range of age in

the distribution? (Note: Range = Maximum observation - Minimum observation)

e. Can you get the mean of the variable riagendr? Explain.

f. Show the number of observations, the mean, the standard deviation, minimum

value, and the maximum value for the two variables: bmxwt and bmxht

Section 2: Entering data

2.1 Introduction

This section shows how to create a dataset and enter data into the dataset.

2.2 Creating a dataset

In this section, you will learn how to create a dataset. Data entry and verification can be

tedious, but these tasks are essential for you to get accurate data.

Stata almost always requires data that are set in a grid, like a table, with rows and columns.

Each row contains data for an observation (which is often a subject or a participant in a

study), and each column contains the measurement of some variable of interest about the

observation, for example, age, race, gender, etc.

There is more to a dataset than just a table of numbers. Datasets usually contain labels that

help the researcher use the data more easily and efficiently. In a dataset, the columns

correspond to variables, and variables must be named. We can also attach to each variable a

descriptive label (e.g. smoking status), which will often appear in the output of statistical

procedures. Since most data will be numbers, we can also attach value labels to the numbers

to clarify what the numbers mean (e.g. “smokers” for 1 and “non-smokers” for 0).

It is extremely helpful to pick descriptive names for each item. For example, you might call a

variable that contains people’s responses to a question about their state’s schools q23, but it

is often better to use a more descriptive name, such as schools. If there were two questions

about schools, you might want to name them schools1 and schools2 rather than q23a

and q23b. If a set of items is to be combined into scales and the items are not intended to be used

alone, you might want to use names that correspond to the questionnaire items for the

original variables and reserve the descriptive names for the composite scores. This is useful

with complex datasets where several people will be using the same dataset. Each user will

know that q23a refers to question 23a, whereas descriptive names like schools1 may make

sense to one user but not to another. Whatever the logic of your naming, try to keep

names short, because you will type each variable name many times during your analysis.

The name age is preferable to age_of_respondents. In addition, some of Stata’s

tabular output is designed for short names, and longer names will be truncated.

Variables and Items

In working with questionnaire data, an item from the questionnaire almost always

corresponds to a variable in the dataset. So if you ask for a respondent’s age, then you

will have a variable called age. Some questionnaire items are designed to be combined

to make scales or composite measures of some sort, and new variables will be created

to contain those composite scores (e.g. bmi), but there is no single item corresponding to the scale

or construct on the questionnaire. The terms item and variable are often used

interchangeably, but they are not synonyms. An item always refers to one question. A

variable may be the score on an item or a composite score based on several items.

Even with relatively descriptive variable names, it is usually helpful to attach a longer and

more descriptive label to each variable. We call these variable labels to distinguish them

from variable names. For example, you might want to label the variable smokestatus with

the label “Current smoking status”. The variable label gives us a clearer understanding of the

data stored in that variable.

For some variables, the meaning of the values in the data is obvious. If you measure people’s

height in meters, it is clear what the values mean when you see them. For other variables,

the meaning of the values needs to be specified with value labels. For example, if the responses

“Yes” and “No” are coded as 1 and 0, respectively, tables of data are

much easier to understand if you create value labels to be displayed in the output in addition to (or instead

of) the numbers.
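
The difference between variable names, variable labels, and value labels can be sketched with a few commands. This is only an illustration under stated assumptions: the three rows of data are made up, and the value-label name yesno is a hypothetical name chosen for this example. Labeling is covered in detail in Section 3.

```stata
* Tiny made-up dataset, for illustration only (not the survey data)
clear
input smoked100 smokestatus
1 1
0 .
1 2
end

* Variable labels: a longer description attached to each variable name
label variable smoked100 "Smoked at least 100 cigarettes in life"
label variable smokestatus "Current smoking status"

* Value labels: first define a mapping from numbers to text
* (yesno is a hypothetical label name), then attach it to a variable
label define yesno 0 "No" 1 "Yes"
label values smoked100 yesno

* The frequency table now shows No/Yes instead of 0/1
tabulate smoked100
```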

2.3 An example of questionnaire

We have discussed datasets in general, so now let’s create one. Suppose that we conducted a

telephone survey of 20 people and asked each of them nine questions, which are shown in the

example questionnaire below. Our task is to convert the questionnaire’s answers into a

dataset that we can use with Stata.

Example Questionnaire:

Tobacco Survey

1. What is your gender?

___Female ___Male

2. What is your current age in years?

______

3. How many years of education have you completed?

___ 0-8 ___ 9-10

___11-12 ___13-16

___16-19 ___20+

4. Have you smoked at least 100 cigarettes in your entire lifetime?

___Yes ___No (if “No” thank the respondent and end the interview)

5. How old were you when you first started smoking cigarettes fairly regularly?

____ Years (if never smoked regularly enter “X” and GO TO 6)

(Note: “X” will be treated as missing in data entry process)

6. Do you now smoke cigarettes every day, some days, or not at all?

___ Every day ___ Some days ___ Not at all ___No response

(If “Every day” then GO TO 7, ELSE thank the respondent and end the interview)

7. On the average, about how many cigarettes do you now smoke each day?

____ Number of cigarettes

8. How soon after you wake up do you typically smoke your first cigarette of the day?

___ Hours ___Minutes

9. How many of your friends smoke?

___ None

___1-5 friends

___ 6-10 friends

___>10 friends

___Don’t Know

___Refused to answer

2.4 Developing a coding system

Statistics is most often done with numbers, so we need to have a numeric coding system for

the answers to the questions. Stata can use numbers or words. For example, we could type

Female if a respondent checked it. However, it is usually better to use some sort of numeric

coding, so you might type 1 if the respondent checked the option Male on the questionnaire

and 2 if the respondent checked Female. We will need to assign a number to enter for each

possible response for each of the items on the survey.

You will also need a short variable name for the variable that will contain the data for each

item. Variable names can contain uppercase and lowercase letters, numerals, and the

underscore character, and they can be up to 32 characters long. No blank spaces are allowed

for variable names. The variable name mother age would be interpreted as two variables,

mother and age. Generally you should keep your variable names to 10 characters or fewer,

but 8 or fewer is best. Variable names should start with a letter.

If appropriate you should explain the relationship between any numeric codes and the

responses as they appeared on the questionnaire. For an example, see the example codebook

(not to be confused with the Stata command codebook, which we will use later) for our

questionnaire that appears in table 2.4.1.

Table 2.4.1. Example codebook

Question | Variable name | Variable label | Value labels and codes
Identification number | id | Record in order | ENTER ID# 1 to 20
What is your gender? | gender | Gender of respondents | Male = 1; Female = 2
What is your current age in years? | age | Age of respondents | ENTER AGE IN YEARS
How many years of education have you completed? | education | Highest level of educational attainment | 0-8 = 8; 9-19 = 9 to 19; 20+ = 20; No response = -9
Have you smoked at least 100 cigarettes in your entire lifetime? | smoked100 | Smoked at least 100 cigarettes in life | Yes = 1; No = 0; No response = -9
How old were you when you first started smoking cigarettes fairly regularly? | initiation | Age at initiation | ENTER AGE IN YEARS; No response = -9
Do you now smoke cigarettes every day, some days, or not at all? | smokestatus | Current smoking status | Every day = 1; Some days = 2; Not at all = 3; No response = -9
On the average, about how many cigarettes do you now smoke each day? | numcig | Number of cigarettes per day | ENTER NUMBER; No response = -9
How soon after you wake up do you typically smoke your first cigarette of the day? | firstcig | Time to first cigarette of the day (in minutes) | ENTER TIME IN MINUTES; No response = -9
How many of your friends smoke? | peersmoke | Number of friends who smoke | None = 0; 1-5 friends = 1; 6-10 friends = 2; >10 friends = 3; No response = -9

A codebook translates the numeric codes in your dataset back into the questions you asked

your participants and the choices you gave them for answers. Regardless of whether you

gather data with a computer-assisted interviewing system or with paper questionnaires, the

codebook helps to make sense of your data and your analyses. If you do not have a

codebook, you might not realize that everyone with eight or fewer years of education is

coded the same way, with an 8. That may have an impact on how you use that variable in

later analyses.

We have added an id variable to identify each respondent. In simple cases, we just number

the questionnaires sequentially. If we have a sample of 5,000 people, we will number the

questionnaires from 1 to 5,000, write the identification number on the original questionnaire,

and record it in the dataset. If we discover a problem in our dataset (e.g. somebody with a

coded value of 3 for gender), we can refer back to the questionnaire and determine the

correct value.

Some data will be missing -- people may refuse to answer certain questions, interviewers

forget to ask questions, equipment fails to record a measurement, etc. and we need a code to

indicate that the data are missing. If we know why the response is missing, we will record the

reason too, so we may want to use different codes that correspond to the different reasons the

data are missing.

On surveys, respondents may refuse to answer, may not express an opinion, or may not have

been asked a particular question because of the answer to an earlier question. Here we might

code “refused to answer” as -4, “don’t know” as -3, “valid skip” as -2, and “missing for any

other reason” as -1. We should pick values that can never be a valid response. In this session

we will leave the cell “blank” for the missing values.
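
If you do enter numeric codes for missing data instead of leaving cells blank, Stata needs to be told that those codes are not real values. The sketch below is one possible way to do that with the mvdecode command; the three rows of data are made up for illustration, and -9 follows the “No response” code in table 2.4.1.

```stata
* Made-up values for illustration; -9 is the "No response" code
clear
input id numcig firstcig
1 15 15
2 -9 -9
3 12 30
end

* Convert every -9 in the listed variables to Stata's missing value (.)
mvdecode numcig firstcig, mv(-9)

* If the reason for missingness matters, each code can instead be mapped
* to its own extended missing value (.a, .b, ...), for example with the
* four codes suggested above:
* mvdecode numcig firstcig, mv(-4=.a \ -3=.b \ -2=.c \ -1=.d)

* Missing values are now excluded from the statistics
summarize numcig firstcig
```

Without this step, summarize would treat -9 as a valid number and pull the mean downward.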

We will be entering the data ourselves, so after we administered the questionnaire to our

sample of 20 people, we prepared a coding sheet that will be used to enter the data. The

coding sheet originates from the days when data were entered by professional keypunch

operators, but it can still be useful. When you create a coding sheet, you are converting the

data from the format used on the questionnaire to the format that will actually be stored in the

computer (a table of numbers). The more the format of the questionnaire differs from a table

of numbers, the more likely it is that a coding sheet will help prevent errors.

In general, if you transcribe the data from the questionnaires to the coding sheet, you will

need to decide which responses go in which columns. Deciding this will reduce errors from

those who perform the data entry, who may not have the information needed to make those

decisions properly.

Whether you enter the data directly from the questionnaires or create a coding sheet will

depend largely on the study and on the resources that are available to you. Some experience

with data entry is valuable because it will give you a better sense of the problems you may

encounter, whether you or someone else enters the data; so, for our example questionnaire,

we have created a coding sheet shown in table 2.4.2.

Table 2.4.2. Example coding sheet

id  gender  age  education  smoked100  initiation  smokestatus  numcig  firstcig  peersmoke
 1       2   25          8          1          11            1      15        15          3
 2       1   27          9          0
 3       2   29          8          0
 4       1   35         12          1          12            1      12        30          2
 5       2   54          9          1          12            1      15        60          2
 6       2   43          8          1          10            2
 7       2   47         16          1          10            1       7       120          0
 8       1   26          8          0
 9       1   19         10          1           9            3
10       1   20          9          1          10            1      14        10          2
11       2   19         11          1          11            1      13        10          3
12       2   17         12          1          10            2
13       2   26         15          0
14       1   24         15          1           9            1      10        90          2
15       1   23         16          1          11            1      11       180          2
16       1   22         10          1          12            1       3       240          1
17       2   21         13          1          13            1       9        60          1
18       1   20         13          1          12            3
19       2   41         14          1          15            2
20       1   38         19          1          16            1      20         5          3

Because we are not reproducing the 20 questionnaires in this tutorial, it may be helpful to

examine how we entered the data from one of them. We will use the 5th questionnaire. We

have assigned an id of 5, as shown in the first column. Reading from left to right, in the

second column, for gender, we have recorded a 2 to indicate a woman. The woman is 54

years old with 9 years of education. She has smoked at least 100 cigarettes in her lifetime

(smoked100=1). She started smoking fairly regularly from the age of 12 and is a daily smoker

(smokestatus=1). In a day she smokes around 15 cigarettes and 1 hour (60 minutes) after

she wakes up in the morning she smokes her first cigarette. She has 6-10 friends who are

smokers (peersmoke=2). [Note: Refer to Table 2.4.1 for the codes of particular value labels.]

2.5 Entering data

We will use Stata’s Data Editor to enter our data from the coding sheet in table 2.4.2. The

Data Editor provides an interface similar to that of a spreadsheet, but it has some features that

are particularly suited to creating Stata datasets. Before opening the Data Editor, you might

want to save any open files and then enter the command clear in the command window.

This step will give you a fresh Data Editor in which to enter data. Enter the clear command

only if you want to start with a new dataset that has nothing in it. Type edit in the Command

Window. The Data Editor window is shown in figure 2.5.1.


Figure 2.5.1 Data Editor Window

Data are entered in the white columns, just as in other spreadsheets. Here we will only enter

the data for the first respondent to the example survey. In the first white cell under the first

column, enter the identification number of the first respondent, which is a 1. Press the Tab

key to move across to the next cell, and enter the value from the second column of the coding

sheet, which is a 2 because the first case is a woman. Keep entering the values and pressing

Tab until you have entered all the values for the first participant. After you have entered the

last number in the first row of the coding sheet, press Enter, which will

move the cursor to the second row of the Data Editor. Press the Home key to move to the

first column of this row.

After we have the data entered for the first respondent, let us modify the current variable

names (var1, var2,…) according to our codebook. Double-click on the gray cell at the top

of the first column, which now contains the variable name var1. Double-clicking on this cell

opens the variable properties dialog box. Figure 2.5.2 shows how we can use this dialog box

to change the variable name to id and the label to Record in order. Use the Tab key to jump

from the ‘Name’ box to the ‘Label’ box.

Figure 2.5.2 Variable name and variable label


Click Apply, then in the Data Editor window click var2 and name it gender, with the

label Gender of respondents. Click Apply and continue in the same way until all the

remaining generic variable names have been renamed and labeled as listed in table 2.4.1. Close the

dialog box.

Enter the values of all the variables for the remaining respondents using the coding sheet in table 2.4.2.

Click the Save icon and close the Data Editor window.

All our data here are numeric, none of the data contain decimals, and none are wider than

eight digits, so we can leave the format alone. Once you have defined a variable as numeric,

Stata will warn you if you try to enter data that are not numeric, which can help reduce

errors.

Editing Variables

Using the Data Editor’s variable properties dialog box, we can modify any variable we are

interested in. Type edit in the Command window and run it to open the Data Editor

window. Double-click the variable we want to modify, and then change its name and

label. Close the dialog box, click the Save icon in the Data Editor window, and close it.

Recoding Variables

Suppose we want to recode the age variable into three groups: 17-29 years, 30-39 years,

and 40+ years. We assign the codes 1, 2, and 3 to these age groups under a new

variable, agegroup, in our coding sheet. The observations with id 1, 2, 3, 4, … would

then have the agegroup values 1, 1, 1, 2, … . This is our modified coding sheet. To

create the new variable, open the Data Editor window and enter the recoded value of

the first observation in an empty cell of the column where we want the new variable.

Then double-click the new variable and modify its name and label. Close the variable

properties dialog box and enter the agegroup values for all the remaining observations

from the modified coding sheet. Click the Save icon in the Data Editor window and

close the window. You will learn a more efficient way to do this in a later section.

2.6 Saving your dataset

If you look at the Results window in Stata, you can see that the dialog box has done a lot of

work for you. The Results window shows that a lot of commands have been run to rename

and label the variables. These also appear in the Review window. The Variables window lists

all of your variables. We have now created our first dataset. Let’s save this dataset under the

file name firstsurvey:

. save "C:\Stata\firstsurvey.dta", replace

2.7 Checking the data

We have created a dataset, and now we want to check the work we did defining the dataset.

Checking for the accuracy of our data entry is also our first statistical look at the data. To

open the dataset type in the Command window

. use "C:\Stata\firstsurvey.dta", clear

Let us run a couple of commands that will characterize the dataset and the data stored in it in

slightly different ways. We created a codebook to use when creating our dataset. Stata

provides a convenient way to reproduce much of that information, which is useful if you

want to check that you entered the information correctly:

. codebook gender

--------------------------------------------------------------------------
gender                                               Gender of respondents
--------------------------------------------------------------------------
                  type:  numeric (byte)

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/20

            tabulation:  Freq.  Value
                            10  1
                            10  2

Let us go over the display for gender. The first line lists the variable name, gender, and the

variable label, Gender of respondents. Next the type of the variable, which is numeric

(byte), is shown. The range of this variable (shown as the lowest value, then the highest

value) is from 1 to 2, there are two unique values, and there are 0 missing values in the 20

cases. This information is followed by a table showing the frequencies and values. (If you

label the values using the method described in Section 3, the labels will also appear here.) We

have 10 cases with a value of 1 and 10 cases with a value of 2.

Using codebook is an excellent way to check your categorical variables, such as gender

and smokestatus. If, in the tabulation, you saw any values other than 1 and 2 for gender, or

any values other than 1, 2, 3, and . for smokestatus, you would know that there are

errors in the data. [The ‘.’ in the dataset represents a missing value. There are other ways to

represent missing values too.] Looking at these kinds of summaries of your variables will

often help you detect data-entry errors, which is why it is crucial to review your data after

entry.
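As a minimal sketch of this kind of check, the commands below tabulate the two categorical variables together with their missing values and then ask Stata to confirm that only the expected codes occur. The variable names come from table 2.4.2; the use of assert, inlist(), and missing() is our own suggestion and not part of the original exercise.

```stata
* Load the dataset saved in section 2.6
use "C:\Stata\firstsurvey.dta", clear

* Frequency tables that also show missing values (.)
tabulate gender, missing
tabulate smokestatus, missing

* assert stops with an error if any observation violates the condition
assert inlist(gender, 1, 2)
assert inlist(smokestatus, 1, 2, 3) | missing(smokestatus)
```

If both assert commands finish without a message, every recorded value falls in the expected set.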

2.8 Quick Review

In this section we covered

Creating a dataset

An example of questionnaire

Developing a coding system

Entering data

Saving your dataset

Checking the data

2.9 Exercises

1. For this exercise you need the dataset that you created, firstsurvey.dta.

a. Using the Data Editor window, recode education into a new variable edurec

with three groups: < 13 years of education, 13-16 years of education, and 17 or

more years of education (use Table 2.4.2 to recode education into the new

variable).

b. Generate a codebook for edurec and display the output.

c. Save your dataset under a new name, mysurveytrunc.dta.


Section 3: Preparing for data analysis

3.1 Introduction

This section shows how to prepare a dataset for analysis.

3.2 Planning your work

Most of the time on any research project is spent getting your data prepared for analysis. We

will cover a few basic steps in preparing your dataset. The data we will be using in this

tutorial and class are from different sources: the National Health and Nutrition Examination

Survey (NHANES), the Behavioral Risk Factor Surveillance System (BRFSS), and the Tobacco Use

Supplement to the Current Population Survey (TUS-CPS).

It is useful to create an outline of the steps needed to go from data collection to analysis. The

outline should include what needs to be done to collect or obtain the data, read and label the

data, make any necessary changes to the data, create any composite variables (sometimes

referred to as scales), and finally create an analysis-ready version of the dataset. Our project

outline, which is for a small project, is as follows:

Consult NHANES/BRFSS/TUS-CPS documentation to determine the type of

variables needed.

Download the data and look for codebooks or descriptions of the variables.

Create a basic Stata dataset (e.g. cancer.dta, nhanes.dta): We have uploaded

these datasets in Blackboard.

Create variable and value labels.

Recode missing-value codes to Stata missing values.

Reverse code those variables that need it; verify.

Copy variables not reversed to named variables; verify.

Create the scale variable.

Save the analysis-ready copy of the dataset.

Note: The .dta versions of the datasets are uploaded to Blackboard. All of these datasets

were originally downloaded in different formats from the websites of the respective surveys

and were later processed and saved in .dta format, ready for use in Stata. In *.dta, *

stands for the file name and .dta is the extension that identifies the file type. Remember that

you should download all your datasets from Blackboard into the folder “Stata” that you

created in C:\.


3.3 Loading data and inputting commands

Before starting to work with your dataset, you must load it into Stata. To do this,

first download nhanes.dta from Blackboard to your Stata folder, then type use

"C:\Stata\nhanes.dta", clear at the Stata command line and press ENTER.

Remember that Stata is case sensitive, so you must preserve uppercase and lowercase letters. For

example, if you type Describe instead of describe in the Command window, Stata will not

recognize it and will return an error message. When typing in the Command window, leave out

the period that appears before each command in this tutorial, as well as any punctuation mark

that follows a command in the running text.

Once you load your dataset you can look at its contents by typing describe. You should see

the output displayed below. The output is truncated here (only part of it is

shown):

. describe

Contains data from C:\Documents and Settings\Mohammad\Desktop\Lava_Stata Tutorial\nhanes.dta
  obs:         1,100
 vars:            62                          9 Dec 2010 14:26
 size:       550,000 (94.8% of memory free)
--------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
seqn            double %12.0g                 Respondent sequence number
riagendr        double %12.0g                 Gender
ridageyr        double %12.0g                 Age at Screening Adjudicated - Recode
ridagemn        double %12.0g                 Age in Months at Screening - Recode
ridageex        double %12.0g                 Age in Months at Exam - Recode
ridreth1        double %12.0g                 Race/Ethnicity - Recode

(Rest of the output is omitted)
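With 62 variables the full describe output is long. As a small sketch, assuming nhanes.dta is stored in C:\Stata as described above, you can restrict describe to a few variables or ask only for the dataset header:

```stata
use "C:\Stata\nhanes.dta", clear

* Describe only the variables of interest
describe seqn riagendr ridageyr

* Show just the header: number of observations, number of variables, size
describe, short
```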

Note that the datasets uploaded to Blackboard are random subsamples of all

observations. They are not the complete datasets for the respective surveys and will be

used only for the purposes of this class.

Note that every time you see words in typewriter font, you should type them

in the Command window and press ENTER.


In general, describe provides information about the number of variables and the number of

observations in your dataset. describe also reports the size of the dataset and the percentage

of the memory allocated to Stata that is still free. For each variable, it lists the name, storage

type, display format, value label, and variable label.

3.4 Looking at data

Using the command list, we get a closer look at the data. The command lists all the

contents of the data file. You can look at each observation by typing

. list

     +-----------------------------------------------------------------+
  1. |    seqn | riagendr | ridageyr | ridagemn | ridageex | ridreth1 |
     |   45040 |        2 |        9 |      118 |      118 |        1 |
     |-----------------------------------------------------------------|
     | dmdmartl | dmdhhsiz | indhhin2 |  duq200 |  duq210 |  duq220q |
     |        . |        4 |        2 |       . |       . |        . |
     +-----------------------------------------------------------------+

(Rest of the output is omitted)

You will see a period (.) for variables with no information recorded. Stata calls it a

“missing value” or just “missing”. In Stata, a period or a period followed by any letter a

to z indicates a missing value. Later in this tutorial we will discuss how to define

and handle missing values in Stata.

Scrolling through a large number of observations is tedious, so the list command is not

very helpful with a large dataset. Even with a small dataset, list can display too much

information to process easily. However, sometimes you can take a glance at the first few

observations to get a first impression or to check the data. When you see --more-- in the

Results window, press SPACEBAR to scroll down to the next page, or press ENTER if you only

want to go to the next line. After looking at a few observations, you might want to stop listing

rather than scroll to the last observation. You can stop the output by pressing q, for quit. Anytime

you see --more-- on the screen, pressing q will stop listing results.
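Instead of interrupting a long listing, you can ask list for only the part you need. The sketch below assumes nhanes.dta is in memory and uses the in qualifier, which is covered in Section 4, together with a short variable list:

```stata
* First five observations, all variables
list in 1/5

* Selected variables for the first ten observations
list seqn riagendr ridageyr in 1/10
```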

3.5 Creating variable and value labels

Now let’s add labels to the variables and their values. There are essentially three steps in

labeling:

i. Create a label for the variable itself

ii. Define a label for values

iii. Attach that label to a specific variable.


We first label our variable riagendr:

. label variable riagendr "Sex of Respondents"

Now the variable riagendr has a new label “Sex of Respondents”. You can see the new

label in the Variables window. If you use the command tab riagendr you will see this new

label in the output (note that tab is the short form of tabulate). The command tab

generates a frequency distribution of a variable.

. tab riagendr

     Sex of |
respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        544       49.45       49.45
          2 |        556       50.55      100.00
------------+-----------------------------------
      Total |      1,100      100.00

However, you will notice that no labels for the values have been defined yet. Now we want

to label the values of the categories of riagendr. To do this use the command label define.

First, we have to define a name for our labels. Let us call it ‘gender’.

. label define gender 1 "male" 2 "female", modify

At this point, you have only defined a label but have not attached it to the variable. If you

type label list you will see all the labels in your dataset. You will find the label for

‘gender’ with all the category values defined.

Now we must attach these category labels to the variable:

. label value riagendr gender

Note that the word “gender” in the label define and label value commands is arbitrary. It

can be any word as long as it is the same word in the two commands. Now if we tabulate the

variable riagendr we get the categories of the variable that were defined above:

. tab riagendr

     Sex of |
respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |        544       49.45       49.45
     female |        556       50.55      100.00
------------+-----------------------------------
      Total |      1,100      100.00


Let us see another example with the variable dmdmartl (marital status):

First label the variable dmdmartl

. label variable dmdmartl "Marital Status of Respondents"

Now the output for tabulate dmdmartl will give

. tab dmdmartl

    Marital |
  Status of |
Respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        323       51.35       51.35
          2 |         59        9.38       60.73
          3 |         71       11.29       72.02
          4 |         26        4.13       76.15
          5 |        108       17.17       93.32
          6 |         42        6.68      100.00
------------+-----------------------------------
      Total |        629      100.00

Now define labels and attach them to the variable:

. label define mar 1 "Married" 2 "Widowed" 3 "Divorced" 4 "Separated" 5 "Never married" 6 "Living with partner"
. label value dmdmartl mar

New output for tab dmdmartl is

. tab dmdmartl

  Marital Status of |
        Respondents |      Freq.     Percent        Cum.
--------------------+-----------------------------------
            Married |        323       51.35       51.35
            Widowed |         59        9.38       60.73
           Divorced |         71       11.29       72.02
          Separated |         26        4.13       76.15
      Never married |        108       17.17       93.32
Living with partner |         42        6.68      100.00
--------------------+-----------------------------------
              Total |        629      100.00

Refer to http://www.cdc.gov/nchs/nhanes/nhanes2007-2008/nhanes07_08.htm to get the

correct value labels for the different categories of each variable in the dataset. For

example, click Demographics and then go to Docs under Demographic Variables and

Sample Weights; on the right-hand side you will see a table of contents with a list of

demographic variables. Click riagendr and you will see that 1 is male and 2 is

female. Use these codes in defining the value labels. Similarly, click dmdmartl and

you will see the value descriptions we used earlier in defining the value labels for

this variable.



3.6 Creating and modifying variables

There are several commands that are used in creating, replacing, and modifying variables.

Some useful commands include generate, replace, rename, and recode.

3.6.1 Generate and Replace: generate creates a new variable, whereas replace changes the

contents of an existing variable. To ensure that you do not accidentally lose data, you

cannot overwrite an existing variable with generate and you cannot create a new

variable with replace. The command syntax for both is the same: you specify the name of the

command, followed by the name of the variable to be created or replaced. Then you place

an equal sign after the variable name and specify the expression whose value is to be stored.

You can create a new variable:

. generate agegrp=.

This will generate a new variable called agegrp with missing values for all observations.

When you use the command tab agegrp, you will see no observations.

. tab agegrp
no observations

replace changes the content of a variable. Below we change the content of the variable

agegrp to 9:

. replace agegrp=9

This will replace all the missing values of agegrp with 9.

Now when you tab agegrp you will see that the missing values have been replaced by 9

for the variable.

. tab agegrp

     agegrp |      Freq.     Percent        Cum.
------------+-----------------------------------
          9 |      1,100      100.00      100.00
------------+-----------------------------------
      Total |      1,100      100.00


Now let us use ridageyr to build the categorical variable agegrp. We have already

created agegrp. Now let us categorize age into 0-17 years, 18-49 years, and

50+ years.

. replace agegrp=1 if (ridageyr>=0 & ridageyr<18)
. replace agegrp=2 if (ridageyr>=18 & ridageyr<=49)
. replace agegrp=3 if ridageyr>49 & ridageyr<.

Based on the variable ridageyr (age at screening), the above set of replace commands

will categorize the new variable agegrp into three levels: 0-17 years, 18-49 years, and

50+ years. The condition ridageyr<. in the last command keeps any observation with a

missing age out of the 50+ group, because Stata treats missing values as larger than any number.

You can now generate the frequency distribution of agegrp using the command

. tab agegrp

[Output not shown]

Based on the previous example of labeling riagendr, you can now label the new variable

agegrp, define its value labels, and attach the labels to it. Then re-generate its frequency

distribution table.

[Commands and Output not shown]
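One possible way to carry out these three labeling steps is sketched below. The label name agelbl and the wording of the variable label are our own choices for illustration; any names will do as long as the same label name is used in label define and label values.

```stata
* Step i: label the variable itself
label variable agegrp "Age group of respondents"

* Step ii: define the value labels (the name agelbl is arbitrary)
label define agelbl 1 "0-17 years" 2 "18-49 years" 3 "50+ years"

* Step iii: attach the value labels to the variable, then re-tabulate
label values agegrp agelbl
tab agegrp
```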

You can also generate a new variable from existing variables by using arithmetic operators

and functions in expressions. Table 3.6.1.1 shows the arithmetic symbols that can

be used in expressions.

Naming variables: Variable names can be up to 32 characters long. However, it is a

good idea to keep names concise (no more than 8 characters is recommended) to save

time when you have to type them repeatedly and to ease work with other statistical

software packages. A name cannot begin with a number and cannot contain spaces. You

can build names from letters (A-Z and a-z), numbers (0-9), and underscores ( _ ). The

following names are reserved and are not allowed:

_all    _b      byte    _coef   _cons   double  float
if      in      int     long    _n      _N      _pi
_pred   _rc     _skip   str#    using   with


Table 3.6.1.1. Arithmetic Symbols

Symbol   Operation             Example
+        Addition              mscore+fscore+sibscore
-        Subtraction           balance-expenses-penalty
*        Multiplication        income*0.75
/        Division              expenses/income
^        Exponentiation (X²)   X^2

Let us create a new variable bmi by applying arithmetic operators to the existing variables

bmxwt (weight in kg) and bmxht (standing height in cm).

. generate bmi=bmxwt/((bmxht/100)^2)

This generate command creates a new variable called bmi which is defined as weight in

kilograms divided by the square of height in meters. This command can also be written

as:

. generate bmi=bmxwt/((bmxht/100)*(bmxht/100))
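To see what the expression does, it can help to try it on a single made-up case before looking at the whole variable. The sketch below uses a hypothetical person weighing 70 kg with a height of 175 cm (these numbers are not from the dataset): 175 cm is 1.75 m, 1.75 squared is 3.0625, and 70/3.0625 is roughly 22.86.

```stata
* Hypothetical person: 70 kg, 175 cm (values chosen only for illustration)
display 70/((175/100)^2)

* Inspect the variable created above for the whole dataset
summarize bmi
```

The display command works as a calculator here, so you can check the formula by hand before relying on the generated variable.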

3.6.2 Renaming variables: To change the name of an existing variable, use the rename

command followed by the old name and then the new name. For example,

. rename riagendr gender

The rename command is suitable when you have only a few variables to change, because you

can change only one variable per rename command. However, in most circumstances,

using the generate command (e.g. generate gender=riagendr) would be preferred, because it keeps the original variable.

3.6.3 Recoding variables: You will often need to combine several values of one variable into a

single value. When you combine values, we recommend that you always generate a new

variable so that you preserve the original variable. The recode command can be used instead of

the generate and replace commands. Let us take the example of the age group created

above:

. recode ridageyr (min/17=1) (18/49=2) (50/max=3), gen(agegrprec)

With recode you assign new values to observations according

to a coding rule. The generate( ) option stores the results in a new variable

instead of overwriting ridageyr. Here min refers to the lowest value of the variable

ridageyr, and max refers to the highest value. Note that the name of the new variable,

agegrprec, is arbitrary.

Compare the frequency distribution of agegrp and agegrprec.
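A minimal sketch of such a comparison, assuming both agegrp and agegrprec were created as shown above: a two-way table places matching categories on the diagonal, and count reports how many observations are coded differently by the two approaches.

```stata
* Cross-tabulate the two versions; the missing option also shows missing values
tabulate agegrp agegrprec, missing

* Number of observations where the two codings disagree
count if agegrp != agegrprec
```

Any disagreement is most likely to come from observations with a missing age, which the replace and recode approaches handle differently.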


3.7 Creating Scales

Let us develop a scale variable. We will construct one scale variable, depression severity,

named depsev, using the 9 items (dpq010, dpq020, dpq030, …, and dpq090) of the

depression screener. Each of these items is scored from 0 (not at all) to 3 (nearly every day).

Close inspection of these variables in the codebook shows that other values are also

assigned to each item: “7 = refused” and “8 = don’t

know”. Therefore, we create a new variable for each item, keeping the values 0 to 3 and

recoding every other value to missing:

. recode dpq010-dpq090 (0=0) (1=1) (2=2) (3=3) (else=.), gen(d1 d2 d3 d4 d5 d6 d7 d8 d9)

Then we create a new variable PHQ9Score which is the aggregate score of variables d1

through d9 each scored from 0 to 3. Therefore, the PHQ9Score ranges from 0 to 27 points.

Higher scores indicate a more severe depression.

. generate PHQ9Score=d1+d2+d3+d4+d5+d6+d7+d8+d9

Now let us generate a scale variable depsev by categorizing PHQ9Score such that

a score of 0 denotes “No Depression”, 1-4 “Minimal Depression”, 5-9 “Mild Depression”,

10-14 “Moderate Depression”, 15-19 “Moderately Severe Depression”, and 20-27 “Severe

Depression”.

. recode PHQ9Score (0=0) (1/4=1) (5/9=2) (10/14=3) (15/19=4) (20/27=5), gen(depsev)

Check the variable depsev using a tab command:

. tab depsev
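The depsev codes 0 to 5 are easier to read with value labels. The sketch below applies the three labeling steps from section 3.5 to depsev, using the category names given above; the label name depsevlbl and the variable label text are our own choices for illustration.

```stata
* Label the variable, define value labels, and attach them
label variable depsev "Depression severity"
label define depsevlbl 0 "No depression" 1 "Minimal depression" 2 "Mild depression" 3 "Moderate depression" 4 "Moderately severe depression" 5 "Severe depression"
label values depsev depsevlbl

* The missing option shows how many respondents have no score
tab depsev, missing
```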

You may want to save your work as mynhanes.dta. You will need this dataset in later

sections and exercises.

. save "C:\Stata\mynhanes.dta", replace

3.8 Quick Review


Inputting commands

Loading data


Variables and observations

Looking at data

Labeling variables

Creating and modifying variables

Creating scales

3.9 Exercises

1. Open firstsurvey.dta.

a. With the help of the variable education, generate a new variable educate using

the following criteria.

If education = 8 then educate = 1; and

If education = 9 to 19 then educate = 2, where 1 = 0-8 years, and 2 =

9-19 years.

b. Label the variable and its values appropriately.

c. Using tab command, display the output for the variable educate.

d. Generate a new variable mincigrec, where mincigrec is the score for time to the first

cigarette of the day (in minutes), such that

If firstcig = 1-5 then mincigrec = 1;

If firstcig = 6-30 then mincigrec = 2;

If firstcig = 31-60 then mincigrec = 3; and

If firstcig > 60 then mincigrec = 0

e. Generate another new variable avgcig where avgcig is the score for average

number of cigarettes smoked per day such that

If numcig = 1-10 then avgcig = 0;

If numcig = 11-20 then avgcig = 1;

If numcig = 21-30 then avgcig = 2; and

If numcig >30 then avgcig = 3

f. Generate the heaviness of smoking index (hsi), which is the sum of the score

variables mincigrec and avgcig created in 1d and 1e. Label the new variable and

display it using the tab command.

g. Save the dataset with a new name “tobsurvey.dta”.


Section 4: Working with commands, do-files, and results

4.1 Introduction

This section shows how to use do-files, which are simple text files that contain a series of Stata

commands to be executed one after the other. Do-files allow you to replicate your work,

something you should always ensure you can do. When you collaborate with colleagues, they

can use your do-file to follow exactly what you did. It is hard enough to remember

all the commands you run in a single session, and if there is a delay between work sessions, it is

practically impossible to remember them all.

4.2 How Stata commands are constructed

Stata has many commands, including the following:

list List value of variables

summarize Summary statistics

describe Describe data in memory or in file

codebook Describe data contents

tabulate Tabulate frequencies

generate Create or change contents of variables

egen Extensions to generate

correlate Correlations (covariances) of variables or coefficients

ttest Mean-comparison tests

regress Linear regression

alpha Compute interitem correlations (covariances) and Cronbach’s alpha

graph The graph command

What is a command? What is a do-file?

A command instructs Stata to do something, such as construct a graph, a frequency

tabulation, or a table of correlations. A do-file is a collection of commands. It tells Stata

what to “do”. A simple do-file might open a dataset, summarize the variables, create a

codebook, and then do a frequency tabulation of the categorical variables. A do-file can

include all the commands you use to label variables and values; it can also recode variables,

average variables, and define how missing values are treated.

Stata has a remarkably simple command structure. Stata commands are all lower-case.

Virtually all Stata commands take the following form:

command varlist if/in, options

The command is the name of the command, such as summarize, generate, or tabulate.

The varlist is the list of variables used in the command. For many commands, listing no

variables means that the command will be executed on all variables. If we said summarize,

Stata would summarize all the variables in the dataset. If we said summarize age

education, Stata would summarize just age and education variables. The variable list

could include one variable or many variables.

After the variable list come the if and in qualifiers regarding what will be included in the

particular analysis. Suppose that we have a variable called gender. A code of 1 means that

the participant is a male, and a code of 2 means that the participant is female. We want

summary statistics for age and restrict the analysis to males. To restrict the analysis, we

would say summarize age if gender==1. Here we use two equal signs, which is the Stata

equivalent to the verb “is”. So the command means “Summarize age if gender is coded with

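a value of 1”. Why the two equal signs? A single equal sign assigns a value: the statement

gender=1 literally means that the variable called gender is set to the constant value 1. That is

not what we want here, because males are coded as 1 and females as 2 on this variable; the

double equal sign only tests whether the value is 1.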

Sometimes we want to run a command on a subset of observations, and so we use the in

qualifier. For example, we might have the command summarize age education in

1/200, which would summarize age and education in the first 200 observations.

Each command has a set of options that control what is done and how the results are

presented. The options vary from command to command. One option for the summarize

command is to obtain detailed results, summarizing the variables in more ways. If we wanted

to do a detailed summary of scores on age and years of education for adult males, the

command would be

. summarize age education if gender==1 & age>17, detail

The command structure is fairly simple but absolutely rigid.

This example used the ampersand (&), not the word “and”. If we had entered the word “and”,

we would have received an error message. Here are more examples with if qualifiers:

. summarize age education if gender==2
. summarize age education if gender==1 & age>64
. summarize age gender if gender==2 & age>64 & education==12

When you have missing values stored as . or .a, .b, etc., you need to be careful about using

the if qualifier. Stata stores missing values internally as huge numbers that are bigger than

any value in your dataset. If you had missing data coded as . or .a and entered the command

summarize age if age>64, you would include people who had missing values. The correct

format would be

. summarize age if age>64 & age<.

The <. qualifier at the end of the command is strange to read (less than dot) but necessary.

Table 4.2.1 shows the relational operators available in Stata.

Table 4.2.1 Relational operators used by Stata

Symbol     Meaning
==         is or is equal to
!= or ~=   is not or is not equal to
>          is greater than
>=         is greater than or equal to
<          is less than
<=         is less than or equal to
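The caution about missing values and the > operator is easiest to see with count, which simply reports how many observations satisfy a condition. The sketch below assumes a dataset with a numeric variable named age is in memory; the missing() function shown in the last line is an alternative way of writing the same restriction.

```stata
* Counts everyone over 64 and also everyone whose age is missing
count if age>64

* Counts only observations with a recorded age over 64
count if age>64 & age<.

* Same restriction written with the missing() function
count if age>64 & !missing(age)
```

If the first count is larger than the other two, the difference is the number of observations with a missing age.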

Here are a few Stata commands and the results they produce. Stata must first read

the dataset. Let us use the dataset firstsurvey that you created in Section 2. You can enter

these commands in the Command window to follow along.

. use "C:\Stata\firstsurvey.dta", clear
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |        20        10.5     5.91608          1         20
      gender |        20         1.5    .5129892          1          2
         age |        20        28.8    10.57106         17         54
   education |        20       11.75    3.274704          8         19
   smoked100 |        20         .75    .4442617          0          1
-------------+--------------------------------------------------------
  initiation |        16     11.4375    1.965324          9         16
 smokestatus |        16      1.4375    .7274384          1          3
      numcig |        11    11.72727    4.540725          3         20
    firstcig |        11    74.54545    77.40977          5        240
   peersmoke |        11    1.909091    .9438798          0          3
-------------+--------------------------------------------------------

This summarize command does not include a variable list, so Stata will summarize all

variables in the dataset. It has no if/in restrictions and no options, so Stata summarizes all

the variables, giving us the number of observations with no missing values, the mean, the

standard deviation, and the minimum and maximum values. The statistics for the id variable

are not useful, but it is easier to get these results for all variables than it is to list all the

variables in a variable list, dropping only id.
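If you do want to leave out id without typing every other name, a variable range is one option. This is a small sketch that assumes the variables are stored in the order of the coding sheet (id first, peersmoke last), as they appear in the Variables window:

```stata
use "C:\Stata\firstsurvey.dta", clear

* gender-peersmoke means every variable from gender through peersmoke,
* in dataset order, which leaves out id
summarize gender-peersmoke
```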


We can add the detail option to our command to give more detailed information. Do this

for just one variable.

. summarize initiation, detail

                      Age at initiation
-------------------------------------------------------------
      Percentiles      Smallest
 1%            9              9
 5%            9              9
10%            9             10       Obs                  16
25%           10             10       Sum of Wgt.          16

50%           11                      Mean            11.4375
                        Largest       Std. Dev.      1.965324
75%           12             12
90%           15             13       Variance         3.8625
95%           16             15       Skewness       .9398362
99%           16             16       Kurtosis       3.281962

As expected, this method gives us more information. The 50% value is the median age at

initiation, which is 11 years. We also get the values corresponding to other percentiles, the

variance, a measure of skewness, and a measure of kurtosis.
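The statistics printed by summarize can also be retrieved after the command runs. The sketch below shows one way to do this; it assumes auto.dta, an example dataset installed with Stata, in place of firstsurvey.dta, and the variable mpg belongs to that dataset.

```stata
* Illustration only: auto.dta stands in for the tutorial's dataset
sysuse auto, clear
summarize mpg, detail

* summarize leaves its results in r(); with the detail option the
* percentiles, skewness, and kurtosis are stored as well
display "Median   = " r(p50)
display "Mean     = " r(mean)
display "Skewness = " r(skewness)

* List everything that the last command stored
return list
```

Stored results are replaced by the next command that stores results of its own, so use them right after summarize.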

4.3 Creating and Saving a do-file

Stata has a simple text editor called a Do-file Editor in which you can enter a series of

commands. You can run all the commands in this file or just some of them. You can then

edit, save, and open the commands in a do-file at a later date. Saving these do-files means not

only that you can replicate what you did and make any needed adjustments but also that you

will develop templates you can draw on when you want to do a similar analysis. To open the

Do-file Editor window, select Window>Do-file Editor>New Do-file Editor. You can also

open the Do-file Editor by clicking on the toolbar icon that looks like a notepad; see figure

4.3.1 for the icon in Stata. Figure 4.3.2a shows the Do-file Editor window.

Figure 4.3.1 The Do-file Editor icon on the Stata Menu


Figure 4.3.2a The Do-file Editor

Figure 4.3.2b Icons on the Do-file Editor Toolbar

When you click on another window, the Do-file Editor can be hidden by it. You can bring the

Do-file Editor to the front again by clicking on it in the system toolbar or by using the

Alt+Tab key combination to move through the open windows until the Do-file Editor is

highlighted. You can avoid having the window disappear by arranging your desktop so that

the other windows do not overlap with the Do-file Editor window.

The icons on the Do-file Editor toolbar shown in figure 4.3.2b are fairly standard. Newer versions of Stata have additional icons in the toolbar. The first four icons let you open a new do-file, open an existing do-file, save your file, and print the file. The fifth icon lets you search, although most people prefer to use Ctrl+f. The next three icons cut, copy, and paste, although most people prefer to use Ctrl+x, Ctrl+c, and Ctrl+v, respectively. The next two icons provide undo and redo functions. The third from last icon shows the content of the file in a Viewer window. The second to last icon runs the do-file quietly without sending results to the Results window. Finally, the last icon runs the do-file and shows the output for each command in the Results window.

You can either run your entire do-file or run only selected lines. You select lines in the same way you would in a word processor, by highlighting them. If you want to select a section of several lines, you do not need to highlight each line completely; highlighting part of each line is enough. Here is an example where we want to run two commands, describe and summarize. In figure 4.3.3, we have selected only part of the describe and summarize commands.

Figure 4.3.3 Highlighting in the Do-file Editor

[Note: In preparing this tutorial, we have removed the line numbers being displayed later in

the do-file editor.]

We recommend that you include the name of the file as a comment in the first line of the do-file. Placing an asterisk (*) at the beginning of a line marks the text as a comment that Stata prints but does not interpret. We include this line so that the output shows the name of the do-file that created it, and we know where to find the file if we need to change something later on. We would also put the title of the project, the date when the do-file was created, and the duration of the project as comments at the top of the do-file, before the Stata commands. Let’s type *my_first.do at the top of our blank do-file.

Another way of adding comments, especially long comments, to a file is by typing /* just

before the comment and */ right after the comment. Anything between the /* and */ will be

treated as a comment. Figure 4.3.4 shows what we will use in our example do-file. This type

of long comment is helpful for organizing a long do-file into sections with a new comment

explaining the purpose of each new section.

Notice that the comments appear in the Do-file Editor in a green font. This makes them easy

to find in a long do-file. Stata commands appear in a bold blue font. However you can

customize the font color and type.

Next we need to open the dataset. In our do-file we type

. use c:/Stata/firstsurvey.dta, clear

The next two commands we need to type are describe and summarize, which will describe

the dataset and then perform a summary of the variables, giving us the number of

observations, mean, standard deviation, minimum value, and maximum value. Unless there


are too many variables in your dataset, these are good commands to include at the beginning

of your do-file.

In general, one line equals one command in a do-file. This is a great way to distinguish one

command from another as long as each command is short enough to fit on one line. What

happens when you have a long command that extends for more than one line? Stata needs a

way to know that when you press the Return key, you are not really done with the command.

The solution is to put /// at the end of a line. This tells Stata that the next line is a

continuation of the previous one. We will illustrate the use of /// in the graph pie commands

shown in figure 4.3.4. (We will cover these commands in the next section. For now, just

enter them into the Do-file Editor.)
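To make the comment styles and the line continuation concrete, here is a minimal sketch of what a small do-file could look like. It is not a copy of my_first.do in Figure 4.3.4: the file name my_sketch.do is hypothetical, and the sketch assumes auto.dta, an example dataset installed with Stata, so that it runs without firstsurvey.dta.

```stata
* my_sketch.do  (hypothetical file name, for illustration only)

/*
   Purpose: show both comment styles and the line-continuation marker.
   Data:    auto.dta, an example dataset installed with Stata
            (an assumption; the tutorial itself uses firstsurvey.dta).
*/

sysuse auto, clear

* Describe the dataset, then summarize every variable
describe
summarize

* One long command split across three lines
graph pie, over(rep78)              ///
    title(Repair record of cars)    ///
    note(auto.dta) plabel(_all name)
```

The three-slash continuation works in do-files; it is not accepted when you type a command directly in the Command window, so a long command typed there must stay on one line.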

Saving a do-file: You must remember to save your do-file. You do this much like you would

save a file in any other program; that is, click on File>Save As…. Then you can type the

name of the file; in our example, we are using my_first. You can also browse to find the

project folder where you want to save the file. For this class we are saving all the related files

in the c:/Stata/ folder.

Figure 4.3.4 shows our do-file, my_first.do, as it appears in the Do-file Editor. Notice that

when we saved the do-file, the filename now appears at the top of the Editor.

Run the do-file by clicking on the icon at the far right of the Do-File Editor toolbar.

FIGURE 4.3.4. Commands in the Do-file Editor window


4.4 Copying your results to a word processor

In some lab assignments, you will have to show your output. To save your results, highlight the text you want in the Results window, right-click on the highlighted text, and select Copy Text from the menu (or press Ctrl+c after you have highlighted your text). You can then paste the text into your word processor. It is a good idea to include the commands that appear in the Results window, because these give you a record of what you did. Except when there is no data manipulation, commands like these are no substitute for a do-file that includes everything you did in preparing the data.

When you copy results from Stata’s Results window to your word processor, the output may not look right; for example, the columns may not line up properly. The simplest solution to this alignment problem is to change the font and probably the font size, depending on your margins. We usually use the Courier or Courier New font at 9 point. Sometimes you might also have to adjust the margins in the word processor.

Saving tabular output

For tabular output (say, the result of a summarize command), you may want to select just that portion of the text that appears as a table, right-click on it, and select Copy Table, Copy Table as HTML, or Copy as Picture from the menu. You can then paste it into your word processor. The Copy Table as HTML option pastes it as a table like one you would see on a web page. Copy as Picture works nicely as long as you do not need to format the output to a particular style.

4.5 Quick Review

How Stata commands are constructed

Creating and Saving a do-file

Copying your results to a word processor

4.6 Exercises

1. The exercises for this section are incorporated into the exercises of Section 5.

Section 5: Descriptive Statistics and Graphs

5.1 Introduction

This section shows how to produce descriptive statistics and graphs using Stata. These

techniques are commonly used to explore the data before making decisions about further

analyses. Descriptive statistics are statistical procedures used to summarize, organize, and

simplify data.

Deciding which statistics to use to describe a variable largely depends upon the level of

measurement of the variable: nominal, ordinal, and interval/ratio levels of measurements.

Levels of Measurement

Nominal: At the nominal level of measurement, numbers are assigned to a set of categories

for the purpose of naming, labeling, or classifying the observations; however, the categories

have no specific ordering. Gender is an example of a nominal level variable. Using the

numbers 1 and 2, for instance, we can classify our observations into the categories ‘female’

and ‘male’, with 1 representing female and 2 representing male. The numbers 1 and 2 are

used to represent the different categories; we do not imply anything about the magnitude or

quantitative difference between the categories. Other examples of nominal level variables are

ethnicity, nationality, and race.

Ordinal: Ordinal level variables assign numbers to rank-ordered categories ranging from

lowest to highest. The classic ordinal level measure is a Likert scale that has categories

strongly disagree, disagree, neither agree nor disagree, agree, and strongly agree. We can

say that a person in the category ‘strongly agree’ agrees with the statement more than a

person in the ‘agree’ category, but we do not know the magnitude of the differences between

the categories. Another example of an ordinal variable is age groups with categories young,

middle-age, and old or 16-40, 41-64, and 65 and over.

Interval/ratio (or Quantitative): For interval/ratio level of measurement the categories

(values) of a variable can be rank-ordered and the differences between these categories

(values) are constant. Examples of variables measured at the interval/ratio level are age,

income, height, and weight. With these variables we can say which value is greater or smaller

and by how much.

Discrete or continuous: Nominal and ordinal measures are categorical variables and are always discrete. Interval/ratio measures can be either discrete or continuous. For example, number of children is interval level but discrete because we count 0, 1, 2, 3, … children and not 2.5 children. Height is interval level but continuous because it can take values such as 66 inches, 66.2 inches, 66.7 inches, ….

5.2 Where is the center of a distribution?

Descriptive statistics are used to describe a distribution. Three measures of central tendency

describe the center of a distribution: mode, median, and mean. These are commonly called

averages. Refer to Table 5.2.1 as a general guideline to make a decision about which measure

of central tendency is appropriate for each level of measurement.

Table 5.2.1

Level of measurement                                   Mode   Median   Mean
Categorical, no order (nominal, e.g., gender)          Yes    No       No
Categorical, ordered (ordinal, e.g., social support)   Yes    Yes      Yes*
Quantitative (interval/ratio, e.g., age)               Yes    Yes      Yes

*Many researchers use the mean when there are several ordered categories.

Averages

Mode: The value or category that occurs most often in a distribution. It can be applied to unordered categorical (nominal) variables such as gender, marital status, or race/ethnicity.

Median: The positional average that divides a distribution into two halves. Half the observations have a higher value and half have a lower value than the median. It can be applied to categories that are ordered (e.g., religiosity, job satisfaction, and level of agreement) or to quantitative variables (e.g., age, education, and income).

Mean: Commonly known as the arithmetic average, it is computed by adding all the scores in the distribution and dividing by the number of scores. While very high or low values affect the magnitude of the mean, they have little impact on the median. Although Warren Buffett would scarcely change the median income of a community, his moving to a small town would raise the mean a lot.

5.3 How dispersed is the distribution?

Besides describing the central tendency or average in a distribution, descriptive statistics describe the variability or dispersion of observations. Are they concentrated in the middle? Do they trail off in one direction? Are they widely dispersed?

When there are only a few values or categories, we can use a frequency distribution (tabulation) to describe the variable. This distribution shows each value or category of a variable and tells how many observations have that value or fall into that category. We can also use graphs to describe the dispersion of a distribution. When there are only a few categories being shown, the most common graphs to use are pie charts and bar charts.

For a quantitative variable, we will usually want one number to represent the dispersion. The standard deviation (SD) is used, especially with variables that have many possible values. The SD is a measure of variability that tells us how spread out the observations are around the mean of the distribution. The smaller the SD, the closer the observations tend to be to the mean (Figure 5.3.1). As the SD increases, the distribution becomes more dispersed (Figure 5.3.2). For a normal distribution, about 68.3% of the distribution lies within ±1 SD of the mean, 95.4% lies within ±2 SD, and 99.7% lies within ±3 SD.
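The shares of a normal distribution that fall within one, two, and three SDs of the mean can be checked directly in Stata. The sketch below is only a verification aid; it uses normal(), Stata's cumulative standard normal distribution function, and needs no dataset.

```stata
* Share of a normal distribution within k standard deviations of the mean:
* area under the standard normal curve between -k and +k
display "Within 1 SD: " %5.1f (100*(normal(1) - normal(-1))) "%"
display "Within 2 SD: " %5.1f (100*(normal(2) - normal(-2))) "%"
display "Within 3 SD: " %5.1f (100*(normal(3) - normal(-3))) "%"
```

Because the calculation uses the standard normal curve, the same shares apply to any normal distribution, whatever its mean and SD.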

A graph that can show the spread of an ordinal or a quantitative variable is called a histogram.

Figure 5.3.1 Distribution with SD=1

Figure 5.3.2 Distribution with SD=2

Figure 5.3.3 shows a distribution whose tail is towards the left; such a distribution is called skewed to the left. In this type of distribution the mean will be less than the median.

Figure 5.3.4 shows a distribution whose tail is towards the right; such a distribution is called skewed to the right. In this type of distribution the mean will be greater than the median.

5.4 Statistics and graphs – Unordered categories

About all we can do to summarize a categorical variable that is unordered is to report the

mode and show a frequency distribution or a graph (pie chart or bar chart). In this section we

will use the dataset firstsurvey.dta that was labeled in our previous exercise. This dataset

has categorical (unordered and ordered) and continuous variables. gender, smoked100,

and smokestatus are unordered (nominal) categorical variables.

. use c:/Stata/firstsurvey.dta

Now to get one-way tables of frequency distributions for a variable or a list of variables we

use tabulate or tab1 commands, respectively.

. tabulate gender

  Gender of |
respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |         10       50.00       50.00
     Female |         10       50.00      100.00
------------+-----------------------------------
      Total |         20      100.00

or

. tab1 gender smoked100 smokestatus

-> tabulation of gender

  Gender of |
respondents |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |         10       50.00       50.00
     Female |         10       50.00      100.00
------------+-----------------------------------
      Total |         20      100.00

-> tabulation of smoked100

Smoked at least |
 100 cigarettes |
        in life |      Freq.     Percent        Cum.
----------------+-----------------------------------
             No |          4       20.00       20.00
            Yes |         16       80.00      100.00
----------------+-----------------------------------
          Total |         20      100.00

-> tabulation of smokestatus

    Current |
    smoking |
     status |      Freq.     Percent        Cum.
------------+-----------------------------------
   Everyday |         11       68.75       68.75
   Somedays |          3       18.75       87.50
 Not at all |          2       12.50      100.00
------------+-----------------------------------
      Total |         16      100.00

The tabulation for gender tells us that 50% of the sample of 20 respondents were males and the remaining 50% were females. Note that, used this way, the tabulate command produces a frequency distribution for only one variable, whereas tab1 generates a separate frequency distribution table for each of the listed variables (here gender, smoked100, and smokestatus). gender has no single mode because both categories are equally frequent. For smoked100, 80% of the respondents smoked at least 100 cigarettes. This is a clear mode as it is so much more frequent than the other category. The mode for smokestatus is everyday smoker as it is the most frequently occurring (68.75%) category.

Please note that the total number of observations for the variable smokestatus is 16 although our sample size is 20. This is because not all respondents in this survey smoked at least 100 cigarettes in life. Therefore, there are missing observations for the variable smokestatus, and the tabulate command yields a frequency distribution for the nonmissing observations only. However, if you want a frequency distribution for the whole sample (including missing observations, represented by .), use the following command:

. tabulate smokestatus, miss

    Current |
    smoking |
     status |      Freq.     Percent        Cum.
------------+-----------------------------------
   Everyday |         11       55.00       55.00
   Somedays |          3       15.00       70.00
 Not at all |          2       10.00       80.00
          . |          4       20.00      100.00
------------+-----------------------------------
      Total |         20      100.00
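The effect of the missing option can be tried on any variable that has missing values. The sketch below assumes auto.dta, an example dataset installed with Stata, whose variable rep78 has a few missing values; it stands in for smokestatus in firstsurvey.dta.

```stata
* Illustration only: rep78 in auto.dta has some missing values
sysuse auto, clear

* Default: observations with missing rep78 are left out of the table
tabulate rep78

* With the missing option they appear as a separate row (shown as .)
tabulate rep78, missing
```

Comparing the Total rows of the two tables shows how many observations are missing, and the percentages change because the denominator changes.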


In Section 4, we wrote a pie chart command in the do-file. Here we will create the pie chart. Open your my_first.do file by selecting File>Open. Browse to your folder, select the file my_first.do, and click Open. In the my_first.do file, select

graph pie, over(peersmoke) title(Number of friends who smoke) ///
    note(firstsurvey.dta) plabel(_all name)

Then run the selected command by clicking on the icon “Do Selected Line” at the far right of the Do-file Editor toolbar. This will provide a visual display of the distribution of peersmoke in our first survey. The size of each piece of the pie is proportional to the percentage of the people in that category. This pie chart shows that the most common category, with about 45% of the respondents who answered, is 6-10 friends who smoke cigarettes.

Figure 5.4.1 Pie chart of number of friends who smoke

[Pie chart titled “Number of friends who smoke”; slices: None, 1-5 friends, 6-10 friends, >10 friends; note: firstsurvey.dta]

Obtaining both numbers and value labels

Before doing the tabulations, you might want to type the command numlabel _all,

add. After you enter this command, whenever you type the tabulate command, Stata

reports both the numbers you use for coding the data (1, 2, 3,..) and the value labels

(male, female). Later, if you do not want to include both of these, you can drop the

numerical values by using the command numlabel _all, remove. Practice using

these commands as an exercise on your own.
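As a starting point for that practice, here is a minimal sketch of the two numlabel commands in use. It assumes auto.dta, an example dataset installed with Stata, whose variable foreign carries the value labels Domestic and Foreign; these stand in for the labels in firstsurvey.dta.

```stata
* Illustration only: foreign in auto.dta is coded 0/1 with value labels
sysuse auto, clear

* Add the numeric code in front of every value label
numlabel _all, add
tabulate foreign

* Remove the numeric codes again
numlabel _all, remove
tabulate foreign
```

The first table shows each category with its code in front of the label; the second shows the labels alone.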


It is possible to improve the default pie chart. The default pie chart is a bit hard to read

because it assumes you want each slice a different color, and this will not work well when

printing in black and white. Because many publications require black and white printing, we

should edit the pie chart.

We can open the Graph Editor by right-clicking on the pie chart and selecting Start Graph

Editor. This expands the window that has the graph and adds a panel on the side of the chart

with things we might want to change (Figure 5.4.2). On the left side is the pie chart we will

edit, and on the right are the names of the parts of the pie chart.

Figure 5.4.2 The Graph Editor

You may want to edit some or all of the labels. For example, let us change the label “None” to “0 friends”. Click on the plus sign by legend and the plus sign by key region. Double-click on label [1]; a “Textbox properties” window will pop up, where we change the Text from None to 0 friends. Click Apply and then OK. You can also do this by double-clicking on the label None under the pie chart and renaming it 0 friends. Notice that the change appears only in the legend; the label “None” on the pie slice itself remains unchanged. To change it, click on the plus sign by plotregion1, double-click on pielabel[1], and change the text from None to 0 friends. Select the color White in the Text Styles box. Click Apply and then OK. The font color inside the pie chart changes to white.

Next click on the plus (+) sign by plotregion1, and then double-click on pieslices[1]. Here we will pick Black as the Color and 100% as the Fill intensity. For pieslices[2], pick Black and 70%. For pieslices[3], pick Black and 50%. For pieslices[4], pick Black and 30%, check Explode slice, and make sure the distance is Medium. You can explode whichever slice you want to draw attention to. Try experimenting with other options. Figure 5.4.3 shows the edited pie chart.

Figure 5.4.3 Edited Pie Chart

We have only scratched the surface of what you can do with the powerful Graph Editor. For

example, you can click on the T (Add Text Tool) to the left of the graph and then click

somewhere on the figure. A dialog box opens so you can add text. Remember to click Apply

and then click OK. You could then click on the \ (slash or Add Line Tool) just below the T,

and draw a line from the text to the piece of pie it describes.

A bar chart is more attractive than a pie chart for many applications. Type the following command in your do-file and run it:

. histogram peersmoke, discrete percent gap(10) addlabel ///
      xtitle(Number of peer smokers) ///
      xlabel(, angle(forty_five) valuelabel) ///
      title(Number of friends who smoke)

You will get the output shown in Figure 5.4.4.

Figure 5.4.4 Bar chart of peer smokers

[Bar chart; y-axis: Percent (0 to 50); x-axis: Number of peer smokers; bar labels: None 9.091, 1-5 friends 18.18, 6-10 friends 45.45, >10 friends 27.27]

Preparing a graph becomes easier when you use the dialog box. Here we will create the previous pie chart and the bar chart using the dialog box. In the Stata window, select Graphics>Pie chart and look at the Main tab. If this dialog still has stored information from a previous chart, you should click on the Reset icon (Figure 5.4.5) to clear the dialog box.

Figure 5.4.5 Reset icon

Type peersmoke in the Category variable box. This specifies the categories we want to show as pieces of the pie. Under the Titles tab, enter the title “Number of friends who smoke” and, as a Note, the name of the dataset we used, firstsurvey.dta. Under the Options tab, click on Order by this variable and type peersmoke. Also check Exclude observations with missing values (casewise deletion) because we do not want these, if any, to appear as a separate piece of the pie. Under Slices, select Label properties (all) from the Label section. Select Name in the Label type box. Click Accept and then click OK. You will obtain a pie chart similar to the one in Figure 5.4.1.

Now let us obtain the previous bar chart using the dialog box. From the Graphics menu,

select Histogram. Here we are creating a bar chart rather than a histogram, but this is the best

way to produce a high-quality bar chart using Stata.

Bar Charts and Histograms

Bar Chart: A bar chart is a pictorial representation of a frequency distribution of either

nominal or ordinal data. The various categories into which the observations fall are

presented along a horizontal axis of a bar diagram. A vertical bar is drawn above each

category such that the height of the bar represents either the frequency or the relative

frequency of observations within that category. The bars should be of equal width and

separated from one another so as not to imply continuity.

Histogram: A histogram depicts a frequency distribution for discrete or continuous data. The

horizontal axis in a histogram displays the true limits of the various intervals. The true limits

of an interval are the points that separate it from the intervals on either side. The vertical axis

of a histogram depicts either the frequency or the relative frequency of observations within

each interval. The frequency associated with each interval in a histogram is represented by

the bar’s area and not its height. The area of the entire histogram sums up to 100%, or 1.


On the Main tab, type peersmoke in the Variable box. Click on the button next to Data are

discrete. In the section labeled Y axis, click on the button next to Percent. The trick to

making this histogram a bar chart is to click on Bar properties in the lower left of the Main

tab. This opens another dialog box where the default is to have no gap (bar gap=0) between

the bars. Change this to a gap of 10, which sets the gap between bars to 10 percent of the

width of a bar. Click Accept. If you switch to Titles tab, you can enter a title, Number of

friends who smoke. Next switch to the X axis tab and click on Major tick/label properties.

This opens another dialog box where you select the Labels tab and check the box for Use

value labels. Sometimes the value labels are too wide to fit under each bar. You may need to

create new value labels that are shorter. However, if they are just a little bit too wide, you can

change the angle. Click on Angle and select 45 degrees from the drop-down menu. Click on

Accept. Finally switch back to the Main tab. In the lower right corner of the dialog box, click

on Add height labels to bars. Because we are reporting percentages, this option will show the

percentage in each peersmoke category at the top of each bar. Click OK and you will get the bar chart

as shown in Figure 5.4.4.

Sometimes you may have a larger number of categories. When this happens, Stata’s default

will show only a limited number of value labels, so some of the bars will be unlabeled. If you

want to label all of them, you need to go to the X axis tab and click on Major tick/label

properties to open a dialog box we opened before. On the Rule tab, click on Suggest # of

ticks and enter the number of bars in the box by Ticks.

5.5 Statistics and graphs – Ordered categories and variables

When our categories are ordered, we can use the median to measure the central tendency.

When there are only a couple of categories, however, the median does not work well. Here is

an example where there are several categories for the education variable. We might want to

report the median or mean and the SD. Let us run commands to get a frequency distribution

and summary details of the variable education.

. tab1 education

-> tabulation of education

    Highest |
   level of |
educational |
 attainment |      Freq.     Percent        Cum.
------------+-----------------------------------
          8 |          4       20.00       20.00
          9 |          3       15.00       35.00
         10 |          2       10.00       45.00
         11 |          1        5.00       50.00
         12 |          2       10.00       60.00
         13 |          2       10.00       70.00
         14 |          1        5.00       75.00
         15 |          2       10.00       85.00
         16 |          2       10.00       95.00
         19 |          1        5.00      100.00
------------+-----------------------------------
      Total |         20      100.00

. summarize education, detail

          Highest level of educational attainment
-------------------------------------------------------------
      Percentiles      Smallest
 1%            8              8
 5%            8              8
10%            8              8       Obs                  20
25%            9              8       Sum of Wgt.          20

50%         11.5                      Mean              11.75
                        Largest       Std. Dev.      3.274704
75%         14.5             15
90%           16             16       Variance       10.72368
95%         17.5             16       Skewness       .5137804
99%           19             19       Kurtosis       2.240513

The frequency distribution produced by tab1 is probably the most useful way to describe the distribution of an ordered categorical variable. We can see that the distribution is skewed to the right, with higher frequencies at lower levels of educational attainment.

Although the tabulation gives us a good description of the distribution, we often will not have the space to report this level of detail. The median (50th percentile) is provided by the summarize command. The median for this distribution is 11.5, corresponding to less than 12 years of educational attainment. Even though these are ordered categories, many researchers would report the mean. The mean assumes that the quantitative values 8-20+ are interval-level measures. Though the mean (Mean = 11.75) is usually a good measure of central tendency, in this example the median is preferred because the distribution of educational attainment is skewed (Figure 5.5.1). If we are in doubt about which measure of central tendency to report, it may be a good idea to report both the median and the mean.


Here is how we can create a histogram showing the distribution of educational level.

. histogram education, discrete percent ///
      title(Highest Level of educational attainment) ///
      note(firstsurvey.dta) xtitle(Educational level) scheme(s1mono)

Figure 5.5.1. Histogram of educational level attained

Remember, you can try creating a histogram using the dialog box as described in section 5.4.

The histogram in Figure 5.5.1 allows the reader to quickly get a good sense of the distribution. The bars on the left are a bit taller than the bars on the right, which indicates that there is a disproportionate number of people with lower levels of education (Mean=11.75 years, Median=11.5 years, Mode=0-8 years of education, Standard deviation (SD)=3.27).

5.6 Statistics and graphs – Quantitative variables

We will study two variables: age and numcig (number of cigarettes per day). Histograms and box plots are common graphs used to display quantitative variables. We will usually use the mean or median to measure the central tendency for quantitative variables. The SD is the most widely used measure of dispersion, but a statistic called the interquartile range (the distance between the 25th and 75th percentiles) is used by the box plots presented in this section.
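The interquartile range can be computed from the percentiles that summarize stores, and a box plot displays it as the height of the box. The sketch below assumes auto.dta, an example dataset installed with Stata, with mpg and foreign standing in for the tutorial's age and gender.

```stata
* Illustration only: auto.dta stands in for firstsurvey.dta
sysuse auto, clear
summarize mpg, detail

* Interquartile range = 75th percentile minus 25th percentile
display "IQR = " (r(p75) - r(p25))

* Box plot of the same variable, drawn separately for each group
graph box mpg, over(foreign)
```

In the box plot, the line inside each box marks the median, and the bottom and top of the box mark the 25th and 75th percentiles.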

Let’s start with age, the age of the respondents. Computing descriptive statistics for quantitative variables is easy. For example:


. summarize age, detail

                     Age of respondents
-------------------------------------------------------------
      Percentiles      Smallest
 1%           17             17
 5%           18             19
10%           19             19       Obs                  20
25%         20.5             20       Sum of Wgt.          20

50%         25.5                      Mean               28.8
                        Largest       Std. Dev.      10.57106
75%         36.5             41
90%           45             43       Variance       111.7474
95%         50.5             47       Skewness       .9892077
99%           54             54       Kurtosis       2.848827

This output shows that the average age of the respondents is 28.8 years. The median is 25.5 years. The mean is greater than the median, which suggests that the distribution is positively skewed (trails off on the right side); the positive skewness statistic (0.99) points the same way. The SD is 10.57 years, which tells us that the age of the respondents varies widely.

Let us use some graphs to describe the distribution of age. First we will create a simple

histogram by using frequency instead of percentage. Use the following command:

. histogram age, frequency title(Age of the respondent) ///
      note(firstsurvey.dta) xtitle(Age in years) scheme(s1mono)

Figure 5.6.1. Histogram of age of the respondents

[Histogram titled “Age of the respondent”; y-axis: Frequency (0 to 15); x-axis: Age in years (10 to 50); note: firstsurvey.dta]

This simple command gives us a quick view of the distribution. This graph shows that it is

skewed to the right. Now let us see the distribution for males and females separately.

. histogram age, frequency title(Age of the respondent) ///
      note(firstsurvey.dta) xtitle(Age in years) scheme(s1mono) by(gender)

Figure 5.6.2. Histogram of age of the respondents by gender

The skewness still persists in the distribution of age by gender. The graphs show that, compared to females, there are more males in the younger age groups.

Now to get descriptive statistics for both male and female separately, we need a new

command:

. by gender, sort: summarize age -------------------------------------------------------------------------- -> gender = Male Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- age | 10 25.4 6.432556 19 38 ---------------------------------------------------------------------------> gender = Female Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- age | 10 32.2 12.99402 17 54

02

46

8

10 20 30 40 50 10 20 30 40 50firstsurvey.dta firstsurvey.dta

Male Female

Age of the respondent Age of the respondent

Fre

qu

en

cy

Age in yearsGraphs by Gender of respondents

54 | P a g e

This command sorts the data by gender and then summarizes the variable age for each
gender.

We can customize which descriptive statistics are reported by using the following
command:

. tabstat age, statistics(n mean median sd min max) ///
      by(gender) columns(statistics)

Summary for variables: age
     by categories of: gender (Gender of respondents)

gender |         N      mean       p50        sd       min       max
-------+------------------------------------------------------------
  Male |        10      25.4      23.5  6.432556        19        38
Female |        10      32.2      27.5  12.99402        17        54
-------+------------------------------------------------------------
 Total |        20      28.8      25.5  10.57106        17        54
--------------------------------------------------------------------

Note that this second command also reports the descriptive statistics for age in the total
sample, alongside those by gender. Stata refers to the median as p50, the 50th percentile. We
can rename p50 to median in a Word file after we copy the output as text or as a table (not as
a picture) from the output window and paste it into the Word file.

On average, females were older than males in the sample (mean 32.2 versus 25.4 years). Because
the mean is greater than the median in each group, we can infer that both distributions are
positively skewed (skewed to the right), as was evident from the histograms in Figure 5.6.2.
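The skewness of each group can also be requested directly instead of being inferred from the mean and median. The sketch below, in do-file form, assumes firstsurvey.dta is in the current working directory; the selection of statistics (adding iqr and skewness) is our own choice for illustration, not part of the original tutorial.

```stata
* Sketch only: assumes firstsurvey.dta is in the current working directory
use firstsurvey.dta, clear

* iqr and skewness are additional statistics that tabstat can report;
* a positive skewness value indicates a right-skewed distribution
tabstat age, statistics(n mean median iqr skewness) ///
    by(gender) columns(statistics)
```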

A horizontal (hbox) or vertical (box) box plot is an alternative way to represent the
distribution of a quantitative variable like age. The command for the horizontal box plot is:

. graph hbox age, over(gender) title(Age of the respondents) ///
      subtitle(by gender) note(firstsurvey.dta)

Figure 5.6.3. Horizontal box plot of age by gender

[Horizontal box plot: title "Age of the respondents"; subtitle "by gender"; categories: Female, Male; x-axis: Age of respondents (20 to 60); note: firstsurvey.dta]

We see that the distributions of age for both males and females are skewed to the right, as
both have longer whiskers towards the right. The length of the box (shaded region) in a
horizontal box plot is the interquartile range. The interquartile range for females is larger
than that for males, i.e. the ages of the females are more dispersed than those of the males.
The vertical line inside the box is the median: around 24 years for males and close to 28 years
for females, in line with the medians of 23.5 and 27.5 reported by tabstat above. We can also
see a small dot beyond the end of a whisker of the box plot for males. That observation is
regarded as an outlier, i.e. an observation with an extreme value.

Try generating a vertical box plot by using box in place of hbox in the above command.
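A minimal sketch of that substitution, written in do-file form (so there is no dot prompt) and assuming firstsurvey.dta is in the current working directory:

```stata
* Sketch only: assumes firstsurvey.dta is in the current working directory
use firstsurvey.dta, clear

* Same options as the hbox command; only hbox is replaced by box,
* so the boxes are drawn vertically and age is on the y-axis
graph box age, over(gender) title(Age of the respondents) ///
    subtitle(by gender) note(firstsurvey.dta)
```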

5.7 Cross-tabulation for two categorical variables

Cross-tabulation is a technical term for a table that has rows representing one categorical
variable and columns representing another. These tables are sometimes called contingency
tables because the category a person is in on one of the variables is contingent on the
category the person is in on the other variable. For example, whether people are current
smokers may be contingent on their gender. In this tutorial, when one variable depends on the
other, we put the dependent variable in the columns and the independent variable in the rows.

Let us start with a basic cross-tabulation of whether a person has smoked at least 100
cigarettes in life and their gender. Suppose you hypothesize that women are less prone to
health risk behaviors such as smoking. Then whether a person has smoked at least 100 cigarettes
in life depends upon gender, i.e. the variable smoked100 is the dependent variable and
gender is the independent variable in our firstsurvey.dta. Let us use the command:

. tabulate gender smoked100

 Gender of |  Smoked at least 100
respondent |  cigarettes in life
         s |        No        Yes |     Total
-----------+----------------------+----------
      Male |         3          7 |        10
    Female |         2          8 |        10
-----------+----------------------+----------
     Total |         5         15 |        20

Now we want to compute the percentages so that each row adds up to 100%.

. tabulate gender smoked100, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

 Gender of |  Smoked at least 100
respondent |  cigarettes in life
         s |        No        Yes |     Total
-----------+----------------------+----------
      Male |         3          7 |        10
           |     30.00      70.00 |    100.00
-----------+----------------------+----------
    Female |         2          8 |        10
           |     20.00      80.00 |    100.00
-----------+----------------------+----------
     Total |         5         15 |        20
           |     25.00      75.00 |    100.00

The frequencies at the top of each cell can be hard to compare because, in general, each row
and each column has a different number of observations (in this table the two rows happen to
have 10 observations each). One way to interpret a table is to use percentages, which take
into account the number of observations within each category of the independent variable
(predictor). The percentages appear just below the frequencies in each cell. Note that the
percentages add up to 100% for each row. Overall, 75% of the respondents said “yes”, they had
smoked at least 100 cigarettes in life, and 25% said “no”. However, females (80%) were
relatively more likely than males (70%) to report that they had smoked at least 100 cigarettes
in their life, regardless of the reason. Thus we say that 70% of the men compared with 80% of
the women smoked at least 100 cigarettes in their life (the opposite of what we expected). The
row option in the tabulate command produces the row percentages.

5.8 Chi-squared test

The difference between women and men seems small (10 percentage points), but is this
difference merely due to chance? Or is the difference statistically significant, so that gender
is actually associated with having ever smoked at least 100 cigarettes? The sample size matters
in reaching a firm conclusion.

We use the chi-squared (χ²) statistic to test how likely it is that our result occurred by
chance. If it is extremely unlikely to get this much difference between men and women in a
sample of this size by chance, we can be confident that there was a real difference between
women and men, but we still need to look at the percentages to decide whether the statistically
significant difference is substantial enough to be important.

The chi-squared test compares the frequency in each cell with what you would expect the
frequency to be by chance, if there were no relationship. The expected frequency for a cell
depends on how many people are in the row and how many are in the column. Because of the
small sample size, we have very few people in each cell.

In the cross-tabulation command, we add the extra options chi2 and expected:

. tabulate gender smoked100, chi2 expected row

The option chi2 produces Pearson’s chi-squared statistic with its p-value. The option expected
produces the frequencies that would be expected in each cell if there were no relationship
between the two variables. Note that we usually do not request the expected frequencies.

. tabulate gender smoked100, chi2 expected row

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
|   row percentage   |
+--------------------+

 Gender of |  Smoked at least 100
respondent |  cigarettes in life
         s |        No        Yes |     Total
-----------+----------------------+----------
      Male |         3          7 |        10
           |       2.5        7.5 |      10.0
           |     30.00      70.00 |    100.00
-----------+----------------------+----------
    Female |         2          8 |        10
           |       2.5        7.5 |      10.0
           |     20.00      80.00 |    100.00
-----------+----------------------+----------
     Total |         5         15 |        20
           |       5.0       15.0 |      20.0
           |     25.00      75.00 |    100.00

          Pearson chi2(1) =   0.2667   Pr = 0.606

At the bottom of the table, Stata reports Pearson chi2(1) = 0.2667 and Pr = 0.606,
which would be written as χ²(1, N = 20) = 0.2667, p = 0.606. Because p > 0.05, the result is
not significant. Here we have 1 degree of freedom. Note that whenever Stata reports p = 0.000,
we have to report it as p < 0.001. [As a general rule, a p-value less than 0.05 is taken to
indicate a statistically significant association between the variables.]
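As a check on the output above, the expected frequencies and the test statistic can be reproduced by hand. The symbols O (observed frequency), E (expected frequency), R_i (row total), C_j (column total) and N (sample size) are notation introduced here for illustration; all numbers are taken from the table.

```latex
\[
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
E_{\text{Male, No}} = \frac{10 \times 5}{20} = 2.5, \qquad
E_{\text{Male, Yes}} = \frac{10 \times 15}{20} = 7.5
\]
\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
       = \frac{(3 - 2.5)^2}{2.5} + \frac{(7 - 7.5)^2}{7.5}
       + \frac{(2 - 2.5)^2}{2.5} + \frac{(8 - 7.5)^2}{7.5}
       \approx 0.1 + 0.0333 + 0.1 + 0.0333 = 0.2667
\]
```

The expected frequencies for the Female row are the same as for the Male row because both rows have 10 respondents.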

To summarize, in this sample women were more likely than men to report that they had smoked at
least 100 cigarettes in their life for any reason. In the sample of 20 people, 80% of the
women said that they had smoked at least 100 cigarettes in their life compared with 70% of the
men. However, this relationship between gender and having smoked at least 100 cigarettes in
life was not statistically significant (p = 0.606), so a difference of this size could easily
have arisen by chance.
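Because the expected frequency in each “No” cell is only 2.5, the chi-squared result should be read with some caution. A common rule of thumb (not stated in this tutorial) asks for expected frequencies of at least 5, and Fisher’s exact test is often reported alongside the chi-squared test in small tables. The sketch below, in do-file form, assumes firstsurvey.dta is in the current working directory. It uses the exact option of tabulate and reads the results that tabulate leaves in r().

```stata
* Sketch only: assumes firstsurvey.dta is in the current working directory
use firstsurvey.dta, clear

* chi2 requests Pearson's chi-squared test; exact adds Fisher's exact test
quietly tabulate gender smoked100, chi2 exact

* Results stored by tabulate after the command has run
display "Pearson chi2     = " r(chi2)
display "Pearson p-value  = " r(p)
display "Fisher's exact p = " r(p_exact)
```

Dropping quietly shows the table together with both test results.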

Degrees of freedom

The degrees of freedom are equal to (R - 1)*(C - 1), where R = number of rows and C = number
of columns in an R by C table. In the table of gender by smoked100 above, the independent and
dependent variables each have 2 possible outcomes, so it forms a 2 by 2 table. Its degrees of
freedom are therefore (2 - 1)*(2 - 1) = 1.

5.9 Quick Review

• Some basic statistical terms
• One-way frequency distribution
• Summary statistics
• Working with graphs and graph editor
    o Pie chart
    o Histogram
    o Box plot
• Cross-tabulation of two categorical variables
• Chi-squared test

5.10 Exercises

1. Students are required to submit the output of this exercise. The output includes i) a Word
file with the requested graphs and charts and ii) a do-file for the whole exercise.

A. Open a new do-file. Give the do-file an appropriate title as a comment. You can also
include a brief description of the do-file under the title. Type the commands that are
necessary for each of the following requests in the do-file.

   a. Open mynhanes.dta.
   b. Use the variable ridreth1 and label it as “Race/ethnicity of respondents”.
      Assign and attach value labels to the codes of ridreth1. Refer to the CDC/NHANES
      website provided in Section 3 of this tutorial to define the values/codes of the
      variable ridreth1.
   c. Generate a frequency distribution table for the variable ridreth1.
   d. Create a pie chart for ridreth1.

   Run your do-file. Copy the pie chart and paste it into your Word file. In the Word
   file, use the pie chart to identify the race/ethnicity category with the largest share
   of the sample.

B. Continue working on your do-file. Now type a command to show graphically the
distribution of weight in kg (bmxwt) by gender. Save your graph in the same Word file.
Briefly explain the graph.

C. What are the mean and median weights for men and women? Type the command in the
do-file to generate these statistics.

D. Generate a variable for depression based on the severity scale of depression. Name the
new variable deprs such that deprs=0 if depsev=0 and deprs=1 if depsev is 1 to 5.
Label the new variable and its values.

E. Conduct a chi-squared test to see if depression (deprs) is associated with having smoked
at least 100 cigarettes in life (smq020). Copy the resulting table and paste it into the
Word file. Summarize your results.

(Note: If you have not run the commands for B through E, select the commands
associated with these questions and run them.)

Save your do-file and the Word file and send them to your TA’s email:
[email protected].

Bibliography

Acock AC. A Gentle Introduction to Stata. 3rd ed. College Station, TX: Stata Press; 2010.

Kohler U, Kreuter F. Data Analysis Using Stata. 2nd ed. College Station, TX: Stata Press; 2009.

Pagano M, Gauvreau K. Principles of Biostatistics. 2nd ed. USA: Duxbury Press; 2000.

Pevalin D, Robson K. The Stata Survival Manual. 1st ed. New York, NY: Open University Press; 2009.
