7/7/16
Lecture 2: Programming Statistics in Stata
Christopher S. Hollenbeak, PhD
Jane R. Schubart, PhD
The Outcomes Research Toolbox
Review
• Questions from last lecture?
  – What is probability?
    • What is a probability distribution?
  – Types of data: continuous, categorical, binary
    • Examples of each
  – What is a dependent variable? Independent variable?
  – What is the appropriate statistical test for…
    • Comparing two means? Three means? Two proportions? Three proportions?
  – What does a p-value tell you?
Objectives
• Introduce you to Stata Software and get you started summarizing your data
• Give you basic code to start working with data
• Give you code to create new variables
• Use Stata to compute descriptive statistics
• Use Stata to perform univariate statistical tests (t test, chi-square test, ANOVA)
• Use Stata to create basic graphs
Overview
• Stata is software for performing data analysis
• Stata interface
• Common tasks in Stata
  – Import data
  – Create new variables
  – Summarize data (means, standard deviations, histograms)
  – Perform statistical tests
  – Graphics
Objective
• Most studies start with “Table 1”, which includes
  – Summary statistics for the sample
    • Sample size
    • Demographics
    • Disease characteristics
    • Treatment characteristics
  – Stratification by key outcome
  – Statistical tests comparing characteristics by stratification
• Objective is to use Stata to create Table 1
Stata Interface
[Screenshot: Stata interface showing the Command History, Variables, Results, and Commands windows]
Stata Interface
• There are two ways to interact with Stata
  – Issue commands directly in the command window
  – Write commands in a text file (called a “do” file in Stata parlance) and send commands to the results window
• Always use “do” files
  – Creates permanent record of your work
  – Can easily re-use large chunks of code
• Occasionally use the command window
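A minimal sketch of what a do-file might look like, assuming a hypothetical file name (lecture2.log) and using the auto data set that ships with Stata in place of your own data. The log file is one way to keep the permanent record mentioned above.

```stata
* lecture2.do : skeleton do-file (file names here are placeholders)

* Close any log left open by an earlier run, then start a fresh one
capture log close
log using "lecture2.log", replace text

* Built-in example data; it stands in for your own data set
sysuse auto, clear
summarize price mpg

log close
```

Running the whole file again reproduces every result from scratch, which is the main argument for do-files over the command window.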
Stata Workflow
• Import data into Stata
• Create new variables for analysis
• Deal with missing values
• Perform analyses
  – Tables, graphs
• Move tabular results into Excel for formatting
• Save graphs as graphic files
Stata Commands
• Stata syntax is usually a command, followed by the names of the variables it applies to, then any restrictions on observations (if), and then a comma followed by other options
command varlist if var==x, options
• To execute a command from the do-file editor, highlight the entire line of code and press:
  – Windows: Ctrl+D
  – Mac: Shift+Cmd+D
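As an illustration of the command / varlist / if / options pattern, here is a short sketch. It uses the auto data set that ships with Stata, so the variable names (price, mpg, foreign) are not from the lecture data.

```stata
* Load a data set that ships with Stata (stands in for your own data)
sysuse auto, clear

* command   varlist     if restriction    , options
  summarize price mpg   if foreign == 1   , detail
```

The extra spacing is only there to line the pieces up with the comment; Stata ignores it.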
Importing Data
• Stata can handle several types of raw data files
• Comma separated value (.csv) text files seem to work best
• Can save Excel files as .csv files
• Start by pointing Stata to the folder where your data file is stored using cd command (cd means change directory)
Importing Data
• Command for importing data is: insheet
Mac/Unix: cd "~/projects/ltd/"
Windows: cd "c:\projects\ltd\"
insheet using "ltd_data.csv"
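In Stata 13 and later, insheet still runs but has been superseded by import delimited. The sketch below assumes Stata 13 or later and first writes a small demo file (auto_demo.csv, a made-up name) so that it can be run without the lecture data.

```stata
* Create a small .csv file to import (demo only; you would already have one)
sysuse auto, clear
export delimited using "auto_demo.csv", replace

* Read the .csv back in; clear drops whatever data are currently in memory
import delimited using "auto_demo.csv", clear
describe
```

For the lecture data, the equivalent call would presumably be import delimited using "ltd_data.csv", clear.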
Stata Interface
Create New Variables
• Command to create new variables is generate
• Use in conjunction with replace
  – Generate creates a new variable and sets all to 0
  – Replace then sets the 1s
• Use this to create binary dummy variables
• For example, if sex is coded “M” and “F”, we need to create a male dummy and a female dummy
generate male=0
replace male=1 if sex=="M"
generate female=1-male
Example Data
Binary Data
• Binary data should be coded as “dummy variables”
• Zeros and ones ONLY
Dummy Variables
• We mentioned in the last lecture that we always use 0 and 1 to represent binary variables
• These are called “dummy variables”
  – Sometimes “binary indicators” for formality
• Use the 1 to indicate the presence of the characteristic in the variable name
• For example
  – A “male” dummy variable would equal 1 for men and 0 for women
  – A “died” dummy variable would equal 1 for patients who died and 0 for patients who did not
Dummy Variables
• For example, if sex is coded “Male” and “Female”, we need to create a male dummy and a female dummy
generate male=0
replace male=1 if sex=="Male"
generate female=1-male
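A possible shortcut, sketched on a four-row toy data set typed in with input (the data are made up for illustration). A logical expression in Stata evaluates to 1 when true and 0 when false, so it can build the dummy in one step.

```stata
clear
input str6 sex
"Male"
"Female"
"Female"
""
end

* (sex == "Male") is 1 where true and 0 where false
generate male = (sex == "Male")

* Caution: the blank row is coded male = 0 above; set it to missing instead
replace male = . if sex == ""

generate female = 1 - male
list
```

The generate/replace pair on the slide has the same blank-value issue, so the replace male = . line is worth considering with either approach.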
Create New Variables
• Notice that Stata differentiates between:
  – Equals as assignment (male = 1)
  – Equals as logical (if sex == "Male")
• This is the most common error you will make (besides misspelling)
  – Get used to looking for this
Categorical Variables
• Categorical data should be coded as dummy variables
• One dummy variable for each category
Continuous Variables
• Sometimes we make categorical variables out of continuous variables
• Select cutpoints based on quartiles, then create 4 categories
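One way to get quartile-based categories is the xtile command. This is a sketch on the auto data set that ships with Stata; the variable names quart and pricecat are made up for the example.

```stata
sysuse auto, clear

* quart = 1, 2, 3, 4 for the lowest through highest quarter of price
xtile quart = price, nquantiles(4)

* One 0/1 dummy per category: pricecat1 ... pricecat4
tabulate quart, generate(pricecat)
summarize pricecat1-pricecat4
```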
Create New Variables
• Example: Turn age from a continuous variable into four categories: 0-39, 40-49, 50-59, 60+
generate age039=0
replace age039=1 if age < 40
generate age4049=0
replace age4049=1 if age >= 40 & age < 50
generate age5059=0
replace age5059=1 if age >= 50 & age < 60
generate age60=0
* Missing (.) counts as larger than any number, so exclude it from 60+
replace age60=1 if age >= 60 & age < .
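An alternative sketch that builds the same four age groups with less typing, using egen's cut() function followed by tabulate with the generate() option. The toy ages and the names agecat and agegrp are assumptions for illustration, and 200 is an arbitrary upper limit chosen to sit above any plausible age.

```stata
clear
input age
25
40
49.5
59
60
83
end

* Each cutpoint is the lower bound of a group: [0,40) [40,50) [50,60) [60,200)
egen agecat = cut(age), at(0,40,50,60,200)

* One 0/1 dummy per group: agegrp1 ... agegrp4
tabulate agecat, generate(agegrp)
list
```

The explicit generate/replace version is longer but easier to read line by line, which may matter more when you are starting out.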
Dealing with Missing Values
• Most Stata procedures leave out observations with missing values, so those observations do not contribute to the results
• Missing numeric values are stored as a dot (“.”)
• Can refer to missing values in code by referring to the dot
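A short sketch of how missing values behave in conditions, using three made-up ages. The point to notice is that Stata treats a numeric missing value as larger than any number.

```stata
clear
input age
35
62
.
end

* Count the missing values: 1 observation
count if missing(age)

* This counts 2 observations, because the missing age also satisfies >= 60
count if age >= 60

* Excluding missing explicitly gives the intended answer of 1
count if age >= 60 & !missing(age)
```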
Dealing with Missing Values
• There are two options for dealing with missing values
  1. Drop the observation altogether
  2. Create a category for missing values
• Use option (1) if only a small proportion of observations are missing
• Use option (2) if a relatively large proportion of observations are missing
Dropping Observations
• Use the drop command to delete an observation with a missing value
• For example, to drop patients where male is missing
  – drop if male == .
Create a Missing Category
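• To create a missing category, generate a new category
• For example, if age has a missing value:
generate age039=0
replace age039=1 if age < 40
generate age4049=0
replace age4049=1 if age >= 40 & age < 50
generate age5059=0
replace age5059=1 if age >= 50 & age < 60
generate age60=0
* Missing (.) counts as larger than any number, so exclude it from 60+
replace age60=1 if age >= 60 & age < .
* Code the indicator 0/1; generate with if alone would leave the other rows missing
generate age_missing=0
replace age_missing=1 if age == .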
Create a Missing Category
Errors with generate
• Stata will not let you overwrite a data set or variable unintentionally
• If you need to load a new data set, or reload your data set from text, run the clear command
• If you make a mistake in your code after you generate a new variable
  – Drop that variable
  – Then run your generate command again
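A sketch of the drop-and-regenerate step written so that a do-file can be re-run without errors. It uses the auto data set that ships with Stata, and the variable name expensive is made up for the example.

```stata
sysuse auto, clear
generate expensive = 0
replace expensive = 1 if price > 6000

* Running "generate expensive" again would fail: the variable already exists.
* capture drop removes it if present and stays silent if it is absent.
capture drop expensive
generate expensive = (price > 6000)
```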
Creating Subsets
• Stata holds a single data set in RAM at a time
• To create a subset of observations
  – All men, all adults, patients with complete data, etc.
• Use drop if command to drop all other observations
• Use keep if command to keep only observations of interest
Creating Subsets
• To look only at male patients the following commands produce equivalent results:
keep if male==1
drop if male==0
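Because keep and drop remove observations from memory, it can help to subset only temporarily. Here is a sketch of two alternatives on the auto data set that ships with Stata (foreign plays the role of the male dummy).

```stata
sysuse auto, clear

* Option A: restrict a single command with an if qualifier
summarize price if foreign == 1

* Option B: subset temporarily, then get the full data set back
preserve
keep if foreign == 1
summarize price
restore

* All observations are available again
count
```

Note also that keep if male==1 and drop if male==0 only agree when male has no missing values.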
Dropping Variables
• To remove unwanted variables, use the drop command (without the if)
• For example, if we wanted to remove the sex (M or F) variable after creating male and female dummy variables, use
drop sex
Saving Data Sets
• After you have completed all data manipulations, save the data set as a native Stata data set
• Command is save
save ltd, replace
• Must use the replace option if you save more than once (which is always!)
Loading Saved Data Sets
• To load a Stata data set that you have previously saved, the command is use
clear
use ltd
• Always start with clear
– Stata will not let you overwrite data unintentionally
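The two steps can also be combined, since use accepts clear as an option. A sketch with the auto data set that ships with Stata; the file name auto_copy is made up for the example.

```stata
sysuse auto, clear
save auto_copy, replace

* Equivalent to running clear and then use auto_copy
use auto_copy, clear
describe, short
```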
Summarizing Data: Tables
• Use tabulate to produce a simple summary of counts of elements in the variable
  – For example, tabulate female

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        419       53.93       53.93
          1 |        358       46.07      100.00
------------+-----------------------------------
      Total |        777      100.00
Summarizing Data: Tables
• Can also get a cross-tabulation by listing two variables:
tabulate female ssi, row col
           |          ssi
    female |         0          1 |     Total
-----------+----------------------+----------
         0 |       246        173 |       419
           |     58.71      41.29 |    100.00
           |     50.72      59.25 |     53.93
-----------+----------------------+----------
         1 |       239        119 |       358
           |     66.76      33.24 |    100.00
           |     49.28      40.75 |     46.07
-----------+----------------------+----------
     Total |       485        292 |       777
           |     62.42      37.58 |    100.00
           |    100.00     100.00 |    100.00
Summarizing Data: Tables
• To obtain a table of basic summary statistics, use the command summarize
• For example
  – summarize age female male black nonblack

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       777    44.91645    17.29986   .2575342   77.53151
        male |       777    .5392535    .4987778          0          1
      female |       777    .4607465    .4987778          0          1
       black |       777     .043758    .2046881          0          1
    nonblack |       777     .956242    .2046881          0          1
-------------+--------------------------------------------------------
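For a Table 1 that is stratified by a key outcome, the tabstat command can produce the same statistics by group. This is a sketch on the auto data set that ships with Stata, where foreign stands in for the stratifying variable.

```stata
sysuse auto, clear

* N, mean and SD of each variable, separately by group and overall
tabstat price mpg weight, by(foreign) statistics(n mean sd) columns(statistics)
```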
Exporting Summaries
• Stata output is generally not aesthetically pleasing enough to place directly into papers
• Best to move summary data into Excel for formatting, then copy to paper
• To move summary data, highlight desired table, right click, and select Copy Table
• This will paste into Excel cells
Analyst Pro Tip!!
• To create tables for publications, copy the raw data from Stata into Excel, but DO NOT FORMAT IT
• Instead, create a formatted table next to the raw table and use formulas to create a clean, publication quality table
• This tip will save you mounds of time
  – It will probably get you tenure
Copy Table in Stata
Statistical Tests
• Suppose you want to know whether patients with SSI were older than those without an SSI
• To perform a t test: ttest depvar, by(indepvar)
  – Example: ttest age, by(ssi)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     485    45.82019    .7510179    16.53945    44.34453    47.29585
       1 |     292    43.41537    1.078256    18.42524     41.2932    45.53754
---------+--------------------------------------------------------------------
combined |     777    44.91645    .6206292    17.29986    43.69814    46.13476
---------+--------------------------------------------------------------------
    diff |            2.404821    1.279332               -.1065454    4.916186
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   1.8797
Ho: diff = 0                                     degrees of freedom =      775

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9697         Pr(|T| > |t|) = 0.0605          Pr(T > t) = 0.0303
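The number usually reported in Table 1 is the two-sided p-value, Pr(|T| > |t|). It can also be read from the results that ttest stores, as in this sketch on the auto data set that ships with Stata (mpg and foreign stand in for age and ssi).

```stata
sysuse auto, clear
ttest mpg, by(foreign)

* Two-sided p-value and t statistic from the test just run
display "t = " r(t) "   p = " r(p)

* Welch version, which does not assume equal variances in the two groups
ttest mpg, by(foreign) unequal
```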
Statistical Tests
• Suppose you want to know whether patients with SSI have a higher mortality rate
• To perform a chi-square test:
tabulate depvar indepvar, row col chi2
Example: tabulate died ssi, row col chi2
+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
+-------------------+

           |          ssi
      died |         0          1 |     Total
-----------+----------------------+----------
         0 |       407        227 |       634
           |     64.20      35.80 |    100.00
           |     83.92      77.74 |     81.60
-----------+----------------------+----------
         1 |        78         65 |       143
           |     54.55      45.45 |    100.00
           |     16.08      22.26 |     18.40
-----------+----------------------+----------
     Total |       485        292 |       777
           |     62.42      37.58 |    100.00
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   4.6322   Pr = 0.031
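When some cells have small counts, Fisher's exact test is a common alternative to the chi-square test, and tabulate can report both. A sketch on the auto data set that ships with Stata; the dummy expensive is made up so that the table has two binary variables.

```stata
sysuse auto, clear
generate expensive = (price > 6000)

* chi2 requests the Pearson chi-square test, exact adds Fisher's exact test
tabulate expensive foreign, row col chi2 exact

* The test results are also stored for later use
display "chi2 = " r(chi2) "   p = " r(p) "   Fisher p = " r(p_exact)
```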
Statistical Tests
• Assume you want to know whether LOS differs across patients with better matched organs
• To perform an ANOVA: anova depvar indepvar
Example: anova los abmm
                  Number of obs =     777     R-squared     =  0.0168
                  Root MSE      = 29.2585     Adj R-squared =  0.0118

    Source |  Partial SS    df        MS          F     Prob > F
-----------+----------------------------------------------------
     Model |   11325.344     4   2831.33599      3.31     0.0106
           |
      abmm |   11325.344     4   2831.33599      3.31     0.0106
           |
  Residual |  660879.608   772   856.061669
-----------+----------------------------------------------------
     Total |  672204.952   776   866.243495
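For a single grouping variable, the oneway command runs the same F test and can add group means and pairwise comparisons. A sketch on the auto data set that ships with Stata (mpg and rep78 stand in for los and abmm).

```stata
sysuse auto, clear

* One-way ANOVA of mpg across repair-record categories
anova mpg rep78

* Same F test, plus group means and Bonferroni-adjusted pairwise comparisons
oneway mpg rep78, tabulate bonferroni
```

The pairwise comparisons help answer the follow-up question of which groups differ once the overall F test is significant.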
Stata Graphics
• Stata has excellent facilities for graphics
• The overall look and feel of a Stata graph is determined by a “scheme”
  – Schemes are predefined graphics parameters that determine all aspects of the graph
• To see what schemes are available: graph query, schemes
• To set a scheme: set scheme economist
Histogram
• To obtain a basic histogram of varname, type: histogram varname
• For example, a histogram of age
Schemes Again
• Change scheme to Economist: set scheme economist
Schemes
• sj scheme
Graphics Options
• There are options for
  – Adding a title (title)
  – Altering the scale of the axes (xscale, yscale)
  – Specifying what axis titles to use (xtitle, ytitle)
  – Changing the markers used (msymbol)
• For example, to finish our histogram
histogram age, title("Distribution of Age") xtitle("Age at Transplant") ytitle("Density")
Final Histogram
Scatterplot
• To display a scatterplot of two (or more) variables, type: scatter costs los
• Other options apply
Exporting Graphs
• To export your graph use
  – graph export filename.ext, as(type)
• Use pdf or eps for your graphs:
  – graph export graph1.pdf, as(pdf)
  – graph export graph1.eps, as(eps)
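A sketch of the full draw-and-export sequence on the auto data set that ships with Stata. The file names are placeholders, and PDF export is assumed to be available on your platform. When as() is omitted, Stata infers the file type from the extension.

```stata
sysuse auto, clear

histogram mpg, title("Distribution of Mileage") xtitle("Miles per Gallon") ytitle("Density")

* replace overwrites an existing file of the same name
graph export "mpg_hist.pdf", replace

* PNG is a raster format; width() sets the image width in pixels
graph export "mpg_hist.png", replace width(2000)
```

As with save, the replace option matters because a do-file is normally run more than once.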
Homework
• Get the liver transplant data set from the website
• Reproduce the table on the next slide
  – Generate the numbers using the summarize command
  – Move the data into Excel
  – Recreate table in Excel
  – Perform statistical tests and insert p-values
• Reproduce the graph on the following slide