+ All Categories
Home > Documents > Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns...

Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns...

Date post: 31-Mar-2015
Category:
Upload: dylan-jowell
View: 231 times
Download: 3 times
Share this document with a friend
Popular Tags:
29
Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July 19, 2013
Transcript
Page 1: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Optimizing Stata for Analysis of Large Data Sets

Joseph Canner, MHS

Eric Schneider, PhD

Johns Hopkins University

Stata Conference

New Orleans, LA

July 19, 2013

Page 2: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Background

• Programmer/Statistician: 20 years experience with SAS

• Took new job and started using Stata in January 2013

• Reviewed many do-files from predecessors and colleagues in order to learn Stata and understand new job

Page 3: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Caveats

• Large data sets: irrelevant if you don’t use large data sets and/or if you don’t have a system that has sufficient memory to analyze large data sets

• Coding practices: these are examples from real users, but not necessarily trained programmers or Stata experts

Page 4: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Benchmark Testing

• NIS 2010 Core (unless noted otherwise)• 7,800,441 observations• 155 variables• 5.6 Gb memory• 25 ICD-9 diagnosis codes (DX1-DX25)• 15 ICD-9 procedure codes (PR1-PR15)

Page 5: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Benchmark Testing

• Testing code:timer clear 1timer on 1 …Code to be tested…timer off 1timer list 1

• Groups of tests always run at the same time to eliminate issues with different server/memory/usage conditions– 24 core CPU, 256 Gb RAM (50% load), Windows 2008

Page 6: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1: Coding ICD-9 variables

• Option 1:gen FOREACH=0forvalues x = 1/15 { foreach value in "7359" "741" "9955" "640" { replace FOREACH=1 if PR`x'=="`value'" }}

• Time=27.6 sec

Page 7: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1: Coding ICD-9 variables

• Option 2:gen IFOR=0forvalues x = 1/15 { replace IFOR=1 if PR`x'=="7359" | PR`x'=="741" | PR`x'=="9955" | PR`x'=="640"}

• Time=13.2 (half the time!)

Page 8: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1: Coding ICD-9 variables

• Option 3:gen INLIST=0forvalues x = 1/15 { replace INLIST=1 if inlist(PR`x',"7359","741", "9955","640")}

• Time=9.6 sec (a little better than Option 2, and easier to write and read)

Page 9: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1a: Coding single ICD-9 variablesinlist() vs. recode

• Option 1:gen INLIST1=0replace INLIST1=1 if inlist(PR1,"7359","741","9955","640", "9904","8154","7569","3893")

• Time=1.2 sec

Page 10: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1a: Coding single ICD-9 variablesinlist() vs. recode

• Option 2a:destring PR1, gen(tempPR1) ignore("incvl")recode tempPR1 (7359 741 9955 640 9904 8154 7569 3893 = 1) (else=0), gen(RECODE)drop tempPR1

• Time=118.1 sec (Ouch! Much of the time is devoted to the destring command)

Page 11: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1a: Coding single ICD-9 variablesinlist() vs. recode

• Option 2b (use real() instead of destring):gen tempPR1=real(PR1)recode tempPR1 (7359 741 9955 640 9904 8154 7569 3893 = 1) (else=0), gen(RECODE)drop tempPR1

• Time=26.0 sec (much better than destring, but still much slower than inlist())

Page 12: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1b: Coding single ICD-9 variables when there are ranges

• Option 1:split ECODE1, gen(nECODE) parse(E)destring nECODE2, gen(iECODE1)drop nECODE2recode iECODE1 (9200/9209 956 966 986 974 = 1)… (8800/8869 888 9570/9579 9681 9870 =2) (9220/9223 9228 9229 9550/9554 9650/9654 9794 9850/9854 970=3) (8100/8199 9585 9685 9885=4), gen(mech1)recode mech1 (5/10000=5)

• Time= 142.6 sec (Again, split and destring take the bulk of the time here.)

Page 13: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1b: Coding single ICD-9 variables when there are ranges

• Option 2:iECODE1=real(substr(ECODE1,2,4))recode iECODE1 (9200/9209 956 966 986 974 =1)… () () ()…, gen(mech2)recode mech2 (5/10000=5)

• Time= 68.7 sec; better, but…

Page 14: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1b: Coding single ICD-9 variableswhen there are ranges

• Option 3:gen mech3=.replace mech3=1 if (ECODE1>="E9200" & ECODE1<="E9209") | inlist(ECODE1,"E956","E966", "E986","E974")…replace mech3=5 if mech3==. & substr(ECODE1,1,1)=="E"

• Time=5.74 sec (a little harder to write, but much faster!)

Page 15: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1b: Coding single ICD-9 variableswhen there are ranges

• Option 4:gen mech4=.replace mech4=1 if inrange(ECODE1,"E9200”,"E9209") | inlist(ECODE1,"E956","E966", "E986","E974")…replace mech4=5 if mech3==. & substr(ECODE1,1,1)=="E"

• Time=5.32 sec (a little faster still, and much easier to write)

Page 16: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1: Coding ICD-9 VariablesConclusions

• Using inlist() reduces the time required to recode ICD-9 variables by 65% when searching 15 variables for 4 target codes.

• Performance improves to 80% for 8 codes, and continues to improve slightly thereafter, with a maximum improvement of 92%. (Note: inlist() limit is 10 string codes or 255 numeric codes)

• In order to “stress” the test, the codes used in the test are the most popular, but the results are the same for any set of codes.

Page 17: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #1: Coding ICD-9 VariablesConclusions (cont’d)

• Using recode is much slower than inlist() for lists of single ICD-9 codes, in large part because of the need to convert from string to numeric

• Using recode for ranges is also much slower than replace/if, for the same reason; inrange() also helps with readability

• Can use real() instead of destring, substring() instead of split

Page 18: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #2: Recoding continuous variables

• Option 1:gen AGE1=.replace AGE1=1 if AGE>=0 & AGE <=9replace AGE1=2 if AGE>=10 & AGE <=19…replace AGE1=10 if AGE>=90 & AGE <=120

• Time=6.6 sec

Page 19: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #2: Recoding continuous variables

• Option 2:gen AGE2=recode(AGE,9,19,29,39,49, 59,69,79,89,120)

• Time=0.66 sec (exactly one-tenth of the time(!) and easier to write and read)

• Caution: need to be careful with truly continuous variables that you are cutting at the right place

Page 20: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #2: Recoding continuous variables

• Option 3:recode AGE (0/9=1) (10/19=2) (20/29=3) (30/39=4) (40/49=5) (50/59=6) (60/69=7) (70/79=8) (80/89=9) (90/120=10), gen(AGE3)

• Time=46.3 sec (Ouch!) and harder to write• May be useful for instances where ranges

are not mutually exclusive (i.e., can’t use recode function)

Page 21: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #3: Reordering Values

• Option 1:gen sex_new=sexreplace sex_new=0 if sex_new==3replace sex_new=5 if sex_new==2replace sex_new=4 if sex_new==1replace sex_new=1 if sex_new==5replace sex_new=2 if sex_new==4

• Time=2.0 sec; very cumbersome and hard to follow

Page 22: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #3: Reordering Values

• Option 2:recode sex (3=0) (1=2) (2=1), gen(sex_new1)

• Time=15.0 sec (Ouch! ); but, easier to write and MUCH easier to read)

• Can also use recode to do things like:(3 4 = 0) // 3 and 4 are recoded to 0(3/5 = 0) // 3, 4, and 5 are recoded to 0

Page 23: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #3: Reordering Values

• Option 3:gen sex_new=sexreplace sex_new=0 if sex==3replace sex_new=1 if sex==2replace sex_new=2 if sex==1

• Time=1.4 sec (Faster than Option #1 by 40% and not too hard to read/write)

Page 24: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #4 De-stringing Numeric Values(e.g., NSQIP age)

• Option 1 (Variation of Test #3 Option #1):encode age, gen (age_new)replace age_new=180 if age_new==1…replace age_new=900 if age_new==73replace age_new=18 if age_new==180…replace age_new=90 if age_new==900

• Time=25.8 sec (NSQIP 2011; n=442,149), • Always need to do “tab age_new, nolabel” because

labels are messed up

Page 25: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #4 Destringing Numeric Values(e.g., NSQIP age)

• Option 2:destring age, gen(age_new1) ignore(“+”)

• Time=6.3 sec (NSQIP 2011; n=442,149); four times faster!

• Caution: make sure it is clear that 89=89+

Page 26: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #4a Removing Characters from ID Numbers (e.g., XXX-XX-XXXX)

• Option 1destring SSN, ignore("-") gen(newSSN1)

• Time=33.0 sec

Page 27: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Test #4a Removing Characters from ID Numbers (e.g., XXX-XX-XXXX)

• Option 2:gen long newSSN2=real(subinstr(SSN,"-","",.))

• Time=1.7 sec; almost 20 times faster!• Only useful if there are a few

characters to get rid of.

Page 28: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Future Tests

• Confirm results for 10 years of NIS (about 80 million observations, nearly 50 Gb RAM)

• Other Stata commands where there are multiple ways to do the same thing…any ideas?

• Other programming practices found reviewing code written by colleagues and students

Page 29: Optimizing Stata for Analysis of Large Data Sets Joseph Canner, MHS Eric Schneider, PhD Johns Hopkins University Stata Conference New Orleans, LA July.

Implications

• With 10 years of NIS, could save…– 3 minutes per ICD-9 recode– 1 minute per continuous variable categorization– 6 seconds per variable reorder– A lot more if you used recode

• It all adds up! • Might make it less onerous to run recoding and

cleaning programs more often instead of saving new copies of the dataset

• Easier to read programs


Recommended