Date post: | 31-Mar-2015 |
Category: |
Documents |
Upload: | dylan-jowell |
View: | 231 times |
Download: | 3 times |
Optimizing Stata for Analysis of Large Data Sets
Joseph Canner, MHS
Eric Schneider, PhD
Johns Hopkins University
Stata Conference
New Orleans, LA
July 19, 2013
Background
• Programmer/Statistician: 20 years experience with SAS
• Took new job and started using Stata in January 2013
• Reviewed many do-files from predecessors and colleagues in order to learn Stata and understand new job
Caveats
• Large data sets: irrelevant if you don’t use large data sets and/or if you don’t have a system that has sufficient memory to analyze large data sets
• Coding practices: these are examples from real users, but not necessarily trained programmers or Stata experts
Benchmark Testing
• NIS 2010 Core (unless noted otherwise)• 7,800,441 observations• 155 variables• 5.6 Gb memory• 25 ICD-9 diagnosis codes (DX1-DX25)• 15 ICD-9 procedure codes (PR1-PR15)
Benchmark Testing
• Testing code:timer clear 1timer on 1 …Code to be tested…timer off 1timer list 1
• Groups of tests always run at the same time to eliminate issues with different server/memory/usage conditions– 24 core CPU, 256 Gb RAM (50% load), Windows 2008
Test #1: Coding ICD-9 variables
• Option 1:gen FOREACH=0forvalues x = 1/15 { foreach value in "7359" "741" "9955" "640" { replace FOREACH=1 if PR`x'=="`value'" }}
• Time=27.6 sec
Test #1: Coding ICD-9 variables
• Option 2:gen IFOR=0forvalues x = 1/15 { replace IFOR=1 if PR`x'=="7359" | PR`x'=="741" | PR`x'=="9955" | PR`x'=="640"}
• Time=13.2 (half the time!)
Test #1: Coding ICD-9 variables
• Option 3:gen INLIST=0forvalues x = 1/15 { replace INLIST=1 if inlist(PR`x',"7359","741", "9955","640")}
• Time=9.6 sec (a little better than Option 2, and easier to write and read)
Test #1a: Coding single ICD-9 variablesinlist() vs. recode
• Option 1:gen INLIST1=0replace INLIST1=1 if inlist(PR1,"7359","741","9955","640", "9904","8154","7569","3893")
• Time=1.2 sec
Test #1a: Coding single ICD-9 variablesinlist() vs. recode
• Option 2a:destring PR1, gen(tempPR1) ignore("incvl")recode tempPR1 (7359 741 9955 640 9904 8154 7569 3893 = 1) (else=0), gen(RECODE)drop tempPR1
• Time=118.1 sec (Ouch! Much of the time is devoted to the destring command)
Test #1a: Coding single ICD-9 variablesinlist() vs. recode
• Option 2b (use real() instead of destring):gen tempPR1=real(PR1)recode tempPR1 (7359 741 9955 640 9904 8154 7569 3893 = 1) (else=0), gen(RECODE)drop tempPR1
• Time=26.0 sec (much better than destring, but still much slower than inlist())
Test #1b: Coding single ICD-9 variables when there are ranges
• Option 1:split ECODE1, gen(nECODE) parse(E)destring nECODE2, gen(iECODE1)drop nECODE2recode iECODE1 (9200/9209 956 966 986 974 = 1)… (8800/8869 888 9570/9579 9681 9870 =2) (9220/9223 9228 9229 9550/9554 9650/9654 9794 9850/9854 970=3) (8100/8199 9585 9685 9885=4), gen(mech1)recode mech1 (5/10000=5)
• Time= 142.6 sec (Again, split and destring take the bulk of the time here.)
Test #1b: Coding single ICD-9 variables when there are ranges
• Option 2:iECODE1=real(substr(ECODE1,2,4))recode iECODE1 (9200/9209 956 966 986 974 =1)… () () ()…, gen(mech2)recode mech2 (5/10000=5)
• Time= 68.7 sec; better, but…
Test #1b: Coding single ICD-9 variableswhen there are ranges
• Option 3:gen mech3=.replace mech3=1 if (ECODE1>="E9200" & ECODE1<="E9209") | inlist(ECODE1,"E956","E966", "E986","E974")…replace mech3=5 if mech3==. & substr(ECODE1,1,1)=="E"
• Time=5.74 sec (a little harder to write, but much faster!)
Test #1b: Coding single ICD-9 variableswhen there are ranges
• Option 4:gen mech4=.replace mech4=1 if inrange(ECODE1,"E9200”,"E9209") | inlist(ECODE1,"E956","E966", "E986","E974")…replace mech4=5 if mech3==. & substr(ECODE1,1,1)=="E"
• Time=5.32 sec (a little faster still, and much easier to write)
Test #1: Coding ICD-9 VariablesConclusions
• Using inlist() reduces the time required to recode ICD-9 variables by 65% when searching 15 variables for 4 target codes.
• Performance improves to 80% for 8 codes, and continues to improve slightly thereafter, with a maximum improvement of 92%. (Note: inlist() limit is 10 string codes or 255 numeric codes)
• In order to “stress” the test, the codes used in the test are the most popular, but the results are the same for any set of codes.
Test #1: Coding ICD-9 VariablesConclusions (cont’d)
• Using recode is much slower than inlist() for lists of single ICD-9 codes, in large part because of the need to convert from string to numeric
• Using recode for ranges is also much slower than replace/if, for the same reason; inrange() also helps with readability
• Can use real() instead of destring, substring() instead of split
Test #2: Recoding continuous variables
• Option 1:gen AGE1=.replace AGE1=1 if AGE>=0 & AGE <=9replace AGE1=2 if AGE>=10 & AGE <=19…replace AGE1=10 if AGE>=90 & AGE <=120
• Time=6.6 sec
Test #2: Recoding continuous variables
• Option 2:gen AGE2=recode(AGE,9,19,29,39,49, 59,69,79,89,120)
• Time=0.66 sec (exactly one-tenth of the time(!) and easier to write and read)
• Caution: need to be careful with truly continuous variables that you are cutting at the right place
Test #2: Recoding continuous variables
• Option 3:recode AGE (0/9=1) (10/19=2) (20/29=3) (30/39=4) (40/49=5) (50/59=6) (60/69=7) (70/79=8) (80/89=9) (90/120=10), gen(AGE3)
• Time=46.3 sec (Ouch!) and harder to write• May be useful for instances where ranges
are not mutually exclusive (i.e., can’t use recode function)
Test #3: Reordering Values
• Option 1:gen sex_new=sexreplace sex_new=0 if sex_new==3replace sex_new=5 if sex_new==2replace sex_new=4 if sex_new==1replace sex_new=1 if sex_new==5replace sex_new=2 if sex_new==4
• Time=2.0 sec; very cumbersome and hard to follow
Test #3: Reordering Values
• Option 2:recode sex (3=0) (1=2) (2=1), gen(sex_new1)
• Time=15.0 sec (Ouch! ); but, easier to write and MUCH easier to read)
• Can also use recode to do things like:(3 4 = 0) // 3 and 4 are recoded to 0(3/5 = 0) // 3, 4, and 5 are recoded to 0
Test #3: Reordering Values
• Option 3:gen sex_new=sexreplace sex_new=0 if sex==3replace sex_new=1 if sex==2replace sex_new=2 if sex==1
• Time=1.4 sec (Faster than Option #1 by 40% and not too hard to read/write)
Test #4 De-stringing Numeric Values(e.g., NSQIP age)
• Option 1 (Variation of Test #3 Option #1):encode age, gen (age_new)replace age_new=180 if age_new==1…replace age_new=900 if age_new==73replace age_new=18 if age_new==180…replace age_new=90 if age_new==900
• Time=25.8 sec (NSQIP 2011; n=442,149), • Always need to do “tab age_new, nolabel” because
labels are messed up
Test #4 Destringing Numeric Values(e.g., NSQIP age)
• Option 2:destring age, gen(age_new1) ignore(“+”)
• Time=6.3 sec (NSQIP 2011; n=442,149); four times faster!
• Caution: make sure it is clear that 89=89+
Test #4a Removing Characters from ID Numbers (e.g., XXX-XX-XXXX)
• Option 1destring SSN, ignore("-") gen(newSSN1)
• Time=33.0 sec
Test #4a Removing Characters from ID Numbers (e.g., XXX-XX-XXXX)
• Option 2:gen long newSSN2=real(subinstr(SSN,"-","",.))
• Time=1.7 sec; almost 20 times faster!• Only useful if there are a few
characters to get rid of.
Future Tests
• Confirm results for 10 years of NIS (about 80 million observations, nearly 50 Gb RAM)
• Other Stata commands where there are multiple ways to do the same thing…any ideas?
• Other programming practices found reviewing code written by colleagues and students
Implications
• With 10 years of NIS, could save…– 3 minutes per ICD-9 recode– 1 minute per continuous variable categorization– 6 seconds per variable reorder– A lot more if you used recode
• It all adds up! • Might make it less onerous to run recoding and
cleaning programs more often instead of saving new copies of the dataset
• Easier to read programs