Efficient SAS programming with Large Data
Aidan McDermott
Computing Group, March 2007
Axes of Efficiency
• processing speed:
  – CPU
  – real
• storage:
  – disk
  – memory
  – …
• user:
  – functionality
  – interface to other systems
  – ease of use
  – learning
• user development:
  – methodologies
  – reusable code
  – facilitate extension, rewriting
  – maintenance
Dataset / Table
• Datasets consist of three parts
General (and obvious) principles
• Avoid doing the job if possible
• Keep only the data you need to perform a particular task (use DROP, KEEP, WHERE, and subsetting IF statements)
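As a sketch (dataset and variable names are illustrative), DROP=/KEEP= and WHERE= dataset options discard unneeded columns and rows as the data are read, while a subsetting IF filters after the observation reaches the program data vector:

```sas
data recent_admits;
  /* read only the needed variables, and filter rows as they are read */
  set admits(keep=patientID admit length
             where=(admit >= '01JAN2006'd));
  /* the subsetting IF runs after the observation is in the PDV */
  if length < 30;
run;
```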
Combining datasets -- concatenation
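A hedged comparison (dataset names hypothetical): a SET statement reads and rewrites every observation of both inputs, whereas PROC APPEND adds the new observations to the end of the base dataset without re-reading it:

```sas
/* SET reads and rewrites all observations of both datasets */
data all_admits;
  set admits2005 admits2006;
run;

/* PROC APPEND reads only the dataset being added (and modifies the base) */
proc append base=admits2005 data=admits2006;
run;
```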
• Efficient methods have often already been written to perform the required task – use them.
• Efficient methods have often already been written to perform other tasks – use them with caution.
• Write data-driven code – it’s easier to maintain data than to update code.
• Use LENGTH statements to limit the size of variables in a dataset to no more than is needed. – You don’t always know what size this should be, and you don’t always produce your own data.
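A minimal sketch (the sizes are taken from the PDV example later in these slides; the flag variable is hypothetical). LENGTH statements must appear before the SET statement to take effect:

```sas
data admits_small;
  /* declare lengths before the SET so they take effect */
  length patientID $ 6 gender $ 1;
  length flag 3;        /* 0/1 indicators fit in fewer than 8 bytes */
  set admits;
  flag = (length > 14); /* beware: short numeric lengths lose precision
                           for large or fractional values */
run;
```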
• Use formatted data rather than the data itself
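For instance, a user-defined format can group values on the fly, avoiding the creation (and storage) of an extra grouping variable; the age variable and cutpoints below are hypothetical:

```sas
proc format;
  value agegrp low-<18 = 'child'
               18-<65  = 'adult'
               65-high = 'senior';
run;

/* analyze the groups without adding a new variable to the dataset */
proc freq data=patients;
  tables age;
  format age agegrp.;
run;
```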
Memory resident datasets
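One built-in route (available in SAS 9) is the SASFILE statement, which holds a dataset in memory across steps so repeated passes avoid disk I/O; a sketch:

```sas
sasfile work.admits load;          /* read the dataset into memory once */

proc means data=work.admits; run;  /* these steps now read from memory  */
proc freq  data=work.admits; run;

sasfile work.admits close;         /* release the memory */
```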
Compressing Datasets
• Compress datasets with a compression utility such as compress, gzip, winzip, or pkzip, and decompress before running each SAS job. – This delays execution, and you need to keep track of data and program dependencies.
• Use a general purpose compression utility and decompress it within SAS for sequential access. – System dependent (needs a named pipe); sequential dataset storage only.
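On UNIX, for example, a FILENAME PIPE can decompress on the fly; the path and record layout below are hypothetical, and this only supports a sequential start-to-end read:

```sas
/* decompress on the fly; works only for sequential reads */
filename zipped pipe 'gunzip -c /data/admits.csv.gz';

data admits;
  infile zipped dlm=',' dsd firstobs=2;
  input patientID :$6. gender :$1. admit :date9. length;
  format admit date8.;
run;
```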
SAS internal Compression
• Allows random access to the data and is very effective under the right circumstances, although in some cases it doesn’t reduce the size of the data by much.
• “There is a trade-off between data size and CPU time”.
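The COMPRESS= option enables internal compression per dataset or session-wide; a sketch (COMPRESS=CHAR suits long character fields, COMPRESS=BINARY mostly-numeric data):

```sas
/* compress this output dataset only */
data admits_c(compress=char);
  set admits;
run;

/* or turn compression on for every dataset created in the session */
options compress=yes;
```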
• Suppose indata is a large dataset and you want to produce a version of indata without any observations.
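The OBS=0 dataset option is one efficient route: SAS copies the descriptor but never reads an observation (indata/outdata names as in the slide):

```sas
/* obs=0 means no observations are ever read from indata */
data outdata;
  set indata(obs=0);
run;

/* alternative: STOP executes before the SET ever reads a row,
   but the SET still defines the variables at compile time */
data outdata;
  stop;
  set indata;
run;
```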
The data step is a two-stage process:
• compile phase
• execute phase
Data step logic
data step
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y
gender     C     1     n     y
admit      N     8     n     y       date8.
length     N     8     n     y
discharge  N     8     n     n       date8.
_N_
_ERROR_                                      0
PDV: compile phase
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          1
_ERROR_                                      0
PDV: execute phase
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0
PDV: execute phase
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run; /* implicit output */
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0
PDV: execute phase
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          2
_ERROR_                                      0
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;
PDV: execute phase
Efficiency: suspend the PDV activities
General principles
• Use BY processing whenever you can.
• Given the data below, for each region, siteid, and date, calculate the mean and maximum ozone value.
• Easy:
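Presumably the easy solution is sorted BY-group processing with PROC MEANS; the variable names (region, siteid, date, ozone) are assumed from the task statement:

```sas
proc sort data=ozone;
  by region siteid date;
run;

proc means data=ozone noprint;
  by region siteid date;
  var ozone;
  output out=daily mean=mean_ozone max=max_ozone;
run;
```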
• Suppose there are multiple monitors at each site and you still need to calculate the daily mean? – Combine multiple observations onto one line and then compute the statistics?
• Suppose you want the 10% trimmed mean?
• Suppose you want the second maximum?
  – Use arrays to sort the data?
  – Write your own function?
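Hedged sketches for the harder cases (same assumed ozone dataset as above): PROC UNIVARIATE can produce trimmed means via its TRIMMED= option and ODS output, and a descending sort plus a BY-group counter picks out the second maximum without arrays or custom functions:

```sas
/* 10% trimmed mean per group, captured from the TrimmedMeans ODS table */
proc univariate data=ozone trimmed=0.1;
  by region siteid date;
  var ozone;
  ods output TrimmedMeans=trimmed;
run;

/* second maximum: sort descending, keep the 2nd observation per group */
proc sort data=ozone;
  by region siteid date descending ozone;
run;

data secondmax;
  set ozone;
  by region siteid date;
  if first.date then seq = 0;
  seq + 1;       /* sum statement: seq is retained across iterations */
  if seq = 2;    /* keep only the second-largest ozone in each group */
run;
```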