Efficient SAS programming with Large Data
Aidan McDermott
Computing Group, March 2007
Axes of Efficiency
• processing speed:
  – CPU
  – real
• storage:
  – disk
  – memory
  – …
• user:
  – functionality
  – interface to other systems
  – ease of use
  – learning
• user development:
  – methodologies
  – reusable code
  – facilitate extension, rewriting
  – maintenance
Dataset / Table
• Datasets consist of three parts
General (and obvious) principles
• Avoid doing the job if possible
• Keep only the data you need to perform a particular task (use DROP, KEEP, WHERE, and subsetting IF statements)
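As a sketch (dataset and variable names are illustrative), DROP=/KEEP= and WHERE= dataset options discard unneeded columns and rows as the data are read, while a subsetting IF filters after the observation reaches the program data vector:

```sas
data recent_admits;
  /* read only the needed variables, and filter rows as they are read */
  set admits(keep=patientID admit length
             where=(admit >= '01JAN2006'd));
  /* the subsetting IF runs after the observation is in the PDV */
  if length < 30;
run;
```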
Combining datasets -- concatenation
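A hedged comparison (dataset names hypothetical): a SET statement reads and rewrites every observation of both inputs, whereas PROC APPEND adds the new observations to the end of the base dataset without re-reading it:

```sas
/* SET reads and rewrites all observations of both datasets */
data all_admits;
  set admits2005 admits2006;
run;

/* PROC APPEND reads only the dataset being added (and modifies the base) */
proc append base=admits2005 data=admits2006;
run;
```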
• Efficient methods have often already been written to perform the required task – use them.
• Efficient methods have often already been written to perform other tasks – use them with caution.
• Write data-driven code – it’s easier to maintain data than to update code.
• Use LENGTH statements to limit the size of variables in a dataset to no more than is needed. – You don’t always know what size this should be, and you don’t always produce your own data.
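A minimal sketch (the sizes are taken from the PDV example later in these slides; the flag variable is hypothetical). LENGTH statements must appear before the SET statement to take effect:

```sas
data admits_small;
  /* declare lengths before the SET so they take effect */
  length patientID $ 6 gender $ 1;
  length flag 3;        /* 0/1 indicators fit in fewer than 8 bytes */
  set admits;
  flag = (length > 14); /* beware: short numeric lengths lose precision
                           for large or fractional values */
run;
```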
• Use formatted data rather than the data itself
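For instance, a user-defined format can group values on the fly, avoiding the creation (and storage) of an extra grouping variable; the age variable and cutpoints below are hypothetical:

```sas
proc format;
  value agegrp low-<18 = 'child'
               18-<65  = 'adult'
               65-high = 'senior';
run;

/* analyze the groups without adding a new variable to the dataset */
proc freq data=patients;
  tables age;
  format age agegrp.;
run;
```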
Memory resident datasets
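One built-in route (available in SAS 9) is the SASFILE statement, which holds a dataset in memory across steps so repeated passes avoid disk I/O; a sketch:

```sas
sasfile work.admits load;          /* read the dataset into memory once */

proc means data=work.admits; run;  /* these steps now read from memory  */
proc freq  data=work.admits; run;

sasfile work.admits close;         /* release the memory */
```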
Compressing Datasets
• Compress datasets with a compression utility such as compress, gzip, winzip, or pkzip, and decompress before running each SAS job. – This delays execution, and you need to keep track of data and program dependencies.
• Use a general purpose compression utility and decompress it within SAS for sequential access. – System dependent (needs a named pipe); sequential dataset storage only.
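On UNIX, for example, a FILENAME PIPE can decompress on the fly; the path and record layout below are hypothetical, and this only supports a sequential start-to-end read:

```sas
/* decompress on the fly; works only for sequential reads */
filename zipped pipe 'gunzip -c /data/admits.csv.gz';

data admits;
  infile zipped dlm=',' dsd firstobs=2;
  input patientID :$6. gender :$1. admit :date9. length;
  format admit date8.;
run;
```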
SAS internal Compression
• Allows random access to the data and is very effective under the right circumstances, although in some cases it doesn’t reduce the size of the data by much.
• “There is a trade-off between data size and CPU time”.
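The COMPRESS= option enables internal compression per dataset or session-wide; a sketch (COMPRESS=CHAR suits long character fields, COMPRESS=BINARY mostly-numeric data):

```sas
/* compress this output dataset only */
data admits_c(compress=char);
  set admits;
run;

/* or turn compression on for every dataset created in the session */
options compress=yes;
```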
• Suppose indata is a large dataset and you want to produce a version of indata without any observations.
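The OBS=0 dataset option is one efficient route: SAS copies the descriptor but never reads an observation (indata/outdata names as in the slide):

```sas
/* obs=0 means no observations are ever read from indata */
data outdata;
  set indata(obs=0);
run;

/* alternative: STOP executes before the SET ever reads a row,
   but the SET still defines the variables at compile time */
data outdata;
  stop;
  set indata;
run;
```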
The data step is a two-stage process:
• compile phase
• execute phase
Data step logic
data step
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y
gender     C     1     n     y
admit      N     8     n     y       date8.
length     N     8     n     y
discharge  N     8     n     n       date8.
_N_
_ERROR_                                      0
PDV: compile phase
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          1
_ERROR_                                      0
PDV: execute phase
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0
PDV: execute phase
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run; /* implicit output */
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0
PDV: execute phase
Name       Type  Size  Drop  Retain  Format  Value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          2
_ERROR_                                      0
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;
PDV: execute phase
Efficiency: suspend the PDV activities
General principles
• Use BY processing whenever you can.
• Given the data below, for each region, siteid, and date, calculate the mean and maximum ozone value.
• Easy:
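Presumably the easy solution is sorted BY-group processing with PROC MEANS; the variable names (region, siteid, date, ozone) are assumed from the task statement:

```sas
proc sort data=ozone;
  by region siteid date;
run;

proc means data=ozone noprint;
  by region siteid date;
  var ozone;
  output out=daily mean=mean_ozone max=max_ozone;
run;
```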
• Suppose there are multiple monitors at each site and you still need to calculate the daily mean? – Combine multiple observations onto one line and then compute the statistics?
• Suppose you want the 10% trimmed mean?
• Suppose you want the second maximum?
  – Use arrays to sort the data?
  – Write your own function?
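Hedged sketches for the harder cases (same assumed ozone dataset as above): PROC UNIVARIATE can produce trimmed means via its TRIMMED= option and ODS output, and a descending sort plus a BY-group counter picks out the second maximum without arrays or custom functions:

```sas
/* 10% trimmed mean per group, captured from the TrimmedMeans ODS table */
proc univariate data=ozone trimmed=0.1;
  by region siteid date;
  var ozone;
  ods output TrimmedMeans=trimmed;
run;

/* second maximum: sort descending, keep the 2nd observation per group */
proc sort data=ozone;
  by region siteid date descending ozone;
run;

data secondmax;
  set ozone;
  by region siteid date;
  if first.date then seq = 0;
  seq + 1;       /* sum statement: seq is retained across iterations */
  if seq = 2;    /* keep only the second-largest ozone in each group */
run;
```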