Efficient SAS programming with Large Data

Date post: 13-Jan-2016
Page 1: Efficient SAS programming with Large Data

Efficient SAS programming with Large Data

Aidan McDermott

Computing Group, March 2007

Page 2: Efficient SAS programming with Large Data

Axes of Efficiency

• processing speed:
  – CPU
  – real

• storage:
  – disk
  – memory
  – …

• user:
  – functionality
  – interface to other systems
  – ease of use
  – learning

• user development:
  – methodologies
  – reusable code
  – facilitate extension, rewriting
  – maintenance

Page 3: Efficient SAS programming with Large Data

Dataset / Table

Page 4: Efficient SAS programming with Large Data

• Datasets consist of three parts

Page 5: Efficient SAS programming with Large Data

General (and obvious) principles

• Avoid doing the job if possible

• Keep only the data you need to perform a particular task (use drop, keep, where, and subsetting if)
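
A minimal sketch of this principle, assuming a small stand-in dataset named admits (the variable names follow the later slides, except stay, which is hypothetical; all values are made up). The keep= and where= dataset options filter on input, so unwanted variables and observations never enter the data step.

```sas
/* Stand-in for a large dataset; values are invented for illustration. */
data admits;
  length patientID $6 gender $1;
  input patientID $ gender $ admit stay;
  format admit date8.;
  datalines;
321C-4 M 15736 21
118A-2 F 15740 3
207B-9 F 15751 10
;
run;

/* Read only the variables and observations the task needs. */
data female_admits;
  set admits(keep=patientID gender admit
             where=(gender = 'F'));
run;

/* A subsetting IF also works, but it is applied after each
   observation has already been read into the data step. */
data long_stays;
  set admits;
  if stay > 7;
  drop gender;
run;
```

On a large dataset the first form usually does less work, because the filtering happens before the observation reaches the data step.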

Page 6: Efficient SAS programming with Large Data

Combining datasets -- concatenation

Page 7: Efficient SAS programming with Large Data

General (and obvious) principles

• Often efficient methods were written to perform the required task – use them.
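
Concatenation is one case where a purpose-built procedure exists. A sketch, assuming two small hypothetical datasets, master and newrows, with the same variables: PROC APPEND adds the new observations to the end of the base dataset without reading the observations already in it, whereas a data step with a SET statement reads and rewrites both datasets.

```sas
/* Hypothetical datasets; in practice master would be the large one. */
data master;
  input id value;
  datalines;
1 10
2 20
3 30
;
run;

data newrows;
  input id value;
  datalines;
4 40
5 50
;
run;

/* Data step concatenation: every observation of master is read and
   written again.
   data master; set master newrows; run;                            */

/* PROC APPEND: only newrows is read; its observations are added to
   the end of master. */
proc append base=master data=newrows;
run;
```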

Page 8: Efficient SAS programming with Large Data

General (and obvious) principles

• Often efficient methods were written to perform other tasks – use them with caution.

• Write data-driven code
  – it’s easier to maintain data than to update code

• Use length statements to limit the size of variables in a dataset to no more than is needed.
  – you don’t always know what size this should be, and you don’t always produce your own data.

• Use formatted data rather than the data itself
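
The last two points can be sketched together. This is one reading of them, not necessarily what the original slide showed: store a short code with an explicit length, and let a format supply the descriptive text at display time. The format name $sexfmt and the data values are hypothetical.

```sas
/* User-defined format that maps one-character codes to labels. */
proc format;
  value $sexfmt 'M' = 'Male'
                'F' = 'Female';
run;

data patients;
  /* Declare lengths explicitly so each variable is no wider than needed. */
  length patientID $6 gender $1;
  input patientID $ gender $;
  format gender $sexfmt.;
  datalines;
321C-4 M
118A-2 F
207B-9 F
;
run;

/* The table shows the labels; the dataset still stores one byte per value. */
proc freq data=patients;
  tables gender;
run;
```

Changing a label then means editing the format, not rewriting the data.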

Page 9: Efficient SAS programming with Large Data

Memory resident datasets
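
One way to hold a dataset in memory across several steps is the SASFILE statement. The sketch below assumes a small hypothetical dataset, work.lookup, that fits in available memory; whether this helps depends on how often the data are reread and on how much memory the session has.

```sas
/* Hypothetical dataset that several later steps will read. */
data work.lookup;
  do id = 1 to 1000;
    value = id * 2;
    output;
  end;
run;

/* Open the file, allocate buffers, and read the data into memory. */
sasfile work.lookup load;

/* These steps read work.lookup from the in-memory buffers. */
proc means data=work.lookup;
  var value;
run;

proc print data=work.lookup(obs=5);
run;

/* Release the buffers and close the file. */
sasfile work.lookup close;
```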

Page 10: Efficient SAS programming with Large Data

Compressing Datasets

• Compress datasets with a compression utility such as compress, gzip, winzip, or pkzip, and decompress before running each SAS job.
  – this delays execution, and you need to keep track of data and program dependencies.

• Use a general-purpose compression utility and decompress the data within SAS for sequential access.
  – system dependent (needs a named pipe), sequential dataset storage.
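
A sketch of the second approach for a gzip-compressed text file, using the PIPE device of the FILENAME statement rather than a named pipe. It assumes a Unix-like system with gzip on the path, a SAS session that is allowed to run operating system commands, and a hypothetical file /data/ozone.txt.gz with space-delimited region, siteid, date, and ozone fields.

```sas
/* The command's standard output becomes the input stream.
   The path is hypothetical. */
filename zipped pipe 'gzip -dc /data/ozone.txt.gz';

data ozone;
  infile zipped dlm=' ' truncover;
  input region $ siteid $ date :date9. ozone;
  format date date9.;
run;

filename zipped clear;
```

The data are decompressed on the fly and read once from start to finish, which is why this route only suits sequential access.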

Page 11: Efficient SAS programming with Large Data

Compressing Datasets

Page 12: Efficient SAS programming with Large Data

SAS internal Compression

• Allows random access to the data and is very effective under the right circumstances. In some cases it doesn’t reduce the size of the data by much.

• “There is a trade-off between data size and CPU time”.
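
A minimal sketch of SAS internal compression using the COMPRESS= dataset option. The dataset names and contents are hypothetical; a long character variable that is mostly blank is the kind of data where compression tends to pay off.

```sas
/* Hypothetical dataset with a wide, mostly blank character variable. */
data work.wide;
  length comment $200;
  do id = 1 to 10000;
    comment = 'short text';
    output;
  end;
run;

/* Write a compressed copy; the log reports how much the size changed. */
data work.wide_c(compress=yes);
  set work.wide;
run;

/* The Compressed attribute appears in the PROC CONTENTS output. */
proc contents data=work.wide_c;
run;
```

The same setting can be applied to every dataset created in a session with options compress=yes;. Reading and writing compressed observations costs extra CPU time, which is the trade-off quoted above.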

Page 13: Efficient SAS programming with Large Data

• indata is a large dataset and you want to produce a version of indata without any observations
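
Two ways this is commonly done, sketched with a small stand-in for indata (the variables are made up). Both copy the variable definitions without copying any observations; the output names empty1 and empty2 are hypothetical.

```sas
/* Stand-in for the large dataset. */
data indata;
  length name $10;
  input name $ x y;
  datalines;
alpha 1 2
beta 3 4
;
run;

/* Option 1: read zero observations from indata. */
data empty1;
  set indata(obs=0);
run;

/* Option 2: the SET statement is processed at compile time, so the
   variables are defined; STOP ends the step before anything is read
   or written at execution time. */
data empty2;
  stop;
  set indata;
run;
```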

Page 14: Efficient SAS programming with Large Data

The data step is a two-stage process:

• compile phase

• execute phase

Page 15: Efficient SAS programming with Large Data

Data step logic

Page 16: Efficient SAS programming with Large Data

Data step logic

Page 17: Efficient SAS programming with Large Data
Page 18: Efficient SAS programming with Large Data

data step

Page 19: Efficient SAS programming with Large Data

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

Name       type  size  drop  retain  format  value
patientID  C     6     n     y
gender     C     1     n     y
admit      N     8     n     y       date8.
length     N     8     n     y
discharge  N     8     n     n       date8.
_N_
_ERROR_                                      0

PDV: compile phase

Page 20: Efficient SAS programming with Large Data

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          1
_ERROR_                                      0

PDV: execute phase

Page 21: Efficient SAS programming with Large Data

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0

PDV: execute phase

Page 22: Efficient SAS programming with Large Data

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run; /* implicit output */

Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0

PDV: execute phase

Page 23: Efficient SAS programming with Large Data

Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          2
_ERROR_                                      0

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

PDV: execute phase
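
The PDV walk-through on the preceding pages can be reproduced in the log with PUT _ALL_, which writes every variable in the PDV, including _N_ and _ERROR_. This is a sketch: the first observation uses the values from the slides, the second observation is made up, and the output is written to a hypothetical dataset admits2 so that the input is left unchanged.

```sas
data admits;
  length patientID $6 gender $1;
  input patientID $ gender $ admit length;
  format admit date8.;
  datalines;
321C-4 M 15736 21
118A-2 F 15740 3
;
run;

data admits2;
  set admits;
  /* discharge is missing here on every iteration: it is not retained. */
  put 'After SET:        ' _all_;
  discharge = admit + length;
  format discharge date8.;
  put 'After assignment: ' _all_;
run;
```

Comparing the two log lines for each observation shows discharge being reset to missing at the top of each iteration, while the variables read by SET are simply overwritten by the next observation.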

Page 24: Efficient SAS programming with Large Data

Efficiency: suspend the PDV activities

Page 25: Efficient SAS programming with Large Data

General principles

• Use by processing whenever you can

• Given the data below, for each region, siteid, and date, calculate the mean and maximum ozone value.

Page 26: Efficient SAS programming with Large Data

General principles

• Easy:
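
A sketch of the easy case, assuming a dataset named ozone with variables region, siteid, date, and ozone (the observations below are made up). PROC MEANS with a BY statement computes the statistics one group at a time on sorted data.

```sas
/* Made-up sample in the shape described on the previous slide. */
data ozone;
  input region $ siteid $ date :date9. ozone;
  format date date9.;
  datalines;
East A01 01JUL2006 41
East A01 01JUL2006 55
East A01 01JUL2006 48
East A01 02JUL2006 39
East A01 02JUL2006 44
West B07 01JUL2006 62
West B07 01JUL2006 58
;
run;

/* BY processing requires the data to be sorted (or indexed) by the keys. */
proc sort data=ozone;
  by region siteid date;
run;

proc means data=ozone noprint;
  by region siteid date;
  var ozone;
  output out=daily_stats(drop=_type_ _freq_)
         mean=mean_ozone max=max_ozone;
run;
```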

Page 27: Efficient SAS programming with Large Data

General principles

• Suppose there are multiple monitors at each site and you still need to calculate the daily mean?
  – Combine multiple observations onto one line and then compute the statistics?

• Suppose you want the 10% trimmed mean?

• Suppose you want the second maximum?
  – Use arrays to sort the data?
  – Write your own function?
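
The second maximum can also be obtained with by processing alone, without arrays or a user-written function. This is a sketch on made-up data, not necessarily the approach the original slides took: sort each group by descending ozone, count observations within the group, and keep the second one. The counter name nth and the output name second_max are hypothetical.

```sas
data ozone;
  input region $ siteid $ date :date9. ozone;
  format date date9.;
  datalines;
East A01 01JUL2006 41
East A01 01JUL2006 55
East A01 01JUL2006 48
East A01 02JUL2006 39
East A01 02JUL2006 44
West B07 01JUL2006 62
West B07 01JUL2006 58
;
run;

/* Within each region/siteid/date group, largest ozone value first. */
proc sort data=ozone out=sorted;
  by region siteid date descending ozone;
run;

data second_max;
  set sorted;
  by region siteid date;
  if first.date then nth = 0;  /* restart the counter for each group   */
  nth + 1;                     /* sum statement: nth is retained        */
  if nth = 2 then output;      /* keep the second-largest observation   */
  drop nth;
run;
```

Groups with a single observation produce no output row, and if the two largest values tie, the second maximum equals the maximum; both cases may need separate handling depending on the analysis.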

Page 28: Efficient SAS programming with Large Data
Page 29: Efficient SAS programming with Large Data
