
The Very Basics of SAS (Windows)

Statistical Computing Group @ Research Data Services

University of Pennsylvania Last modified: 12/09/2008

This online workshop is an introductory note for SAS beginners. Our goal is to give you a broad overview of SAS and help lower your “start-up cost” of learning. Our assumption is:

(1) You are just starting to learn SAS and want to know how to do simple data management and statistical tasks in this statistical software, or

(2) You have tried SAS a bit before, but would like to get back into it and/or want to refresh your memories about oft-used basic procedures.

For such SAS-newbie people, this workshop is a good place to start.

SAS is a statistical package very widely used in government, business, and academia. Its strengths include a powerful and flexible data manipulation capability and a wide variety of statistical procedures. Its marketability and data-management/analytic power make SAS a good candidate when you are deciding which statistical software to learn.

* For a comparison of different statistical packages, see John Marcotte’s presentation.

If you come to SAS from SPSS looking for a more comprehensive statistical analytic tool, you will need to change your mindset quite a bit. If you come from Stata, you will notice that the two packages operate in different ways. In this workshop, we will try to help you make this adjustment so that you will feel comfortable about how SAS works and can further learn SAS yourself according to your future research needs.

Here is a folder that contains the data sets we use in this workshop. In the examples below, I assume you place it at “C:\”.

Contents
1. Getting Started
2. How To Read In Data
3. How To (Read And) Modify Data
4. How To Get Descriptive Statistics
5. Analysis Example


http://www.sas.com/index.html

http://www.sas.com/govedu/index.html

http://www.sas.com/industry/index.html

http://www.sas.upenn.edu/computing/ssc/presentations/statistical-software.swf

http://www.ssc.upenn.edu/scg/sas/verybasicsas.zip

1. Getting Started

Quick Interface Tour

Let me first launch SAS and take you on a quick interface tour.

Start > Programs > SAS > SAS 9.x.x (English)

We assume you are using version 9.2 here. Here are your SAS windows below.

[Figure: screenshot of the SAS main window, with its parts labeled 1 to 8]

You have five SAS windows (1 to 5, in red) and three sets of useful tools at the top to perform basic tasks (6 to 8, in orange). Let’s take a look at them one by one.

1. Explorer: In this point-and-click interface window, you have quick access to your libraries (more later) and SAS data files, as well as your commonly used files that you assign shortcuts to (File Shortcuts) and Favorite Folders (your My Documents and Desktop). Further, in SAS 9.1 or above, there is also a link to My Computer. Play with them to get a feel.

2. Results (under the Explorer window; you can make it active by clicking on the “Results” tab): This window should look familiar to SPSS users, as the idea is the same; consider this window as a table of contents for your Output window. It shows you the results tree & list, and by double-clicking on those small icons you can jump to the part of results you want to see in your Output window. Also, you can print out all or part of the results you need by right-clicking on the corresponding result icon and selecting Print.

3. Log: Notes about your SAS session appear in this window. When you run a SAS program to perform your data/computation task, you will see notes, warnings, errors, and your program statements in this window. Note that unlike Stata log files, SAS logs do not include actual computation results (result reports appear in the Output window). It is important to check the SAS log as it includes important information and helps you identify potential/actual problems with your session.

4. Output (under the Log/Editor windows now): Most of your results are displayed here.

5. Enhanced Editor: This is a text editor where you write your SAS program. The Enhanced Editor has been available since SAS version 8.0. As you will see later, the nicest thing about this editor is that it color-codes elements of a SAS program, such as procedures, keywords, informats/formats, dates, numeric/string constants, etc., to help you write programs without basic but critical errors. For example, you can immediately see it if you forget to close your quotes or to place semicolons. We will see more about this later.

And some tools to perform basic tasks.

6. Pull-down Menu: They are just standard Windows pull-down menus. Some are general, such as open/save files, cut/paste/copy, and print, whereas others are SAS-specific, such as those under “Run.” More later.

7. SAS Command Bar: You can type in SAS commands here to do simple tasks (most of which can also be performed through the pull-down menus). For example, type in “help” and SAS instantly brings you the help menu. As you can see in the figure above, I also have a command line at the top of each SAS window (i.e., Command===>), as I personally find it convenient to have it everywhere. It’s a matter of personal preference. You can turn these Command===> lines on and off by typing “command” in the command bar (7) or by going through the pull-down menus, Tools > Options > Preferences… > View, and checking the box “Command line.” Feel free to play with it.

8. Tool Bar: Some oft-used commands also available through the pull-down menu are here with quick button access. The help button is greatly useful. You can search for SAS commands, their specific statements and options, and examples.

Before moving on to data work in SAS, there are a couple of basic but important ideas you need to know about how SAS operates.

• Libraries
• Two building blocks of SAS programs (big picture of how a SAS program is constructed).
• Programming rules.
• Temporary/permanent SAS files.

Let’s go over each of these.


Libraries

In SAS, your SAS data sets are stored in a library. All you need to do to create and use your library is decide where you store your SAS data sets (usually the directory of your research project folder), give it a nickname, and tell SAS the location/nickname information. Consider a SAS library as a “link” that connects you to your project folder location. Once you create a library, you can also tell SAS to use that link and find your data set stored there.

Let’s go to the SAS Explorer, make the Explorer tab active, and double-click the small, yellow file cabinet icon at the upper left corner named “Libraries.”

Then you’ll see file drawer icons under “Active Libraries” (below). These icons represent libraries and each icon’s name is their nickname called libref (= library reference name).

Those four libraries you see there are there by default. You can add your own, so that you can save your SAS files there. To do so, tell SAS a libref you make up and what location it refers to. Suppose, for example, you are starting a project about performance of foreign/domestic makers’ cars and have created a project folder “prj_auto” located at “C:\verybasicsas” (or whatever drive you put this folder at). And suppose that to do this project work in SAS, you want to create a library named “auto” that links to the folder “C:\verybasicsas\prj_auto.”


There are two different approaches to create your library.

1. Using the point-and-click approach.
2. Writing SAS statements.

Let’s start with the point-and-click approach. Two routes to go: you can (1) right-click on the SAS Explorer window to bring out the pull down menu (below), and select New…, or (2) click on the file cabinet icon located in the tool bar.


Either way, New Library dialogue box shows up (below).

As planned, we name this library “auto,” which links to the location “C:\verybasicsas\prj_auto”, and because we know we will keep working on this project for quite some time to come (well, pretend so!), we also check the box “Enable at startup” so that every time we launch SAS this library is also activated. Then click OK. Now you should be seeing your new library “Auto” in the SAS Explorer (below).


This library links to the location “C:\verybasicsas\prj_auto,” and the libref “Auto” serves as a nickname of this location. Double-click on the icon, and you see its contents, in this case a SAS data set “autodata.”

You can double-click on the “Autodata” data set to open and see the content in a spread-sheet format. To use this data set in data manipulation/analysis, you need to tell SAS where it is. To do so, tell SAS this libref, period (“.”) then the data set name, like this:

    auto.autodata

Then SAS knows “C:\verybasicsas\prj_auto” is the place to search for your “autodata” data set (we’ll see how this actually is used in the next section, right after this). This is also how to name and save your data set when you create a data set “autodata” in a library named “auto.” We’ll talk more about this later. To move one level back up in the SAS Explorer, use the Up One Level icon at the tool bar.

The same task can be performed by writing a brief SAS statement as well. To write SAS statements, type in your SAS Enhanced Editor:

    libname auto "c:\verybasicsas\prj_auto";
    run;

What this program does should be obvious; it tells SAS to create a library named “auto” that links to the location “c:\verybasicsas\prj_auto.” The statement run tells SAS to execute the above line. (Strictly speaking, libname is a global statement that takes effect as soon as it is submitted, so the run here is harmless but optional.) Highlight the two lines, then click on the runner icon at the tool bar (yes, “Run”).
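One way to confirm that the library was assigned is sketched below. It assumes the workshop folder sits at c:\verybasicsas as described above; the LIST and CLEAR forms of libname and the PROC CONTENTS call are additions for illustration and are not part of the original program.

```sas
/* Assign the library (path assumed from the workshop setup) */
libname auto "c:\verybasicsas\prj_auto";

/* Write the library's engine and physical path to the Log */
libname auto list;

/* List the SAS data sets stored in the library (NODS = names only) */
proc contents data = auto._all_ nods;
run;

/* Optional: release the libref when you are done.
   This removes only the nickname, not the files in the folder.
libname auto clear;
*/
```

If the path is mistyped, the Log should carry a note that the library does not exist, which is a good reason to check the Log right after submitting a libname statement.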


OR, alternatively, type in “submit” (you can shorten it to “sub”) into the command bar of the Enhanced Editor and hit your keyboard’s Enter key.


If you come from Stata, you probably have different directories for different projects and move from one to another with the command –cd–. Although SAS’s libraries are a somewhat similar idea in that you set up a location (library) to store your SAS data sets, there are major differences. First, once you have set up your libraries, you simply mention their names (librefs) thereafter, and SAS understands where your file is located and lets you access it. Second, while in Stata you need to change directories every time you need to access files in different locations and load one data set at a time into Stata’s memory, SAS libraries allow you to simultaneously open and access multiple data sets from different libraries (hence, from different file locations).

Two Building Blocks of SAS Programs: DATA and PROC

Let’s start talking about SAS programs. The first step is to understand this: SAS programs consist of two building blocks: (1) DATA steps and (2) PROC steps. Each of these blocks consists of SAS statements. Usually, in SAS programs,

(1) DATA steps (a) read and (b) modify data and write it into a SAS data set.
(2) PROC steps use SAS data sets thus created to conduct analysis and produce reports.

Suppose, for example, that I have a SAS data set named “autodata” in my library “auto,” type the following SAS program in my Enhanced Editor, and submit it to SAS.

    * Create SAS data set "ex1";
    Data ex1; set auto.Autodata;
    /* To metrics, to kg & to cm */
    wgtkg = weight * 0.454;
    _lngcm = length * 2.54;
    * Descriptive stats of the new vars;
    PROC means data = ex1;
    var Wgtkg _lngcm;
    run;

Building Block 1: DATA step. The program is to create a SAS data set “ex1” from my SAS data set “autodata” in the library “auto.” First, start with a data statement using the keyword DATA; then tell SAS to read “auto.Autodata,” modify it by creating two new variables, and write it into a SAS data set “ex1.”

Building Block 2: PROC step. The program is to obtain some descriptive statistics using the SAS data set “ex1.” Start with a procedure statement using the keyword PROC. The name of the procedure to perform is MEANS. Use the SAS data set the DATA step above created (“ex1”). Use the newly created variables for this procedure. Conclude it with the RUN statement.

Let’s examine what is done in the example above.

Building block 1, DATA step: DATA steps start with a DATA statement.

In this step I first read an existing SAS data set named “autodata,” in my SAS library “auto” (set the data set “auto.autodata”), then modify it by creating a metric version of the variables “weight (originally in lbs)” and “length (originally in inch),” and finally write the existing data values and the newly created variables into a new SAS data set named “ex1” (the keyword data, immediately followed by the new data set name I assign). Notice that there isn’t any command in SAS like “compute (SPSS)” or “generate (Stata)” to create new variables here; this is a DATA step whose primary purpose is to read and modify data.


Building block 2, PROC step: PROC steps start with a PROC statement.

Now, let’s move on to a PROC step and use the newly created SAS data set “ex1” for analysis. We start with the keyword PROC immediately followed by a specific procedure name. The procedure our example uses is means, which gives you descriptive information about your data set. Then options, if any, follow the procedure name. data = is an option to specify the name of the data set you want to use. If omitted, SAS uses the most recently created (NOT most recently used) SAS data set (so, although we explicitly specify our new SAS data set for this procedure [data = ex1], in this case it is actually unnecessary). We get information on the two new variables, i.e., weight in kilograms and length in centimeters (var Wgtkg _lngcm).

The PROC step ends when SAS encounters a run statement. RUN statements are unique in that they are not part of either DATA steps or PROC steps (another example of such SAS statements is libname, as we already saw). RUN statements just notify SAS of the end of a step and tell it to run all the preceding lines of the step. Here, SAS sees run and knows that the PROC step ends and hence needs to be executed.

You might wonder, shouldn’t we have added run at the end of the DATA step above so that SAS can see the end of the step? We could. But it’s redundant in this case, because another way that SAS knows the end of a step is to encounter a new step. So, in the above, SAS encounters a PROC step right after a DATA step, which lets SAS know the DATA step ends.
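The two building blocks can also be tried without the workshop files. The sketch below swaps in SASHELP.CLASS, a small sample data set that normally ships with SAS (an assumption about your installation; it is not part of the workshop folder), whose height and weight variables are, as far as I know, recorded in inches and pounds. The names class_metric, wgtkg, and hgtcm are made up for illustration.

```sas
/* DATA step: read an existing SAS data set and add two variables */
data class_metric;
    set sashelp.class;
    wgtkg = weight * 0.454;   /* pounds to kilograms */
    hgtcm = height * 2.54;    /* inches to centimeters */
run;                          /* optional here, but marks the end of the step */

/* PROC step: descriptive statistics of the new variables */
proc means data = class_metric;
    var wgtkg hgtcm;
run;
```

Ending every step with an explicit run is a matter of style; it costs nothing and makes it easier to see in the Log where each step begins and ends.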

This is the basic SAS program construction you first need to understand. Actually, as you learn more about SAS, you will see quite a lot of exceptions; in some cases DATA steps can generate reports whereas PROC steps can output SAS data sets (in fact, at the very end of this workshop, we will see one example where a PROC step creates a SAS data set). Still, this two-step construction is the key mindset of the SAS program you need to know first. You can go from there for exceptions.

Programming Rules

Now that we’ve got the big picture, we will look into a bit more detail. Let’s continue to use the same simple SAS program example above to learn key SAS programming rules.

1. Each and every SAS statement must end with a semicolon.

SPSS syntax users should already have a similar mindset, but Stata users, who don’t use any delimiter, beware! Thanks to the Enhanced Editor’s assistance, users of SAS 8.0 or above can catch omitted semicolons more easily than before. Compare the program below with the above example. See the difference?

    data ex1
    set auto.Autodata;
    /* To metrics, to kg & to cm */
    wgtkg = weight * 0.454;
    _lngcm = length * 2.54
    run;

Semicolons are missing in two places: after “data ex1” and after “_lngcm = length * 2.54”.


The statement starting with “set” and the “run” statement appear in normal text, instead of the blue color codes. That’s a big clue that something is wrong. In this case, two needed semicolons are missing, which of course is an error.

Another thing to note is that because it is semicolons that mark each statement’s end, your SAS statement can continue over multiple lines, or conversely, multiple SAS statements can be on the same line. Look at the DATA step in the first example above.

    Data ex1; set auto.Autodata;
    [in-between lines omitted...]
    _lngcm = length * 2.54;

In SAS, you can make things look messy like that and the program still works. Of course, being messy like that is not recommended. Be organized and consistent in your programming, because in research settings, your programs are also important documentation.
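Because the semicolon, not the line break, ends a statement, the two layouts sketched below should behave identically. The example assumes the SASHELP.CLASS sample data set that normally ships with SAS; the data set names tidy and messy and the variable hgtcm are hypothetical, and PROC COMPARE is used here only as a convenient check.

```sas
/* One statement per line */
data tidy;
    set sashelp.class;
    hgtcm = height * 2.54;
run;

/* Same statements: two share a line, one is split across lines */
data messy; set sashelp.class;
hgtcm = height
        * 2.54;
run;

/* PROC COMPARE reports whether the two data sets differ */
proc compare base = tidy compare = messy;
run;
```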

2. SAS is case-insensitive.

Whether you use upper-, lower-, or mixed-case, SAS does not care. In the first example above, the same keyword appears as “Data” in the DATA step and as “data” in the PROC step, and the weight variable appears in lowercase “wgtkg” in the DATA step and in mixed-case “Wgtkg” in the PROC step, but it does not matter. That said, there is one instance where case makes a little difference, and that is when SAS prints variable names in outputs. Let’s see the following output that the PROC step in the example produces.

    The MEANS Procedure

    Variable     N           Mean        Std Dev        Minimum        Maximum
    --------------------------------------------------------------------------
    wgtkg       74        1370.83    352.8458795    799.0400000        2197.36
    _lngcm      74    477.3483784     56.5565034    360.6800000    591.8200000
    --------------------------------------------------------------------------

Look closely. The weight in kilograms is in mixed-case “Wgtkg” in the program for the procedure. Yet, the output the PROC step produces shows it as “wgtkg.” This is because, when it comes to printing, SAS sticks to the case used when the variable was first created. In our case, it is typed “wgtkg” when this weight variable is created in the DATA step, and SAS keeps using it when printing. Again, it does not affect computation results, but for organizational purposes, you had better be consistent throughout your program, log, and output.

3. SAS data set names and variable names must start with a letter or an underscore (_).

As for the latter, the length in centimeter variable is named _lngcm above, for example, and that works perfectly fine. Again, just make things consistent and organized in your programming.

4. SAS data set names and variable names can include letters, numbers, or underscores, but cannot include special characters (i.e., ~!@#$%^&*<>?=+\|()[{}]). Simply, SAS does not understand special characters in SAS data set or variable names.


5. SAS data set names and variable names must be 32 characters or fewer in length.

6. You can add comments to your SAS programs. As you can see in the example, there are the following two ways.

a. Start with an asterisk (*), place your comment, then end with a semicolon (;).

    * Like this;

b. Start with a slash and asterisk (/*), place your comment, and then close it with an asterisk and slash (*/).

    /* Like this */
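The rules above can be exercised in one short program. This is only a sketch: it assumes the SASHELP.CLASS sample data set that normally ships with SAS, and the names _rules_demo1 and Hgt_Cm are hypothetical, chosen to show an underscore-first name, a digit inside a name, and mixed case.

```sas
* Rule 6a: a comment that starts with an asterisk and ends with a semicolon;
/* Rule 6b: a comment enclosed between slash-asterisk and asterisk-slash */

data _rules_demo1;              /* starts with an underscore, contains a digit */
    set sashelp.class;
    Hgt_Cm = height * 2.54;     /* variable first created in mixed case */
run;

PROC MEANS DATA = _RULES_DEMO1; /* keywords and names typed in upper case */
    VAR hgt_cm;                 /* referred to in lower case */
RUN;
```

Following rule 2, the output should label the variable as Hgt_Cm, the case in which it was first created, even though the VAR statement spells it in lower case.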

Temporary/permanent SAS data sets

We created a new SAS data set “ex1” in the example above. Notice, however, that we started our SAS program with the following SAS statement…

    Data ex1;

We only told SAS the data set name we wanted to make up, but we didn’t mention in what library (i.e., location) we wanted SAS to create and store this new data. So the obvious question is, where is the data we just created? Where in the SAS Explorer can we find this new data set? The answer is, it is in the default library “Work.” In the active libraries window, double-click on the library “Work,” and you find our data set “Ex1.”

Thus, precisely put, it is “work.ex1.” With its libref unspecified (i.e., Data ex1;), however, SAS automatically assumes that the file is to be created in this library “Work,” so we don’t have to mention the library name. This library is linked to SAS’s temporary work directory (you can locate it by right-clicking on the “Work” library icon and selecting “Property”). SAS uses the location to store SAS data sets during a session, but it is temporary, and hence once you close SAS to end the session, the SAS data sets in the “Work” library are all deleted. Thus, files saved there are temporary SAS data sets.

To save your SAS data set permanently (permanent SAS data sets), create your library and store the data set there by specifying it as libref.SASDataSetName. You can move SAS data sets to another library in the Explorer by copying and pasting. Try copying Ex1 onto the library Auto. If you do this, the data is permanently accessible by saying “auto.ex1.”
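The same distinction can be written out in a program instead of copying and pasting in the Explorer. The sketch below assumes the workshop folder is at c:\verybasicsas and that “auto.autodata” exists there, as described earlier.

```sas
/* Library for the project folder (path assumed from the workshop setup) */
libname auto "c:\verybasicsas\prj_auto";

/* One-level name: stored as work.ex1 and deleted when the session ends */
data ex1;
    set auto.autodata;
    wgtkg = weight * 0.454;
run;

/* Two-level name: stored in the project folder and kept after SAS closes */
data auto.ex1;
    set work.ex1;
run;
```

The only difference between the two DATA statements is the libref in front of the period, which is what decides whether the data set is temporary or permanent.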

2. How To Read In Data

We have two subsections in this section.

• Reading SAS system files
• Reading external data files

In this and the next sections, we will focus primarily on the first building block of the SAS program, DATA steps. As we just learned in the previous section, DATA steps create SAS data sets by doing two things, (a) reading and (b) modifying data. Our discussion in this section is about (a), how to read data in SAS (and write it into SAS data sets). SAS is very flexible about data reading, but we will cover only the very basics. We will first go over how to bring SAS system files into SAS. Then we will learn how to read in external files, i.e., ASCII files and files created by other applications. We create SAS data sets basically by reading data through a DATA step, but we will also touch upon how to use PROC IMPORT.

Reading SAS system files

This is effectively a quick review of what we just learned in the example of the previous section. If you already have your data in the SAS system file format (like “autodata” above), what you need to do is create a library for the file location so that SAS knows where to find your SAS data set. Let’s take this opportunity to do just one more practice. Suppose you want to work with the SAS system file “lifeexp,” which is located in the directory “C:\verybasicsas\prj_life.” The first thing to do is give this directory a libref and create a library.

    libname life "c:\verybasicsas\prj_life";
    run;

Now you should see a new library nicknamed “life” in the Explorer. Double-click on the icon and you will see the SAS data set “lifeexp.” To create another SAS data set from “lifeexp,” we use the set statement in a DATA step, as we saw a moment ago. To use this data set in procedures, tell SAS what the data set’s name is and where it is by mentioning the library name and the file name. For example,


    proc means data = life.lifeexp;
    run;

Reading external data files

Now, how about reading external data (i.e., non-SAS system files) and writing it into a SAS data set? We will cover two basic input methods: list input and column input. We will go over the following five cases:

(1) How to read in space-separated data through a DATA step? (List input)
(2) How to read in delimiter-separated files through a DATA step? (List input)
(3) How to read in column-arranged files through a DATA step? (Column input)
(4) How to read in Excel (or other application) files through the IMPORT procedure?
(5) How to read data lines from your Editor?

Let’s start with reading ASCII files ((1) to (3)).

(1) How to read in space-separated data through a DATA step?

We have a raw data file named “cancer_space.dat” in the “verybasicsas” folder. It has four variables, “studytime,” “drug,” “age,” and “died,” in the order that appears in the data file. The data looks like this.

    1 1 . Yes
    1 1 65 Yes
    2 1 59 Yes
    3 1 . Yes
    4 1 56 No
    4 1 67 Yes
    5 1 63 Yes
    ...

As you can see, (a) this data file is cleanly separated by spaces. It has one character variable, but (b) there is no embedded space in its data values. Also, (c) missing values are clearly indicated by periods. When all these conditions are met, list input is the way to go. Let’s create a SAS data set named “cancer_space” in our library “life.” Let’s also print the first five observations to see if we did the job right.

    data life.cancer_space;
    infile 'c:\verybasicsas\cancer_space.dat';
    input studytime drug age died $;
    proc print data = life.cancer_space (obs=5);
    run;

We first state “data life.cancer_space;” and then tell SAS from what raw data file we create this “life.cancer_space” data set. Instead of the set statement (which is used to read in existing SAS data sets, as we just learned in the previous sub-section), we need to use the infile statement to tell SAS the filename and file path. The input statement tells SAS how to read the data. Here, we simply list variable names in the order they are in the data file (thus, list input). The $ sign after the variable “died” tells SAS that the variable is a character variable. Here’s the output of the first five observations.

    Obs    studytime    drug    age    died
      1            1       1      .    Yes
      2            1       1     65    Yes
      3            2       1     59    Yes
      4            3       1      .    Yes
      5            4       1     56    No

(2) How to read in delimiter-separated files through a DATA step?

Delimiter-separated files are raw data files in which the values in each row are separated with specific delimiter characters. List input is used again, but as we will see, there are fewer conditions to meet for reading this type of data than for space-separated data. We discuss how to get into SAS the two most common types of delimited files here, i.e., comma-separated-value and tab-delimited files. We have a data file “lifeexp_csv.dat” in the “verybasicsas” folder. It has 68 observations. Here is part of the data below.

    1,Albania,1.200000048,72,810,76
    1,Armenia,1.100000024,74,460,
    1,Austria,0.400000006,79,26830,
    1,Azerbaijan,1.399999976,71,480,
    1,Belarus,0.300000012,68,2180,
    1,Belgium,0.200000003,78,25380,
    1,Bosnia and Herzegovina,-0.5,73,,
    1,Bulgaria,-0.400000006,71,1220,
    1,Croatia,-0.100000001,73,4620,63
    ...
    1,"Yugoslavia, FR (Serb./Mont.)",0.5,72,,

There are three points you need to note about the above data before getting down to work.

1. Each data value is delimited by commas.

2. Notice that Yugoslavia, FR (Serb./Mont.), which includes a comma, is enclosed in quotes. Also notice that the lines for Bosnia and Herzegovina and for Yugoslavia differ from the others in one respect: they have two commas in a row (e.g., “73,,”). In fact, these observations have the last two variables missing (compared to, for example, Armenia, which has only the last variable missing).

3. Although the content is the same as “life.lifeexp” above and we know it includes six variables named region, country, popgrowth, lexp, gnppc, and safewater, this “lifeexp_csv.dat” does not include a variable name header. The variable “country” is a string (character) variable.


To read this data, we need to tell SAS the above data information. Let’s see how to do so and create a SAS data set named “lifeexp_csv” in our library “life.” We first try this.

    data life.lifeexp_csv;
    infile 'c:\verybasicsas\lifeexp_csv.dat' dlm=',' dsd missover;
    input region country $ popgrowth lexp gnppc safewater;
    proc print data = life.lifeexp_csv (obs=9);
    run;

Again, we first state “data life.lifeexp_csv;” and use the infile statement to tell SAS the filename and file path in the case of external files. Now, by using three options following this infile statement, we deal with the first two of the three points we noted above.

1. dlm=',' tells SAS that our file uses commas as the delimiter.

2. The dsd (delimiter-sensitive data) option does three things to tackle the second point.

a. It tells SAS to ignore delimiters in data values enclosed in quotes. Thus, SAS does not treat the comma in “Yugoslavia, FR (Serb./Mont.)” as the delimiter. That’s exactly what we want.

b. It also tells SAS to ignore quotes when it reads in the data value. Thus, SAS does not read the double quotation marks of “Yugoslavia, FR (Serb./Mont.)”. Again, this is what we want.

c. It also tells SAS to treat two consecutive delimiters as a missing value. As we noted, the observations for Bosnia and Herzegovina and for Yugoslavia have the last two variables missing, so this is how we want SAS to read the data.

It’s a good idea to add the option missover whenever there may be missing data at the end of the data lines. This option tells SAS not to go to the next data line, and instead to assign missing values, if it reaches the end of the data line while there are more variables waiting for data values to be assigned.
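The effect of missover is easiest to see on a tiny made-up example. The sketch below types three hypothetical data lines directly into the program with the datalines statement, so it does not need any external file; the data set names flow and miss and the variables a, b, and c are invented for illustration.

```sas
/* Default behavior: when a line runs out of values,
   SAS reads on into the next line to fill the remaining variables */
data flow;
    infile datalines;
    input a b c;
    datalines;
1 2 3
4 5
6 7 8
;
run;

/* MISSOVER: stay on the current line and set the remaining variables to missing */
data miss;
    infile datalines missover;
    input a b c;
    datalines;
1 2 3
4 5
6 7 8
;
run;

proc print data = flow; run;
proc print data = miss; run;
```

Without missover, I would expect the short second line to borrow its third value from the line below it, so that “flow” ends up with fewer observations than there are data lines, and the Log to carry a note that SAS went to a new line. With missover, each data line should stay one observation.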

The input statement following those infile options tackles the third point. We simply list variable names in the order they are in the data file, and also tell SAS that the variable “country” is a character variable by using a $ sign. Then I print the first nine observations of the data just read in to see if everything is okay.

    Obs    region    country     popgrowth    lexp    gnppc    safewater
      1         1    Albania       1.20000      72      810           76
      2         1    Armenia       1.10000      74      460            .
      3         1    Austria       0.40000      79    26830            .
      4         1    Azerbaij      1.40000      71      480            .
      5         1    Belarus       0.30000      68     2180            .
      6         1    Belgium       0.20000      78    25380            .
      7         1    Bosnia a     -0.50000      73        .            .
      8         1    Bulgaria     -0.40000      71     1220            .
      9         1    Croatia      -0.10000      73     4620           63


It seems that the data is correctly read in, except that some character values of the variable “country” are truncated (e.g., “Azerbaij” and “Bosnia a”). This is because with list input, the length of a character variable is set to eight by default. Let’s modify our program a bit.

    data life.lifeexp_csv;
    infile 'c:\verybasicsas\lifeexp_csv.dat' dlm=',' dsd missover;
    length country $30;
    input region country $ popgrowth lexp gnppc safewater;
    proc print data = life.lifeexp_csv (obs=9);
    run;

Here, the length statement is used before the input statement. It overrides the length specified in the input statement (in this case, the default length). Note that “country” now comes first in the data set, because the length statement is where SAS first encounters the variable. Here is our new output.

    Obs    country                   region    popgrowth    lexp    gnppc    safewater
      1    Albania                        1      1.20000      72      810           76
    {…output omitted for the sake of space}
      4    Azerbaijan                     1      1.40000      71      480            .
    {…output omitted for the sake of space}
      7    Bosnia and Herzegovina         1     -0.50000      73        .            .

Another frequently used delimiter is the tab. If your data is tab-delimited, you need to specify the delimiter as dlm='09'X. This '09'X is the hexadecimal code of the tab character in ASCII.

    data life.lifeexp_tab;
    infile 'c:\verybasicsas\lifeexp_tab.dat' dlm='09'X dsd missover;
    length country $30;
    input region country $ popgrowth lexp gnppc safewater;
    proc print data = life.lifeexp_tab (obs=9);
    run;
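The delimited-file pattern can also be tried without the external file. The sketch below is a self-contained variant of the comma-delimited program: it reads four of the data lines shown earlier from the program itself through the datalines statement, and writes to a temporary data set whose name, lifeexp_demo, is made up for illustration.

```sas
data lifeexp_demo;
    /* Same options as for the external file, but reading in-program data lines */
    infile datalines dlm=',' dsd missover;
    length country $30;
    input region country $ popgrowth lexp gnppc safewater;
    datalines;
1,Albania,1.200000048,72,810,76
1,Armenia,1.100000024,74,460,
1,Bosnia and Herzegovina,-0.5,73,,
1,"Yugoslavia, FR (Serb./Mont.)",0.5,72,,
;
run;

proc print data = lifeexp_demo;
run;
```

Removing dsd or the length statement from this small version and rerunning it is a cheap way to see what each one contributes before working on the full file.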

(3) How to read in column-arranged files through a DATA step?

If each data value is arranged in columns, then you can get SAS to use that information to read your data. This way of reading raw data is called column input. This file format typically comes with a codebook, which explains where in the data each of the variables is placed. We have a data file named “lifeexp_column.dat,” which has the same content as “life.lifeexp” but is an ASCII file arranged in columns. Here is your codebook.

    Variable     Column number
    region       1
    country      3-30
    popgrowth    32-42
    lexp         44-45
    gnppc        46-50
    safewater    51-53


And here is a snapshot of the data. Look how the gauge at the top corresponds to the column positions described in the codebook.

    1-------10--------20--------30--------40--------50--------60
    1 Albania                      1.200000048 72  810 76
    1 Armenia                      1.100000024 74  460
    1 Austria                      0.400000006 7926830
    1 Azerbaijan                   1.399999976 71  480
    1 Belarus                      0.300000012 68 2180
    1 Belgium                      0.200000003 7825380
    1 Bosnia and Herzegovina       -0.5        73

In the column specification approach, we can tell SAS the above information from the codebook. Run the below lines and you will get the same output as this. data life.lifeexp_column; infile 'c:\verybasicsas\lifeexp_column.dat' missover; input region 1 country $ 3-30 popgrowth 32-42 lexp 44-45 gnppc 46-50 safewater 51-53; proc print data = life.lifeexp_column (obs=9); run; Note that in this case the missover option is a must, because you have missing values at the end of some lines and hence they end earlier than the others. And no delimiter to indicate those missing values. Another approach to read a column-arranged data file is to use column pointers and informats. The same lifeexp_column.dat can also be read by the below program. data life.lifeexp_column; infile 'c:\verybasicsas\lifeexp_column.dat' missover; input @1 region 1. @3 country $28. @32 popgrowth 11. @44 lexp 2. @46 gnppc 5. @51 safewater 3.; run; The @n symbols are column pointers. @32 tells SAS that the variable country starts at column 32, for example. The variable names are each followed by informats, which tell SAS how to read data value into SAS variables. There are three major types of SAS informats.


    Type                   Basic form     Note
    Character informats    $INFORMATw.    $w. (without a SAS informat name) reads standard character data. Don’t forget the period at the end!
    Numeric informats      INFORMATw.d    w.d (without a SAS informat name) reads standard numeric data. Don’t forget the period at the end!
    Date/time informats    INFORMATw.     Don’t forget the period at the end!

In the basic form, the $ indicates a character informat. INFORMAT is the SAS informat name (see the notes in the table above for the two cases where no informat name needs to be specified). The w is the width of the variable, and the d, for numeric data, specifies the number of digits to the right of the decimal point. SAS has a variety of system informats to read character, numeric, and date/time data values that are not in standard formats (here is a list of the built-in informats from SAS documentation).

In the above example program, we use the informat $w. to read the variable “country” as a character variable of width 28. The rest are all numeric variables read with the informat w.d. Four of the five numeric variables have no decimal point, so only the w. part is specified. You don’t need the d part for the “popgrowth” variable either: SAS inserts a decimal point only if it does not find one in the specified w columns, and the popgrowth values already contain a decimal point. So only its total width, 11, is specified (11.).

Data arranged in columns has some advantages over space-separated or delimiter-separated data: because you specify the columns where data values are located, you control which data values SAS reads. We will discuss this point in the next section.

(4) How to read in Excel (and other application) files through the IMPORT procedure?

Now that we have covered the basics of getting ASCII files into SAS through the DATA step, let’s learn how to read in Excel files by using the IMPORT procedure. We will try the SAS import wizard first. As you will see shortly, the wizard can handle files created by various applications. We will then move on to see how SAS programs do the same. Suppose we want to read lifeexp.xls into the SAS system. From the pull-down menu bar, choose File > Import Data…

This brings up the SAS import wizard. Drop down the source list and you can see various file types the wizard can handle. We are reading an Excel file, so choose an appropriate .xls format from the list, and you will be prompted to specify the .xls file path you want to import. Tell SAS where “lifeexp.xls” is located (below). Click OK.


http://support.sas.com/documentation/cdl/en/lrdict/59540/HTML/default/a001239776.htm

Then you are asked which spreadsheet you want to import. In this case, we only have one sheet, “lifeexp.” Click Next. Then you are asked into which library you want to import this Excel file and what you want to call your new SAS data set. Let’s import this data into our library “life,” and call the new SAS data set “lifeexp_xls” (just to differentiate it from the data set “lifeexp” that is already there).

Click Next. Then SAS asks you if and where you want to save this work in a SAS program.

Even when we are using SAS’s wizards, SAS under the hood operates on SAS programs. In this case, the Import Wizard is writing PROC IMPORT statements for SAS to execute. We can save the program in our Editor, so that we document it and can reuse it later as needed. Type in the path of your Editor file and click Finish. You should now see a new SAS data set named “lifeexp_xls” in the library “life.” Also, at the bottom of your Editor, the following SAS program should appear. Let’s take a look at it.

    PROC IMPORT OUT= LIFE.LIFEEXP_XLS
                DATAFILE= "c:\verybasicsas\lifeexp.xls"
                DBMS=EXCEL REPLACE;
        RANGE="lifeexp$";
        GETNAMES=YES;
        MIXED=YES;
        SCANTEXT=YES;
        USEDATE=YES;
        SCANTIME=YES;
    RUN;

The OUT= and DATAFILE= options (the part highlighted in the original handout) are the most basic form of PROC IMPORT. They simply tell SAS about the input data (i.e., the external file location and name) and the output data (i.e., your resultant SAS data set). That’s all this procedure does for you.

1. OUT= SASdataset identifies the output SAS data set. If you want SAS to create/replace a permanent SAS file, you also need to specify your libref, as in the program above. Otherwise SAS creates your output file in the WORK library.

2. DATAFILE= "filename" specifies the external data file path.

The order of these two options can be reversed. The wizard writes out all the other options as well in the program pasted above, but in this specific example OUT= and DATAFILE= are all you need to read the data, and you don’t have to type the other statements if you write the program yourself. Among those statements, I’ll explain a couple that are used relatively often and hence that you should know for now. * For those interested in learning more details about those options, I would refer you to this NESUG paper.

1. DBMS= filetype specifies the file type to import. Here, SAS specifies it as EXCEL in the above program. In this case, however, it was actually unnecessary: SAS saw the extension of the file given in DATAFILE= "filename" and already knew the file type was Excel without this option. The REPLACE option overwrites an existing SAS data set of the same name, if there is one.

2. GETNAMES= is set to “yes” by default, meaning that PROC IMPORT by default treats the first line of the data as the header (variable names). That was the case with our Excel file, and because it is the default setting, we actually didn’t have to specify this option. If the first line of your data does not hold variable names, you need to explicitly specify GETNAMES= no. In that case, SAS assigns default variable names such as var1, var2, var3….
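As a minimal sketch of the points above, here is what a hand-typed PROC IMPORT could look like. The first step uses only the workshop file and the “life” library; following the text above, it leaves DBMS= to be inferred from the .xls extension (add DBMS=EXCEL if your installation asks for it). The second step assumes a hypothetical workbook “lifeexp_nohdr.xls” with no header row, which is not part of the workshop folder, just to show GETNAMES=NO.

```sas
/* Minimal form: output data set, input file, and REPLACE so the step can be rerun */
proc import out = life.lifeexp_xls
            datafile = "c:\verybasicsas\lifeexp.xls"
            replace;
run;

/* Hypothetical workbook without a header row (assumed file, for illustration only) */
proc import out = work.nohdr
            datafile = "c:\verybasicsas\lifeexp_nohdr.xls"
            dbms = excel replace;
    getnames = no;     /* row 1 is data, not variable names */
run;

/* Show the variable names SAS assigned automatically */
proc contents data = work.nohdr;
run;
```

Running proc contents after an import is a quick way to check which names and types SAS actually chose.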


http://www.nesug.org/proceedings/nesug06/io/io04.pdf

We learned to read ASCII files through DATA steps, but you can also use PROC IMPORT to read those files. When SAS sees in DATAFILE= "filename" a comma-separated-value file with the extension .csv, it understands this as DBMS = csv, and when it sees a tab-delimited file with the extension .txt, it understands this as DBMS = tab. For anything else, set DBMS = DLM. SAS by default treats DBMS = DLM as space-delimited, so you need to add the statement DELIMITER="delimiter character" to specify any other delimiter.

(5) How to read data lines from your Editor?

Suppose you have the following data lines, seven observations and four variables:

    1 1 . Yes
    1 1 65 Yes
    2 1 59 Yes
    3 1 . Yes
    4 1 56 No
    4 1 67 Yes
    5 1 63 Yes

And here is what you do to read this data directly from the Editor.

    data life.cancer_direct;
        input studytime drug age died $;
        datalines;
    1 1 . Yes
    1 1 65 Yes
    2 1 59 Yes
    3 1 . Yes
    4 1 56 No
    4 1 67 Yes
    5 1 63 Yes
    ;
    run;

Instead of the infile statement, which points to an external file, you use the datalines statement (or, alternatively, cards), followed by the data lines you want to bring in. The semicolon that ends the data goes on its own line after the last data line.
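To go back for a moment to PROC IMPORT and delimited files: here is a sketch for “lifeexp_csv.dat”, the comma-separated file used in the DATA step examples of this workshop. Its extension is .dat, so by the rule above it needs DBMS=DLM plus a DELIMITER= statement. I assume the file has no header line (the DATA step examples read it from the first line), and the output name “lifeexp_dlm” is made up for illustration.

```sas
/* Comma-separated file with a non-standard extension:
   state the file type and the delimiter explicitly */
proc import out = work.lifeexp_dlm                 /* hypothetical output name */
            datafile = "c:\verybasicsas\lifeexp_csv.dat"
            dbms = dlm replace;
    delimiter = ',';
    getnames = no;      /* assumed: the first line is data, not a header */
run;

proc print data = work.lifeexp_dlm (obs=9);
run;
```

Because no header is read, the variables get default names, so the DATA step with an input statement remains the better choice when you want to name the variables yourself.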

3. How To (Read And) Modify Data

Here are four topics we will cover in this section.

• How to create/recreate variables?
• How to subset your data?
• How to combine your SAS data sets?
• How to label and format your variables?


We will continue to talk about the DATA step in this section. As we have learned, DATA steps typically (a) read and (b) modify data and write it into a SAS data set. In this section, we will discuss (b) along with (a), that is, how to work further on your data in a DATA step. We will first take a closer look at how to create/recreate variables and then get an overview of how to subset your data. Along the way, we start using functions and IF-THEN/ELSE statements in the DATA step.

One thing I want you to remember first: when we talk about modifying/subsetting data, there are two points at which it can happen. The first is when you read the data from your SAS data set or raw data file, and the second is when you write the data values into a newly created SAS data set. This is an important point when it comes to efficiency. Suppose you don’t need a variable named “A” in your new SAS data set. You don’t have to (and don’t want to) read this unnecessary variable in the first place. But if you want to create a new variable from A and write the new variable into a new SAS data set, you must read A once for the computation, even if you don’t want to write A itself into the final product. I’ll come back to this read/write distinction throughout this section, so keep it in mind.

After talking about how to subset data, we will also learn the opposite, i.e., how to combine data sets. We mostly focus on DATA steps, but will touch upon a useful procedure as well. And finally, we will briefly discuss how to label and format variables and store them through the DATA step.

How to create/recreate variables?

We already got a glimpse of how to create new variables. Use the basic form below to create/redefine variables in a DATA step.

    varname = expression;

To SPSS or Stata users: Note again that SAS doesn’t need commands equivalent to COMPUTE or -generate-/-replace- to create/redefine variables.

    data a;
        infile 'c:\verybasicsas\lifeexp_csv.dat' dlm=',' dsd missover;
        length country $30;
        input region country $ popgrowth lexp gnppc safewater;
        datasource = 'wb';         /* character constant */
        dataver = 2;               /* numeric constant */
        safewater2 = safewater;    /* duplicate with a different name */
        sqgnppc = gnppc*gnppc;     /* multiplication, squared term */
    proc print data = a (obs=9);
    run;


The above program reads the variables, creates new variables from the read-in variables, and writes the original and the new variables into a new SAS data set “a.” See here for SAS arithmetic operators other than multiplication.

There are tasks that arithmetic operators cannot handle, however. For example, you may want to test a curvilinear effect of per capita GNP specified as a logarithm, rather than a squared term, and thus want to create a logged version. Or you may want to turn all letters in the “country” variable to uppercase. SAS functions do such jobs for you.

    data b;
        set a;
        lngnppc = log(gnppc);        /* log of gnppc */
        country = upcase(country);   /* upper-case of country */
    proc print data = b (obs=9);
    run;

See your output: the new variable “lngnppc” and the upper-cased “country” are printed there for the first nine observations. As you just saw, the basic form of SAS functions is:

    function-name(argument(s)...)

You always need parentheses for a SAS function. SAS has many other functions, of course, which let you handle numeric, character, and date variables. Here is a list.

Finally, let’s talk about how to create variables based on specific conditional logic. Suppose that you want to create a new variable about countries’ development status. Suppose that you define developing countries as those with per capita GNP under $10,000 and code countries that meet the criterion as “developing.” To do this, use an IF-THEN statement. The basic form is:

    if condition(s) then action;

The condition, of course, is that per capita GNP is under $10,000, and the action we want SAS to take is to create a variable where developing countries are indicated as defined. So, we first program as below and submit it to SAS.

    data c;
        set b;
        if gnppc < 10000 then devstatus = 'developing';
    proc print data = c (obs=9);
        var country gnppc devstatus;
    run;

And here’s the output from proc print.


http://www.ssc.upenn.edu/scg/sas/verybasicSAS_operator.pdf


    Obs    country                   gnppc    devstatus
      1    ALBANIA                     810    developing
      2    ARMENIA                     460    developing
      3    AUSTRIA                   26830
      4    AZERBAIJAN                  480    developing
      5    BELARUS                    2180    developing
      6    BELGIUM                   25380
      7    BOSNIA AND HERZEGOVINA        .    developing
      8    BULGARIA                   1220    developing
      9    CROATIA                    4620    developing

Notice that the new variable “devstatus” is coded as “developing” for Bosnia and Herzegovina, despite the fact that the per capita GNP variable is missing for this country. Why does this happen? It happens because SAS treats a numeric missing value (“.”) as smaller than any number (in effect, negative infinity), and hence in the above example Bosnia and Herzegovina met the condition of if gnppc < 10000. So we modify the program like this:

    if gnppc < 10000 & gnppc > . then devstatus = 'developing';

Print the first nine observations and check how the “devstatus” value has changed for Bosnia and Herzegovina. This example also shows you how to specify multiple logical conditions. Here the data values must meet both conditions, so the Boolean operator to use is &:

    if condition & condition then action;

Comparison operators and Boolean operators can be written as mnemonics or as symbols. Here is a list of the comparison operators and the Boolean operators allowed in SAS.

With the if condition(s) then action; statement, you can get SAS to take only one action. If you want to execute multiple actions under the same condition, use the DO and END keywords inside the IF-THEN statement.

    if condition then do;
        action;
        action;
    end;

For example, suppose that after working on the data for some time, someone kindly sent you the per capita GNP (let’s suppose it’s $3,000) and the access to safe water (likewise, 97%) for Bosnia and Herzegovina, which are missing in the original data. Add the two data values, update the two curvilinear terms of the GNP variable, and set the development status for this country.


    if country = 'BOSNIA AND HERZEGOVINA' then do;
        gnppc = 3000;
        safewater = 97;
        sqgnppc = gnppc*gnppc;
        lngnppc = log(gnppc);    /* natural log; SAS uses LOG, there is no LN function */
        devstatus = 'developing';
    end;
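For a self-contained way to try the DO block, here is a sketch on two data lines typed into the Editor. The data set name “toy” and the comma-separated datalines layout are my assumptions, not part of the workshop files; the two GNP values are the ones shown in the printout above. The step first builds the derived variables the way data sets a, b, and c do, and then applies the block.

```sas
data toy;                                    /* hypothetical temporary data set */
    infile datalines dlm=',' dsd missover;   /* commas, so country names may contain blanks */
    length country $30 devstatus $10;
    input country $ gnppc safewater;

    /* derived variables, as in the earlier examples */
    sqgnppc = gnppc*gnppc;
    if gnppc > . then lngnppc = log(gnppc);
    if gnppc < 10000 & gnppc > . then devstatus = 'developing';

    /* several actions under one condition */
    if country = 'BOSNIA AND HERZEGOVINA' then do;
        gnppc = 3000;
        safewater = 97;
        sqgnppc = gnppc*gnppc;
        lngnppc = log(gnppc);
        devstatus = 'developing';
    end;
    datalines;
AUSTRIA,26830,
BOSNIA AND HERZEGOVINA,,
;
run;

proc print data = toy;
run;
```

In the printout, Austria keeps a blank “devstatus”, while Bosnia and Herzegovina now has gnppc = 3000, safewater = 97, sqgnppc = 9000000, and devstatus = developing.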

Sometimes you may need to recode variables by grouping observations. For example, we coded countries with per capita GNP lower than $10,000 as “developing,” yet the countries that do not meet this criterion were not coded. Suppose you want to classify all the countries into three categories: low-income (let’s for now define it as per capita GNP lower than $3,000, coded 1), middle-income ($3,000 to lower than $10,000, coded 2), and high-income ($10,000 or higher, coded 3). To group the variable this way, you should use IF-THEN/ELSE statements.

    if condition then action;
    else if condition then action;
    else if condition then action;
    else action;

This way, your categories are automatically mutually exclusive, because once an observation meets a condition SAS skips the rest of the statements for that observation (so this also helps speed up the processing). The final else action; catches all the remaining observations that do not meet any of the previous conditions. We group the countries into the income groups.

    data life.lifeexp;
        set life.lifeexp;
        if gnppc = . then income = .;    /* we allow missing values here */
        else if gnppc < 3000 then income = 1;
        else if gnppc >= 3000 & gnppc < 10000 then income = 2;
        else income = 3;
    run;

Feel free to print and/or open the SAS data set to see the new variable “income.” As noted, SAS treats a missing value (“.”) as smaller than any number (in effect, negative infinity), so it is safest to handle missing values explicitly in your IF-THEN/ELSE statements as in the example, unless you are absolutely sure your variable does not have any missing values.

How to subset your data?

Sometimes we don’t need to write the whole data into our SAS data set. There are two such cases:

(1) Subset by columns: You may want to remove from your SAS data set variables that you are sure will not be part of your analysis.

(2) Subset by rows: You may keep only those observations that meet certain criteria.


Let’s take a look at those cases one by one.

(1) Subset by columns

Let’s continue to use the life expectancy data. The data contains six variables: region, country, population growth, life expectancy, per capita GNP, and safe water. Suppose you decide you don’t need “popgrowth” and “safewater” in your SAS data set. Let’s create from the file “lifeexp_csv.dat” a temporary SAS data set without these variables. You use list input to read this delimiter-separated data.

    data d;
        keep region country lexp gnppc;
        infile 'c:\verybasicsas\lifeexp_csv.dat' dlm=',' dsd missover;
        length country $30;
        input region country $ popgrowth lexp gnppc safewater;
    proc print data = d (obs=9);
    run;

As you see in your output (omitted here for the sake of space), the new SAS data set “d” does not have the variables “popgrowth” and “safewater.” The keep statement in a DATA step lets you write only the variables you specify into your SAS data set. Conversely, you can just drop “popgrowth” and “safewater” by using the drop statement:

    drop popgrowth safewater;

In the example above, we read raw data, but these two statements also work when you select variables from existing SAS data sets.

Now, remember that I mentioned that data subsetting happens either at the point of data reading or at the point of data writing. The keep/drop statements belong to the latter. That is, SAS reads all the data anyway, but writes only the variables you want into the new SAS data set. This means SAS reads unnecessary data before writing only the specified variables, so it’s not an efficient way. Unfortunately, you have no other choice in list input.

But remember that I also mentioned that column input has advantages because you control which data value locations are read? If your data is arranged in columns, you can specify the columns so as to read only the variables you want. No need to read everything, so it’s more efficient. See the example below.

    data e;
        infile 'c:\verybasicsas\lifeexp_column.dat' missover;   /* missover: some lines end before column 46 */
        input @1 region 1. @3 country $28. @44 lexp 2. @46 gnppc 5.;
    proc print data = e (obs=9);
    run;


Notice the program inputs only the variables we are interested in. You can do this because you have the power to specify data value locations. The rest of the data are not even read in (not to mention written into a new SAS data set).

Note that even if you don’t need the “safewater” variable in your new data set, you must read it once if you need “safewater” to create a new variable (simply put, you can’t use it to create a new variable if you don’t read it!). In that case, use either the keep or the drop statement so that the “safewater” variable itself is not written.

Finally, let’s see how to create a subset of your existing SAS data set. I’ll show three examples.

    data f;
        set life.lifeexp (drop = popgrowth safewater);
    run;

    data f (drop = popgrowth safewater);
        set life.lifeexp;
    run;

    data f;
        set life.lifeexp;
        drop popgrowth safewater;
    run;

The above three produce the same data, but the processes leading to the end product are different. In the first example, the (drop = ) option on the set statement is used. In that case, SAS does not read the two variables from the SAS data set life.lifeexp. So, if you don’t need to work with these variables in the new SAS data set f, this is the most efficient way. In the second example, the (drop = ) option on the data statement is used, whereas in the third example the drop statement is used. Either way, SAS reads the entire life.lifeexp data set first, including those two variables. If you need those variables only to create new variables but don’t need them per se, then use either one of the two. In all three examples, you could conversely use (keep = ) or keep instead.

(2) Subset by rows

Let’s talk about selecting only the observations that meet your criteria. Suppose that in your cross-national research on life expectancy, you decide to focus on developing countries and hence want to keep only developing countries in your SAS data set. Suppose also that you continue to define developing countries as those with per capita GNP under $10,000. To create this subset of developing countries from “lifeexp_csv.dat”, we can use the IF statement in our DATA step. The basic form of the IF statement to keep observations that meet your selection criterion/criteria is

    if condition(s);


condition(s) specifies your selection criterion/criteria. This IF statement has no “then action” part, but SAS knows it means “then keep” when selecting observations. The SAS program below reads the data file “lifeexp_csv.dat”, selects observations against our criterion, and creates a temporary SAS data set “g.”

    * Create a life expectancy data for developing countries;
    data g;
        infile 'c:\verybasicsas\lifeexp_csv.dat' dlm=',' dsd missover;
        length country $30;
        input region country $ popgrowth lexp gnppc safewater;
        if gnppc < 10000 & gnppc > . ;   /* I define developing countries as those with per capita GNP under $10,000 */
    proc print data = g (obs=9);
    run;

Conversely, we can delete those countries with per capita GNP of $10,000 or higher. In that case, type in:

    if gnppc >= 10000 then delete;

Note that this version keeps observations whose gnppc is missing (a missing value is not >= 10000), whereas the IF statement above drops them.

Again, though, in list input, you read all the data once and then write only the observations that meet your selection criteria into your new SAS data set. Column input has an advantage here again.

    * Again, strength of column arranged file;
    data h;
        infile 'c:\verybasicsas\lifeexp_column.dat' missover;   /* missover: some lines end before column 46 */
        input gnppc 46-50 @;
        if gnppc >= 10000 then delete;
        input region 1 country $ 3-30 popgrowth 32-42 lexp 44-45 safewater 51-53;
    proc print data = h (obs=9);
    run;

Notice that two input statements are used. Also notice the @ mark trailing the column specification in the first input statement. This trailing @ tells SAS to hold the current data line after reading the value of “gnppc” (otherwise the next input statement would move on to the next data line). Holding the line there, SAS decides whether to read the rest of the current observation or not. If per capita GNP is $10,000 or more, SAS does not read the rest of that observation. If it is less than $10,000 (or missing), SAS goes on to read the rest of the variables (the second input statement). This way, you don’t have to read all the values of every observation; you read them in full only for the observations that meet the selection condition.

Finally, how do we extract the developing countries from the existing SAS data set “life.lifeexp”?


    * Do the same from the existing SAS data set;
    data i;
        set life.lifeexp;
        where gnppc < 10000 & gnppc > . ;   /* Use "where" instead of "if" */
    proc print data = i (obs=9);
    run;

Notice that in the DATA step, the where statement is used instead of the if statement. Either statement works when you read observations from an existing SAS data set; both test the selection condition. The difference is, again, whether all the observations are read or only the observations that meet the criteria. With the where statement, SAS does not read unnecessary observations, so it is more efficient in subsetting data. The if statement, in contrast, reads all the observations first. So the general recommendation is to use where rather than if when you select observations from existing SAS data sets. But the same tradeoff I mentioned above applies here as well. The where statement can only use the values of variables in the data set being read, whereas the if statement also lets you set selection conditions based on variables you create in the current DATA step. In that case, you need to use the if statement.

How to combine SAS data sets?

Now that we have talked about subsetting data, let’s talk about how to do the opposite, i.e., how to combine SAS data sets. Just like subsetting, there are two dimensions to work on.

(1) Merging related SAS data sets (combine columns): You may want to add new variables to your data.

(2) Appending SAS data sets (combine rows): You may want to add more observations to your data.

Let’s see how to do these two tasks.

(1) Merging related SAS data sets

Merging two related SAS data sets is a simple task. If you want to combine related SAS data sets, use the merge statement in the DATA step. Usually you need, as a unique identifier of each observation, one variable (or a combination of two or more variables) that your SAS data sets have in common under the same name. Both source SAS data sets must first be sorted by the matching variable in the same way so that SAS can correctly match the observations.

Let’s see an example. Suppose we have received from a research colleague a SAS data set named “life.physicians” that includes the number of physicians per 1,000 people. We want to combine this data with our “life.lifeexp” SAS data set. The two data sets have a variable “country” in common, coded in the same way, so we can use this variable as a matching variable.


We first sort those two SAS data sets by the “country” variable and then combine them into a new data set named “trymerge” below.

    proc sort data = life.lifeexp;
        by country;
    proc sort data = life.physicians;
        by country;
    data trymerge;
        merge life.lifeexp life.physicians;
        by country;
    proc print data = trymerge (obs = 9);
        var country physicians;
    run;

To sort your SAS data set, use proc sort. By default SAS sorts your data in ascending order. You specify the variable to sort by in the by statement. Notice that in the DATA step, we have neither an infile nor a set statement; instead we name the two source SAS data sets in the merge statement. Then in the by statement, we tell SAS by what matching variable the two data sets are to be merged. The basic form, for summary:

    data new_SAS_data_set;
        merge SAS_data_set_1 SAS_data_set_2;
        by matching_variable(s);
    run;

And the output below.

    Obs    country                   physicians
      1    Albania                      1.37410
      2    Argentina                     .
      3    Armenia                      3.92440
      4    Austria                      2.20000
      5    Azerbaijan                   3.91720
      6    Belarus                      3.56040
      7    Belgium                      3.30000
      8    Bolivia                      0.45000
      9    Bosnia and Herzegovina       1.57030

Now, when you merge two SAS data sets, the resulting new data set can have three groups of observations: (1) observations that only SAS_data_set_1 contributed to the new data set; (2) observations that only SAS_data_set_2 contributed; (3) observations that both SAS data sets contributed. If you have come from Stata, you know Stata’s –merge– automatically tracks which of the source data sets contributed each observation in the new data set and creates a tracking variable _merge.


In SAS, for that purpose, you can use the (in = ) option on the merge statement when combining SAS data sets. Place this option right after the name of the SAS data set you want to track and specify a tracking variable name within the parentheses. For example,

    merge life.lifeexp (in = inlife) life.physicians (in = inphy);

In the above example, SAS will create temporary variables called “inlife” and “inphy”, which will be deleted after the current DATA step. The tracking variables are binary-coded, where 1 means the data set contributed to the observation and 0 means it didn’t. Using this option, you can create new variables. If, for example, you want to create an equivalent of Stata’s _merge variable, then you can do this.

    data try_mergevar;
        merge life.lifeexp (in = inlife) life.physicians (in = inphy);
        by country;
        if inlife = 1 & inphy = 1 then _merge = 3;
        else if inphy = 1 then _merge = 2;
        else _merge = 1;
    run;

I create a new SAS data set “try_mergevar” and print all the variables for the first 15 observations below (five cases in the middle omitted for the sake of space). You see the _merge variable on the right-hand side. Notice that the variables “inlife” and “inphy” are not there; as I said above, tracking variables are temporary and they are gone when the current DATA step is completed.

    Obs  country      region  popgrowth  lexp  gnppc  safewater  income  physicians  _merge
      1  Albania         1     1.20000    72     810      76        1      1.37410      3
      2  Argentina       3     1.40000    73    8030      65        2       .           1
      3  Armenia         1     1.10000    74     460       .        1      3.92440      3
      4  Austria         1     0.40000    79   26830       .        3      2.20000      3
      5  Azerbaijan      1     1.40000    71     480       .        1      3.91720      3
    (Output of Obs 6 to 10 omitted for the sake of space)
     11  Bulgaria        1    -0.40000    71    1220       .        1      3.16960      3
     12  Canada          2     1.20000    79   19170      99        3      2.10000      3
     13  Chile           3     1.60000    75    4990      85        2      1.10000      3
     14  Colombia        3     2.00000    70    2470      78        1      1.09000      3
     15  Costa Rica      .      .          .       .       .        .      1.26000      2

If you are sure that you only need the countries covered in the first SAS data set, then you can use the tracking variable to subset your data accordingly in the same DATA step. Let’s run the program below and create a new data set “life.lifeexp_morevar”.

    data life.lifeexp_morevar;
        merge life.lifeexp (in = inlife) life.physicians;
        by country;
        if inlife = 1;
    run;
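If you want to see the tracking variables at work without the workshop files, here is a sketch on two tiny made-up data sets. The names “toy1”, “toy2”, and “toyboth” and the variables id, x, and y are hypothetical; the logic is the same as in the try_mergevar step.

```sas
data toy1;                 /* hypothetical "first" data set */
    input id x;
    datalines;
1 10
2 20
3 30
;
run;

data toy2;                 /* hypothetical "second" data set */
    input id y;
    datalines;
2 200
3 300
4 400
;
run;

/* Both toy data sets are already in id order;
   real data would need proc sort by id first */
data toyboth;
    merge toy1 (in = in1) toy2 (in = in2);
    by id;
    if in1 = 1 & in2 = 1 then _merge = 3;   /* in both */
    else if in2 = 1 then _merge = 2;        /* only in toy2 */
    else _merge = 1;                        /* only in toy1 */
run;

proc print data = toyboth;
run;
```

In the printout, id 1 comes only from toy1 (_merge = 1, y missing), ids 2 and 3 come from both (_merge = 3), and id 4 comes only from toy2 (_merge = 2, x missing). Adding the statement if in1 = 1; to the last DATA step would keep ids 1 to 3 only.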


Note that the where statement cannot be used in the “life.lifeexp_morevar” step above, even though we read SAS data sets in and write a new SAS data set, because the subsetting is based on a variable created in that DATA step, not on any variable in either of the source data sets (see the discussion of where versus if in the subsetting part above if this does not make sense). The resulting new SAS data set “life.lifeexp_morevar” keeps the 68 countries of the first data set “life.lifeexp” while adding the new variable “physicians” wherever available.

(2) Appending SAS data sets

Suppose you obtained a SAS data set “life.lifeexp_moreobs”, which contains more country observations for the same variables. Suppose also you want to stack it with the SAS data set “life.lifeexp_morevar.” The “life.lifeexp_moreobs” file has 10 observations (feel free to run proc print or to open the file directly from the Explorer to see the content of the file). Let’s first see how to do this in a DATA step. The task is actually easy: you just need to list the two SAS data sets in the set statement. We know the set statement already. What’s new to learn here is that the set statement allows you to concatenate any number of SAS data sets.

    data life.lifeexp_xpand;
        set life.lifeexp_morevar life.lifeexp_moreobs;
    run;

The program stacks the two SAS data sets in the order in which they are listed in the set statement. Here is the printout of the “country” variable (partly omitted for the sake of space). Obs 1-68 are from the first SAS data set, and Obs 69-78 are from the second.

    Obs    country
      1    Albania
      2    Argentina
    (Omitted in between for the sake of space)
     66    Uzbekistan
     67    Venezuela
     68    Yugoslavia, FR (Serb./Mon.)
     69    Algeria
     70    Bangladesh
     71    Botswana
     72    Burundi
     73    Cambodia
     74    China
     75    India
     76    Kenya
     77    Oman
     78    Syrian Arab Republic

Now, the original SAS data sets are each sorted by country, but when appended this way, the resulting data set loses that alphabetic order. You could run proc sort on the new data, of course, but in fact you don’t have to. Just let SAS know your sort variable by adding the by statement in this DATA step.

    data life.lifeexp_xpand;
        set life.lifeexp_morevar life.lifeexp_moreobs;
        by country;
    run;

Print the “country” variable and see that the resulting SAS data set is already sorted.

    Obs    country
      1    Albania
      2    Algeria       (from the second data set)
      3    Argentina
      4    Armenia
      5    Austria
      6    Azerbaijan
      7    Bangladesh    (from the second data set)
      8    Belarus
      9    Belgium
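The difference between plain stacking and stacking with a by statement is easy to check on toy data. In the sketch below, the data set names (“first_set”, “second_set”, “stacked”, “interleaved”) are made up, and the five country names are simply borrowed from the listings above; each toy data set is typed in alphabetic order, which the by statement requires.

```sas
data first_set;                 /* hypothetical, already sorted by country */
    length country $20;         /* default length 8 would truncate Bangladesh */
    input country $;
    datalines;
Albania
Armenia
Belgium
;
run;

data second_set;                /* hypothetical, already sorted by country */
    length country $20;
    input country $;
    datalines;
Algeria
Bangladesh
;
run;

/* Plain concatenation: all of first_set, then all of second_set */
data stacked;
    set first_set second_set;
run;

/* With BY: observations are interleaved in country order */
data interleaved;
    set first_set second_set;
    by country;
run;

proc print data = stacked;     run;
proc print data = interleaved; run;
```

The first printout lists Albania, Armenia, Belgium, Algeria, Bangladesh; the second lists Albania, Algeria, Armenia, Bangladesh, Belgium.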

As mentioned, the set statement allows you to concatenate any number of SAS data sets. But if you want to stack two SAS data sets with the same variables and attributes, proc append may be a more efficient way.

    proc append base = SAS_dataset_1 data = SAS_dataset_2;
    run;
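As a quick sketch on made-up data (the data set names “base_toy” and “new_toy” and the variables id and x are hypothetical), the basic form can be tried like this. One thing to keep in mind is that proc append does not create a new data set; it adds the observations to the base data set itself.

```sas
data base_toy;              /* hypothetical base data set: 2 observations */
    input id x;
    datalines;
1 10
2 20
;
run;

data new_toy;               /* hypothetical data set to append: 1 observation */
    input id x;
    datalines;
3 30
;
run;

/* Adds the observations of new_toy to the end of base_toy */
proc append base = base_toy data = new_toy;
run;

proc print data = base_toy;   /* base_toy now has 3 observations */
run;
```

If you want to keep the original base data set unchanged, make a copy first or use the set statement in a DATA step instead.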

The reason this procedure may be more efficient for stacking two data sets lies in the number of data sets SAS reads in: with the set statement SAS reads all the observations from both of the SAS data sets specified, whereas with proc append it reads only the observations from SAS_dataset_2. Hence, if you have big data sets, proc append is efficient (just put the bigger one in the base = option in that case). For more details and the strengths and weaknesses of this procedure compared with the set statement in the DATA step, here is one tips paper.

How to label and format your variables?

Finally, we will briefly discuss two things: (1) how to label your variables and (2) how to associate formats with your variables (i.e., how to label the values of your variables). For Stata and SPSS users, these nearly correspond to variable labels and value labels. However, SAS’s value labels (called “formats”) demand a slightly different mindset, since there is a lot more to them than simple labeling. Here, we will cover just the very basics of SAS formats.

(1) Variable labels

OK, let’s start with labeling variables. You can document your variables by using the label statement in DATA steps. A label can be up to 256 characters long (including blanks).


http://www2.sas.com/proceedings/forum2008/085-2008.pdf

    data life.lifeexp_xpand;
       set life.lifeexp_xpand;
       label popgrowth  = 'Population growth %'
             lexp       = 'Life expectancy'
             gnppc      = 'GNP per capita $'
             safewater  = '% of pop with access to safe water'
             physicians = '# of physicians/1,000 pop'
             income     = 'Income level';
    run;

When you do this in a DATA step, your variable labels will be stored in the data set and printed in your output. Let’s get basic descriptive statistics for a couple of variables from the same data.

    proc means data = life.lifeexp_xpand;
       var popgrowth lexp gnppc;
    run;

And you see now the labels are printed (below).

    Variable   Label                  N          Mean       Std Dev       Minimum       Maximum
    -------------------------------------------------------------------------------------------
    popgrowth  Population growth %   78     1.1970973     1.0755506    -0.5000000     3.5553560
    lexp       Life expectancy       78    70.8461538     6.3636078    46.0000000    79.0000000
    gnppc      GNP per capita $      72       7932.44      10188.09   350.0000000      39980.00
    -------------------------------------------------------------------------------------------

The label statement in a PROC step also lets you produce customized output with labeled variable names, but those labels are not saved in the data set and are effective for the duration of that PROC step only.

(2) Value labels (Format)

Now, let’s talk about SAS formats (= value labels). Unlike SAS informats, which tell SAS how to input data, formats tell SAS how to output data, i.e., how to print data values. Let’s start with SAS’s system (i.e., built-in) formats. SAS system formats can be used in SAS procedures to change the appearance of your printed values. What we do here is to make an association between a format and a variable to produce customized output. For example, the program below associates the format dollarw.d.
with the per capita GNP variable in the format statement, so that the variable is printed with a $ sign and commas separating every three digits (just as with informats, don’t forget to place a period at the end of each format name; that way SAS can tell variable names and format names apart).


    proc print data = life.lifeexp_xpand (obs = 9);
       format gnppc dollar7.;
       var country gnppc;
    run;

Here’s your output.

    Obs  country        gnppc
      1  Albania         $810
      2  Algeria       $4,350
      3  Argentina     $8,030
      4  Armenia         $460
      5  Austria      $26,830
      6  Azerbaijan      $480
      7  Bangladesh      $510
      8  Belarus       $2,180
      9  Belgium      $25,380

The association between the format dollar7. and the “gnppc” variable is not stored in the data if used in a PROC step; it works only for the duration of the PROC step just to produce formatted output. You can, however, use the format statement in a DATA step to save this association in the data. This way, you don’t have to use the format statement in each PROC step thereafter.

    data life.lifeexp_xpand;
       set life.lifeexp_xpand;
       format gnppc dollar7.;
    run;

    proc print data = life.lifeexp_xpand (obs = 9);
       var country gnppc;
    run;

See your output; even without the format statement in proc print, the per capita GNP is printed with the format stored in the data set. SAS has many character, date/time, and numeric formats available for use like this. You have a list here.

Now, aside from the SAS system formats, sometimes you might want to create your own formats that specifically fit your variables. For example, we created a variable “income” some moments ago, which codes each country’s income level into three categories with the coded values ranging from 1 to 3. Let’s print the first six observations for this variable.

    Obs  income
      1       1
      2       2
      3       2
      4       1
      5       3
      6       1



In the output, what the codes mean is unclear without the data’s codebook. A better idea would be to create our own format, so that the meanings of the coded values are clear at a glance. To create your own format, use proc format. This procedure does not generate output; it just creates and catalogs formats either temporarily or permanently (we’ll talk about it shortly). Let’s create a format for the income variable and another for the region variable.

    proc format;
       value incomegrp  1 = 'Low'        /* value labels for income */
                        2 = 'Middle'
                        3 = 'High';
       value regionname 1 = 'Europe'     /* value labels for region */
                        2 = 'N Amr'
                        3 = 'S Amr'
                        4 = 'Sub-S Afr'
                        5 = 'Asia'
                        6 = 'M East';
    run;

As you can see, the basic form of this procedure is:

    proc format option;
       value format_name range1 = 'text1'
                         range2 = 'text2' ...;

The format name follows right after the value keyword. Then you define the text content of that format. Don’t forget to enclose the label text in quotes. The range specifications in the value statement can take many forms (a couple of quick examples here). Once you create your own format, you can associate it with variables by using the format statement, just like you did with the built-in format dollarw.d. in the previous example.

    proc print data = life.lifeexp_xpand (obs = 6);
       format income incomegrp.;
       var income;
    run;

And you get this output.

    Obs  income
      1  Low
      2  Middle
      3  Middle
      4  Low
      5  High
      6  Low


http://www.ssc.upenn.edu/scg/sas/verybasicSAS_formatex.pdf
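To give a feel for what a range specification can look like, here is a minimal sketch that groups per capita GNP into three bands. The format name gnpgrp and the cut points are made up for this illustration; low and high are SAS keywords for the smallest and largest values, and -< means “up to but not including.”

```sas
proc format;
   value gnpgrp
      low   -< 1000  = 'Under $1,000'
      1000  -< 10000 = '$1,000 to under $10,000'
      10000 -  high  = '$10,000 or more'
      .              = 'Missing';
run;

/* proc freq counts countries by formatted value, so we get one row per band. */
proc freq data = life.lifeexp_xpand;
   tables gnppc / missing;
   format gnppc gnpgrp.;
run;
```

Because the format statement sits in the PROC step, the grouping applies to this table only and the stored values of gnppc are not changed.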

The formats we created (i.e., incomegrp, regionname) are actually temporary and effective only for the duration of the current SAS session. As such, formats are placed in a temporary format catalog located in SAS’s “work” library, just like temporary SAS data sets.

Go to the Explorer and click open the “Work” library to find the catalog “formats” and you find the two formats you just created.

Temporary formats are deleted at the end of the current SAS session, again just like temporary SAS data sets. This means that you would need to re-create the same formats every time your job is run. Otherwise, they are missing, and if you associated your temporary format with your variables in a DATA step and stored the association in your data, that association cannot be resolved in a new session. In that case, SAS will not let you do anything with the data file; it just issues the error message “ERROR: Format <formatname> not found or couldn't be loaded for variable <variable name>” and the data set does not even open!

Thus, if you want to use your formats in another session or in another program without recreating them every time, there are two things you need to do. First, you need to save your formats so that you can use them permanently. Second, once you save your formats, you need to load your permanent format catalogs when you want to use them. In what follows I’ll explain these two points. Finally, after discussing these points, we will also learn how to open your SAS data set when the associated format is missing.

To save your formats, use the library option on proc format to specify a catalog storage libref.

proc format library = libref.catalogname; Three things to note:

1. If the library option is unspecified as in the previous proc format example above, your formats are stored in the WORK.FORMATS catalog (as you know, WORK is SAS’s temporary library, and FORMATS is a default catalog name).

2. If you specify library = libref only, the formats are stored in libref.FORMATS catalog. 3. If you specify library = libref.catalogname, the formats are stored in that catalog.

So, for example,

libname myfmt 'c:\verybasicsas\myfmt'; proc format library = myfmt;

will create a permanent catalog named “myfmt.formats,” and

libname myfmt 'c:\verybasicsas\myfmt'; proc format library = myfmt.cat1;


will create a permanent catalog named “myfmt.cat1.” So, let’s first try creating a new library and name it “myfmt” and save our format there: libname myfmt 'c:\verybasicsas\myfmt'; proc format library = myfmt; value incomegrp 1 = 'Low' 2 = 'Middle' 3 = 'High'; run;
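If you prefer to check the catalog from a program rather than from the Explorer, one possibility is the fmtlib option of proc format, which prints the contents of the formats stored in a catalog. This is only a sketch and assumes the libref myfmt is still assigned as in the program above.

```sas
/* Print the definitions of the formats stored in the myfmt.formats catalog. */
proc format library = myfmt fmtlib;
run;
```

The listing shows each format name together with its ranges and labels, which is handy when you come back to a catalog months later.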

Find in the Explorer a new format catalog “myfmt.formats.” Now, however, this program…

    data life.lifeexp;
       set life.lifeexp;
       format income incomegrp.;
    run;

… still does not work in a new SAS session (where the temporary copy of the format in WORK is gone); SAS will still say “ERROR: Format INCOMEGRP not found or couldn't be loaded for variable income.” Why? Because SAS does not know where your permanent format incomegrp. is saved. To help SAS find it, use a SAS system option called fmtsearch.

options fmtsearch=(item_1, item_2, item_3..., item_n); … where item is either libref or libref.catalogname. If libref only is specified, FORMATS is assumed as the catalog name. Then SAS searches in that libref or libref.catalog specified in item for your permanent formats. So you can do: options fmtsearch=(myfmt); data life.lifeexp_xpand; set life.lifeexp_xpand; format income incomegrp.; run; Then SAS will find the format and use it. Another way is to use the special libref “library” (instead of a user-defined libref like “myfmt” in the above example) in proc format. If you do this, permanent formats can be accessed directly without the help of the fmtsearch option. See below.


    libname library 'c:\verybasicsas\myfmt';

    proc format library = library;
       value regionname 1 = 'Europe'
                        2 = 'N Amr'
                        3 = 'S Amr'
                        4 = 'Sub-S Afr'
                        5 = 'Asia'
                        6 = 'M East';
    run;

    data life.lifeexp_xpand;
       set life.lifeexp_xpand;
       format region regionname.;
    run;

SAS searches for formats in the following order: System (built-in) formats catalog → WORK.FORMATS → LIBRARY.FORMATS → user-defined libref.catalogname. Thus, when you are using a SAS data set with associated format libraries, a good idea would be to start your program with:

    libname library 'whatever directory you specify & store fmts';
    libname your_libref 'whatever directory you specify & store fmts';
    options fmtsearch = (your_libref);   /* or your_libref.catalogname */

Then SAS’s search will cover all the system, temporary, and permanent formats. As it is clear now, SAS saves format associations in data sets, but not the formats themselves; the formats are saved in separate files. This means you need to have your SAS data set and associated format files together in a way SAS can find the formats. Unfortunately, however, it sometimes happens the two are separated—like, for example, somebody kindly sent you a SAS data set for your research, but without the associated format files—and you cannot open such a SAS data set! Good news is there is a way to open SAS data sets without the associated formats. Just use SAS system option nofmterr. SAS opens and uses the data without the associated formats. Try the following and see the output. options nofmterr; proc print data = life.lifeexp_xpand; run;
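The nofmterr option has to be set in every session in which you use such a data set. If you would rather drop the broken association for good, one possible sketch is to remove the format from the variables with proc datasets. Note that this changes the stored data set, so it assumes you really do not expect to get the format files later; the variable names are the ones used in this workshop.

```sas
options nofmterr;

/* Listing a variable in a format statement without a format name
   removes whatever format is currently associated with it. */
proc datasets library = life nolist;
   modify lifeexp_xpand;
   format income region;
quit;
```

After this, the variables print as their raw coded values (1, 2, 3, ...) in any session, with or without the format catalog.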

4. How To Get Descriptive Statistics

So far, we have focused on DATA steps, the first building block of a SAS program. From this section on, we also start learning more about the second building block of SAS programs, PROC steps. As we saw in the overview, PROC steps typically use SAS data sets created through DATA steps to conduct analyses and produce reports. In this section, we will explore the data to get a good idea of what it looks like (what its central tendency is, how it is distributed, etc.) before getting down to data analysis. Below are the procedures we will cover in this section.


(1) PROC CONTENTS
(2) PROC MEANS
(3) PROC SORT
(4) PROC UNIVARIATE
(5) PROC BOXPLOT
(6) PROC FREQ
(7) PROC CHART
(8) PROC CORR (and ODS GRAPHICS)
(9) PROC INSIGHT

Remember, we will cover the basic, most common usages of those procedures. For further information, use the help menu. We will use the SAS data set life.lifeexp_xpand. We first take a look at the information about the data by using proc contents. This is an easy and quick way to get a description of your data.

    proc contents data = life.lifeexp_xpand;
    run;

The output looks like a table of contents for the data; it shows you the information that is stored in the data: the data set name, the number of observations and variables in the data, the date the data was created/modified, variable names, types, lengths, formats, labels, and so forth. If you store informats in the data, the information about them will also be displayed.

proc means and proc univariate give you a pretty good idea about your data and they are probably the first two procedures to use for data exploration. We already used proc means. As you saw in that example, proc means by default gives you N, mean, standard deviation, minimum, and maximum. Thus, you need to explicitly request other statistics (if you need them) by specifying options for this procedure.

    proc means data = life.lifeexp_xpand n nmiss range mean median skew;
    run;

The following list is not exhaustive, but it covers some frequently used statistics options.

    n                    Number of observations
    nmiss                Number of missing observations
    mean                 Mean
    median (or p50)      Median (Percentile 50)
    mode                 Mode
    stddev (or std)      Standard deviation
    var                  Variance
    cv                   Coefficient of variation
    range                Range
    min                  Minimum
    max                  Maximum
    stderr               Standard error
    skew                 Skewness
    kurtosis (or kurt)   Kurtosis


You can perform separate analyses for different groups. Also, you can get information for specific variables. Suppose you want to get region-by-region descriptive statistics (N, mean, standard deviation, skewness) for life expectancy and per capita GNP.

    proc means data = life.lifeexp_xpand n mean std skew;
       var lexp gnppc;
       class region;
    run;

And you get the following output.

    region     N Obs  Variable  Label               N          Mean       Std Dev      Skewness
    ---------------------------------------------------------------------------------------------
    Europe        44  lexp      Life expectancy    44    73.0681818     4.1506387    -0.1834883
                      gnppc     GNP per capita $   41      10738.05      11793.66     0.9110311
    N Amr         14  lexp      Life expectancy    14    71.2142857     6.4112710    -1.5544574
                      gnppc     GNP per capita $   12       5817.17       8929.48     2.2339283
    S Amr         10  lexp      Life expectancy    10    70.3000000     3.8311588    -1.0278681
                      gnppc     GNP per capita $   10       3645.00       2254.53     0.7757098
    Sub-S Afr      3  lexp      Life expectancy     3    56.0000000     8.8881944    -1.3458330
                      gnppc     GNP per capita $    3       2036.67       2396.59     1.5983736
    Asia           4  lexp      Life expectancy     4    59.5000000     6.6080759     1.5595074
                      gnppc     GNP per capita $    3   723.3333333   187.1719352    -1.5339748
    M East         3  lexp      Life expectancy     3    68.3333333     1.5275252     0.9352195
                      gnppc     GNP per capita $    3       5446.67       4038.27     1.1319316
    ---------------------------------------------------------------------------------------------

The var statement lets you specify the variables you want, and the class statement specifies one or more grouping variables. You can use the by statement instead of the class statement; you will get the same statistics (although the output looks more compact when class is used). However, to use the by statement, the data first needs to be sorted by that variable. Use proc sort in that case.

Another point I want to mention is how to get descriptive information for a subset of the data. Suppose you want to get descriptive information for countries with per capita GNP above $10,000. Do we have to create a subset file by going through a DATA step? Not really.
proc means data = life.lifeexp_xpand n mean std min skew; var lexp gnppc; where gnppc > 10000; run; The where statement can be used for any procedure to use a subset of your data. So, this is a very efficient way because you don’t have to create a lot of SAS data sets every time you want to examine different subsets.
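For comparison, here is a minimal sketch of the by-statement version of the region-by-region analysis mentioned earlier. Since by requires sorted data, the sketch first writes a sorted copy; the name work.lifeexp_byregion is made up for this illustration.

```sas
/* The by statement needs the data sorted by region; keep the original order intact. */
proc sort data = life.lifeexp_xpand out = work.lifeexp_byregion;
   by region;
run;

/* Same statistics as the class version, printed as one table per region. */
proc means data = work.lifeexp_byregion n mean std skew;
   var lexp gnppc;
   by region;
run;
```

A where statement can be added to this PROC step as well if you want by-group statistics for a subset only.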


Let’s talk about proc univariate, another nice way to explore your data. This simple procedure by default gives you more descriptive information about the central tendency and distribution, as well as missing values and extreme observations. Let’s do this for the per capita GNP variable. proc univariate data = life.lifeexp_xpand; var gnppc; run; See your output. proc univariate has some useful options and statements. Type in the following program and submit it. What additional information did you get? proc univariate data = life.lifeexp_xpand plot normal; var gnppc; id country; run; The plot option generates three types of ASCII plots to describe the distribution of your data: stem-and-leaf, box, and normal probability plots (see below). Meanwhile, the normal option gives you normality test statistics (the 4th item from the top).

Meanwhile, the id statement is helpful when you want to identify extreme observations. Without this statement, SAS only generates a list of the top and the bottom observation numbers and their values. If you want to know which countries are those extremes, for example, specify the id variable as “country,” as in the example program.


[Figure: extreme observations output, without the id statement and with the id statement]

Just as with proc means, you can use the by statement or the class statement in proc univariate to get category-by-category statistics. The difference between by and class in proc univariate is that with the by statement (and the plot option) SAS also produces group-by-group comparative ASCII box plots, as below. Remember that the by statement requires the data to be sorted by that variable first.

    proc sort data = life.lifeexp_xpand;
       by region;
    run;

    proc univariate data = life.lifeexp_xpand plot normal;
       var gnppc;
       by region;
    run;


Another useful statement in proc univariate to check distributions is histogram, which lets you create a simple histogram as the statement name suggests. In the below program, I specify the normal and the kernel options of the histogram statement to fit a normal curve and a kernel density curve. I also use the noprint option on proc univariate to suppress the tables of statistics. The histogram generated follows right below the program. The midpoints option is to specify a list of numbers to use as midpoints for the horizontal axis. proc univariate data = life.lifeexp noprint; histogram gnppc / normal kernel midpoints = 5000 to 40000 by 5000; run; And here’s your output.

In addition to proc means and proc univariate, there are other useful procedures for data exploration. proc freq generates frequency tables (including cross tabulations) for you. Aside from their primary purpose of showing the distribution of categorical data, frequency tables often also help you check for and detect data errors. Let’s first see the distribution of the variable “income.”

    proc freq data = life.lifeexp_xpand;
       tables income;
    run;

Note that the statement to specify your variable is tables, instead of var, in proc freq. Below is your frequency table for “income.” Notice the number of missing values reported at the bottom: by default SAS leaves missing values out of the table.


                             Income level

                                       Cumulative    Cumulative
    income    Frequency     Percent     Frequency       Percent
    -----------------------------------------------------------
    Low              36       50.00            36         50.00
    Middle           18       25.00            54         75.00
    High             18       25.00            72        100.00

                      Frequency Missing = 6

(Missing values are removed from the table.)

Let’s create a crosstab of income by region. To do so, specify your variable combination by using “*” in the tables statement. proc freq data = life.lifeexp_xpand; tables income * region; run; Here’s your output.

The upper left cell gives you the legend for the four figures shown in each cell. The rows are for the variable “income” and the columns are for “region.” Just one thing to note: what if you need to show missing values in your frequency table? In that case, you need to add the missing option to the tables statement.

tables variable or variable-combination / missing;
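Applied to our data, the general form above might look like the sketch below. The second tables statement additionally uses the norow, nocol, and nopercent options, which are my own addition here, to keep only the cell counts in the crosstab.

```sas
proc freq data = life.lifeexp_xpand;
   /* One-way table that counts missing income as its own category. */
   tables income / missing;

   /* Crosstab with missing values included and only frequencies shown. */
   tables income * region / missing norow nocol nopercent;
run;
```

With the missing option, the six countries without an income code appear as a row of their own and are included in the percentages.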


Categorical frequency distribution can be checked visually as well. proc chart generates some basic ASCII charts in the output window. Let’s visualize the frequency distribution of the variable “income.” proc chart data = life.lifeexp_xpand; hbar income / discrete; run; And here’s your output.

proc chart has five statements that produce different types of ASCII graphics.

    hbar    Horizontal bar chart
    vbar    Vertical bar chart
    block   Block chart
    pie     Pie chart
    star    Star chart

The basic form is:

    chart_statement varname(s) / options;

By using the discrete option on the hbar statement, I told SAS in the above example that the numeric values of the variable “income” are discrete. By adding the group option, we can also produce a region-by-region distribution of income levels. Let’s try the program below.

    proc chart data = life.lifeexp_xpand;
       hbar income / discrete group = region;
    run;

And your output is below.


proc boxplot generates a set of group-by-group box plots. proc sort data = life.lifeexp_xpand; by region; proc boxplot data = life.lifeexp_xpand; plot gnppc*region; run; And it generates plots like below.


The key part of the program is:

    plot analysis_variable*grouping_variable / options;

The statement has many options to control the appearance of the output plot, but only the part before the slash is required.

To examine bivariate correlations among variables, use proc corr. Suppose we want to examine correlations among life expectancy, per capita GNP, and population growth.

    proc corr data = life.lifeexp_xpand;
       var popgrowth lexp gnppc;
    run;

It first gives you simple descriptive statistics and then a correlation matrix that looks like this.

Each cell has three lines: Pearson correlation coefficients, probability, and the number of observations, from the top to the bottom. The var statement gives you correlations of all the pairs of the listed variables. With the with statement, you can get information about specified pairs. Suppose, for example, you are only interested in how the variable “lexp” is correlated with the other two. See the program and the output below. proc corr data = life.lifeexp_xpand; var popgrowth gnppc; with lexp; run;


As you can see in the output, proc corr by default uses pairwise deletion for missing values. With the nomiss option, proc corr uses listwise deletion and drops all cases with missing values on any of the specified variables.

    proc corr data = life.lifeexp_xpand nomiss;
       var popgrowth lexp gnppc;
    run;

See your output. Now the number of cases is 63 for all the pairs.

To generate a visual presentation of a correlation matrix, you have a couple of ways. One is to use SAS ODS (Output Delivery System). ODS lets you produce output in a variety of formats and makes the formatted output easy to access. Once you turn on ODS graphics with the ods graphics on statement, you can use the plots option on proc corr to produce a scatter plot matrix.

    ods graphics on;
    proc corr data = life.lifeexp_xpand plots=matrix;
       var popgrowth lexp gnppc;
    run;
    ods graphics off;

Go to the Results window and unfold the result tree for the proc corr above (it says “Corr: The SAS System”) and find an output item named “scatter plot matrix.” Double click to open it, and you will get a scatter plot matrix.


http://support.sas.com/rnd/app/da/stat/odsgraph/index.html

Another way is to use proc insight. proc insight offers you an environment for a variety of interactive data exploration tasks. Here, we can use it to create a basic scatter plot matrix.

    proc insight data = life.lifeexp_xpand;
       scatter popgrowth lexp gnppc * popgrowth lexp gnppc;
    run;

You get a scatter plot matrix. One difference is that since proc insight is an interactive environment, you can examine your data directly on its graphics. Notice, for example, that when I click on the potential outlier, SAS shows which observation it is and where the case is in the other cells (see the bigger square points circled in red; the observation number is 70). Double click on that data point to get more detailed information. You see the observation is Burundi.


5. Analysis Example

Now, finally, let’s take a quick look at a simple and basic data analysis example by using OLS. Suppose we are interested in finding what accounts for life expectancy and decide to test for possible effects of economic development, social infrastructure, and the demographic trend.

From the literature and the descriptive exploration, we suspect that the relationship between economic development and life expectancy is curvilinear. That is, though economic development has a positive effect on life expectancy, beyond some threshold, each additional unit yields less and less impact on life expectancy. We decide to use per capita GNP as an indicator of development and test its logged term, but just in case also see how the result looks if it’s specified as a squared term.

As for social infrastructure, we decide to use the number of physicians per 1,000 people as a proxy. This variable supposedly captures each society’s institutional strength in educating and producing medical experts and providing health services to its members. Societies with good social infrastructure are expected to enjoy higher life expectancy. Finally, we use annual population growth to measure the demographic trend. We expect this variable to have a negative relationship with life expectancy.

To conduct this simple analysis, we start by adding new variables to the life.lifeexp_xpand data.

    data life.lifeexp_xpand;
       set life.lifeexp_xpand;
       lngnppc = log(gnppc);
       sqgnppc = gnppc*gnppc;
       label sqgnppc = 'Per capita GNP squared'
             lngnppc = 'Per capita GNP logged';
    run;

To run an ordinary least squares model, use proc reg. The basic form is:

    proc reg;
       model dependent_var = independent_vars;

So, we are running the following two models (the difference is the functional form of the GNP variable).

    proc reg data = life.lifeexp_xpand;
       model lexp = lngnppc physicians popgrowth;
       model lexp = gnppc sqgnppc physicians popgrowth;
    run;

Look at the outputs.
From the fit statistics, you can see the more parsimonious model (i.e., the one with logged per capita GNP) performs better, so we will go with that model. Here’s your output of that model.


    Number of Observations Read                    78
    Number of Observations Used                    63
    Number of Observations with Missing Values     15

                             Analysis of Variance

                                  Sum of          Mean
    Source             DF        Squares        Square    F Value    Pr > F
    Model               3     1471.97271     490.65757      32.83    <.0001
    Error              59      881.67808      14.94370
    Corrected Total    62     2353.65079

    Root MSE            3.86571    R-Square    0.6254
    Dependent Mean     70.68254    Adj R-Sq    0.6064
    Coeff Var           5.46911

                             Parameter Estimates

                                                Parameter    Standard
    Variable    Label                       DF   Estimate       Error    t Value    Pr > |t|
    Intercept   Intercept                    1   41.95323     4.32361       9.70      <.0001
    lngnppc     Per capita GNP logged        1    3.19706     0.40893       7.82      <.0001
    physicians  # of physicians/1,000 pop    1    1.54149     0.55167       2.79      0.0070
    popgrowth   Population growth %          1   -0.05581     0.71379      -0.08      0.9379

The results give support to our hypotheses, except for population growth, which points in the right direction but is not statistically significant.

We could add other statements and options to the general form of proc reg to conduct further analyses and diagnostics. Let’s get standardized coefficients to see the magnitude of each effect. Let’s also check for any collinearity problem by adding the following options to the model statement.

• The stb option (which gives you standardized regression coefficients, i.e., betas)
• The tol and vif options (tolerance and variance inflation factor)
• The collinoint option (another way to check for collinearity; this option excludes the intercept in the calculation of collinearity statistics. To include the intercept, use collin)

Let’s examine the logged per capita GNP specification.

    proc reg data = life.lifeexp_xpand;
       model lexp = lngnppc physicians popgrowth / stb tol vif collinoint;
    run;

The standardized (beta) coefficient tells you the change, in standard deviations, of the dependent variable that is associated with a one standard deviation increase in each of your independent variables.


Standardized this way, the beta coefficients let you compare how much impact each of your predictors would have. As for the collinearity diagnostics we tried, tolerance and VIF are inversely related (i.e., 1/tolerance = VIF) and thus give you the same information. Although there is no definite cut-off line, a rule of thumb is that VIF > 10 (or tolerance < 0.1) merits further investigation. In our example, the tolerance/VIF values all look okay (see below).

                                                Standardized                 Variance
    Variable    Label                       DF      Estimate    Tolerance   Inflation
    Intercept   Intercept                    1             0            .           0
    lngnppc     Per capita GNP logged        1       0.68821      0.81937     1.22045
    physicians  # of physicians/1,000 pop    1       0.32045      0.48274     2.07152
    popgrowth   Population growth %          1      -0.00957      0.42347     2.36146

The collinearity diagnostics obtained with the collinoint option also support the absence of a collinearity problem, since the general rule of thumb is that a condition index larger than 30 indicates strong collinearity.

                  Collinearity Diagnostics (intercept adjusted)

                            Condition    ------Proportion of Variation------
    Number    Eigenvalue        Index     lngnppc    physicians    popgrowth
         1       1.82514      1.00000     0.06129       0.10283      0.11014
         2       0.92648      1.40356     0.69954       0.10651      0.00210
         3       0.24837      2.71078     0.23918       0.79066      0.88776
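The 1/tolerance = VIF relationship is easy to check by hand. As a small worked illustration, the DATA step below takes the three tolerance values from the table above and writes their reciprocals to the log; the results should agree with the Variance Inflation column up to rounding. The variable names tolerance and vif are just names chosen for this sketch.

```sas
/* No data set is created (_null_); the results are written to the log. */
data _null_;
   do tolerance = 0.81937, 0.48274, 0.42347;
      vif = 1 / tolerance;
      put tolerance= 7.5 vif= 8.5;
   end;
run;
```

Small differences in the last decimal place are expected, because the tolerance values printed in the table are themselves rounded.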

OK, let’s see how to examine residuals to check for the possible problems of unequal variance, influential cases, and non-normality. We start with the first two by using the below statement/option.

• The partial option on the model statement (which gives you a partial regression plot for each regressor).

• The plot statement in proc reg.

So, let’s first try this below.

    proc reg data = life.lifeexp_xpand;
       model lexp = lngnppc physicians popgrowth / partial;
       plot r.*p.;
    run;

Examine the output; the partial regression plots (output omitted here… too big) do not seem to show a clear sign of heteroskedasticity, but the plots for the variables “physicians” and “popgrowth” have one case with a rather large residual.


In the plot statement we tell SAS to get the residuals (residuals.) and predicted values (predicted.) and plot them. The residuals. and predicted. can be abbreviated as in the above program. The plot below is just to get some crude and quick idea (crude, because it’s ultimately about error variance). As you see, the residuals tend to spread wider toward the left and narrower toward the right. Also, one case again shows a large deviation from the prediction (lower left corner), which may merit a little more investigation.

So let’s do a little more investigation about possible heteroskedasticity. In this example, we will use an option that allows us to check if there is any problematic pattern of variance.

• The spec option (which performs the White test for unequal error variance). proc reg data = life.lifeexp_xpand; model lexp = lngnppc physicians popgrowth / spec; run; And the test result is here. Test of First and Second Moment Specification DF Chi-Square Pr > ChiSq 9 10.69 0.2973

The null hypothesis of this test is homoskedasticity, and the test fails to reject it (p = 0.2973), so we have no evidence of unequal error variance. Good. Let’s also see if the outlying observation we noticed above may have any significant impact on the result. Again, it is a good idea to start with some visual checks to get a rough idea. Below, we tell SAS to get the Cook’s D statistics, the hat matrix (leverage), and studentized residuals


and plot them for a visual check (the first two with the observation number, and the last one is between the studentized residuals and the leverage statistics). The keywords to use for the statistics are cookd., h., and rstudent. respectively, and the observation number can be obtained by the obs. keyword. proc reg data = life.lifeexp_xpand; model lexp = lngnppc physicians popgrowth; plot cookd.*obs. h.*obs. rstudent.*h.; run;

We get the following three graphs, in the order requested in the above program (red circles mine).


To get further information, let’s save the diagnostic statistics and check the outlying cases. This is our first (and last) example where a PROC step generates a data set.

    proc reg data = life.lifeexp_xpand;
       model lexp = lngnppc physicians popgrowth;
       output out = influence (keep = country res sres lev cd)
              r = res rstudent = sres h = lev cookd = cd;
    run;

Here, to create a SAS data set containing the result statistics (in this case, the residual/influence statistics we are working on), we use the output statement in proc reg, followed by the out = option to specify our new data set name (let’s have it as a temporary SAS data set and name it “influence”; notice that we use the keep= option here to specify the names of the variables to keep in the new data), then by the list of the statistics keywords and the variable names to store them under.

    output out = new_output_data_name output_statistics;

Using this new data set “influence,” we print the cases that exceed the general rule-of-thumb thresholds of these influence statistics.

    proc print data = influence;
       var country sres;
       where abs(sres) > 2;        /* rstudent rule of thumb: abs(rstudent) > 2 */
    run;

    proc print data = influence;
       var country lev;
       where lev > (2*3+2)/63;     /* leverage rule of thumb: > (2k+2)/n, where k = # of IVs, n = # of obs */
    run;

    proc print data = influence;
       var country cd;
       where cd > 4/63;            /* cookd rule of thumb: > 4/n */
    run;
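If you would rather see all three checks side by side, a possible alternative is to compute the flags in one DATA step and print a single list. This is only a sketch: the data set name work.flagged and the flag_ variables are made up for this illustration, and k = 3 and n = 63 are the number of regressors and of observations used in our model.

```sas
data work.flagged;
   set influence;
   n = 63;                               /* observations used in the model */
   k = 3;                                /* number of independent variables */
   flag_sres = (abs(sres) > 2);          /* 1 if the studentized residual is large */
   flag_lev  = (lev > (2*k + 2) / n);    /* 1 if the leverage is high */
   flag_cd   = (cd > 4 / n);             /* 1 if Cook's D is high */
   if flag_sres or flag_lev or flag_cd;  /* keep only the flagged cases */
   drop n k;
run;

proc print data = work.flagged;
   var country sres lev cd flag_sres flag_lev flag_cd;
run;
```

Countries that were not used in the regression have missing statistics, so none of their flags is set and they are not kept.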


From the visual check of the overall pattern and the list of potentially problematic countries, we might particularly want to check Burundi, and perhaps Jamaica, Haiti, and Oman.

proc reg data = life.lifeexp_xpand;
    model lexp = lngnppc physicians popgrowth;
    where country ^= 'Burundi';
run;

We can also try removing “Oman” in the same way. Also try combinations of Burundi and the others. For example,

proc reg data = life.lifeexp_xpand;
    model lexp = lngnppc physicians popgrowth;
    where (country ^= 'Burundi' & country ^= 'Jamaica' & country ^= 'Haiti');
run;

Examine your results (output omitted here). The general conclusion seems to remain the same: the logged per capita GNP and physicians are both statistically significant in the hypothesized direction, so there does not seem to be any influential case so glaring as to change our theoretical argument entirely. It should be noted, however, that Burundi seems to have a rather large impact on the effect size of the physicians variable, whether it is dropped alone or in combination with the other questionable cases; the coefficient estimate drops by approximately 30% when this country is dropped from the analysis.

We can also check the normality of our residuals by using the same temporary data set “influence.” Let’s do a quick check of the residual distribution.

proc univariate data = influence plot normal;
    var res;
    histogram / normal;
    qqplot / normal (mu=est sigma=est);
run;

Here we only add to proc univariate another statement, qqplot (with the normal option), to obtain a normal quantile-quantile plot as well. The mu= and sigma= options for normal display a distribution reference line, and here I set both to est, so that mu and sigma are both estimated from the sample (the sample mean and the sample standard deviation, respectively). As you can see from the visual information (output omitted here), the residuals look fairly normally distributed, except for just one outlier. The normality test shows the following.
Tests for Normality

Test              --Statistic---     -----p Value------
Shapiro-Wilk      W     0.95393      Pr < W      0.0193


Shapiro-Wilk is used when the number of observations is less than 2000 (which is the case in our example), and its W statistic indicates normality when it is closer to 1.0. As you see, the test rejects the null hypothesis of normality. In this case, however, the rejection is most likely due to the one outlier. Try removing Burundi and the test easily fails to reject the null.

Test              --Statistic---     -----p Value------
Shapiro-Wilk      W     0.988949     Pr < W      0.8517

Finally, much of the visual information we learned to obtain above can also be obtained by using ODS graphics. Try the program below and you will quickly get a set of diagnostic plots.

ods graphics on;
proc reg data = life.lifeexp_xpand;
    model lexp = lngnppc physicians popgrowth;
run;
ods graphics off;

Here you go. Some of them are exactly what we got above already.
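Going back to the “try removing Burundi” check on the normality test, here is a minimal sketch of one way you might run it. It assumes you want to refit the model without Burundi and then test the new residuals; the data set name “influence_nb” is made up for this example. A quicker alternative would be to add a where statement to the proc univariate step on the existing “influence” data, which only drops Burundi’s residual without refitting, so the two approaches need not give identical numbers.

```sas
/* Hypothetical sketch: refit without Burundi, save the residuals,  */
/* and rerun the normality test. "influence_nb" is an assumed name. */
proc reg data = life.lifeexp_xpand;
    model lexp = lngnppc physicians popgrowth;
    where country ^= 'Burundi';                     /* exclude the outlying case */
    output out = influence_nb (keep = country res)
           r = res;                                 /* residuals from the refit  */
run;
quit;

proc univariate data = influence_nb normal;         /* normal requests the tests */
    var res;
    qqplot / normal (mu=est sigma=est);
run;
```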


That’s all for The Very Basics of SAS. Thanks for Playing! [1]

[1] Error report: [email protected]
