SAS SQL
SAS Seminar Series
Shamika Ketkar
July 14th, 2008
SQL
♦ Structured Query Language
♦ Developed by IBM in the early 1970s
♦ From the 1970s to the late 1980s there were different types of SQL, based on different databases.
♦ In 1986 the first unified SQL standard (SQL-86) was created.
♦ In 1987 a database interface for SQL was added to the Version 6 Base SAS package
♦ A “language within a language”
SQL Nomenclature
♦ Tables (datasets)
♦ Rows (observations)
♦ Columns (variables)
Anatomy of a PROC SQL Statement
PROC SQL;
  SELECT column list
  FROM table list
  WHERE condition list
  GROUP BY column list
  ORDER BY column list;
QUIT;
Features
♦ SQL looks at datasets differently from SAS
  ◘ SAS looks at a dataset one record at a time, using an implied loop that moves from the first record to the last
  ◘ SQL looks at all the records, as a single object
  ◘ Because of this difference SQL can easily do a few things that are more difficult to do in SAS
♦ SQL commands are available for creating tables, changing table structures, changing values in tables, functions and more…
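As a rough illustration of the "all records as a single object" point, the sketch below centers age on the overall mean in one query. It assumes the view out.c1data and its age and subject columns used elsewhere in this talk; the column name age_centered is made up for the example. PROC SQL computes the mean over all rows and remerges it with the detail rows (it says so in a log note), whereas a DATA step would typically need a separate pass to obtain the mean first.

```sas
PROC SQL;
  /* MEAN(age) is computed over all rows, then remerged with each row */
  SELECT subject,
         age,
         age - MEAN(age) AS age_centered  /* hypothetical column name */
  FROM out.c1data;
QUIT;
```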
Processing Large Datasets: Create View
♦ When a table is created, the query is executed and the resulting data is stored in a file.
♦ When a view is created, the query itself is stored in the file. The data is not accessed at all in the process of creating a view.
♦ By default, PROC SQL prints the result of a query (use the NOPRINT option to suppress this). However, NO output is produced when a view is created.
Create View
PROC SQL;
  CREATE VIEW out.c1data AS
    SELECT *
    FROM data.allgenostarc1 AS a, pheno.new_gtriplet AS b
    WHERE a.subject=b.subject
    ORDER BY a.subject;
QUIT;
Log Snippet
NOTE: SQL view ME.C1DATA has been defined.
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.86 seconds
      cpu time            0.01 seconds
Log Snippet
The CONTENTS Procedure
Data Set Name   out.c1data    Observations   .
Member Type     VIEW          Variables      4
Engine          SQLVIEW       Indexes        0
Protection                    Compressed     NO
Data Set Type                 Sorted         YES

#   Variable   Type   Len   Format    Informat
3   age        Num    8
5   pedid      Num    8     BEST12.   F12.
4   sex        Num    8     BEST12.   F12.
1   subject    Num    8     11.       F11.
♦ SAS stores it with an extension ‘sas7bvew’
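To make the table-versus-view distinction concrete, here is a minimal sketch that stores the same query both ways. The names work.c1table and work.c1view are hypothetical; out.c1data is the view defined in this talk. DESCRIBE VIEW writes the stored query text to the SAS log, which is a quick way to confirm that a view holds a query rather than data.

```sas
PROC SQL;
  /* Table: the query runs now and the result rows are stored */
  CREATE TABLE work.c1table AS
    SELECT * FROM out.c1data;

  /* View: only the query is stored; it runs each time the view is read */
  CREATE VIEW work.c1view AS
    SELECT * FROM out.c1data;

  /* Writes the definition of the view to the SAS log */
  DESCRIBE VIEW work.c1view;
QUIT;
```

Because the view re-runs its query on every read, it always reflects the current contents of the underlying tables, at the cost of repeating the work each time.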
View from View
PROC SQL;
  CREATE VIEW out.agecat AS
    SELECT *,
      CASE
        WHEN .  lt age le 18 THEN 1
        WHEN 18 lt age le 25 THEN 2
        WHEN 25 lt age le 40 THEN 3
        WHEN 40 lt age le 55 THEN 4
        WHEN 55 lt age le 70 THEN 5
        WHEN age gt 70 THEN 6
        ELSE .
      END AS agecat format=1.
    FROM out.c1data;
QUIT;
SQL Functions
PROC SQL;
  SELECT COUNT(DISTINCT subject), agecat, sex
  FROM out.agecat
  GROUP BY agecat, sex;
QUIT;
        agecat  sex
---------------------
     1       1    0
    79       2    0
   118       2    1
   322       3    0
   380       3    1
   608       4    0
   741       4    1
   461       5    0
   452       5    1
    42       6    0
    32       6    1
Macro Variable
PROC SQL NOPRINT;
  SELECT COUNT(DISTINCT subject)
    INTO :subj1-:subj2
  FROM out.agecat
  GROUP BY sex;
QUIT;
%PUT "Males=" &subj1 "Female =" &subj2;
SQL Functions
♦ PROC SQL supports most of the functions available to the SAS DATA step; they can be used in a PROC SQL SELECT statement
♦ Because of how SQL handles a dataset, summary functions work over the entire dataset (or over each GROUP BY group)
♦ Common Functions:
  ◘ COUNT
  ◘ DISTINCT
  ◘ MAX
  ◘ MIN
  ◘ SUM
  ◘ AVG
  ◘ VAR
  ◘ STD
  ◘ STDERR
  ◘ NMISS
  ◘ RANGE
  ◘ SUBSTR
  ◘ LENGTH
  ◘ UPPER
  ◘ LOWER
  ◘ CONCAT
  ◘ ROUND
  ◘ MOD
♦ PROC SQL does not support LAG, DIF, and SOUND functions.
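A short sketch of several of the summary functions listed above, applied per group. It assumes the view out.agecat with its age and sex columns from the earlier slides; the output column names (n_rows, n_missing_age, and so on) are invented for the example.

```sas
PROC SQL;
  SELECT sex,
         COUNT(*)   AS n_rows,         /* rows in each sex group */
         NMISS(age) AS n_missing_age,  /* rows where age is missing */
         MIN(age)   AS min_age,
         MAX(age)   AS max_age,
         AVG(age)   AS mean_age,
         STD(age)   AS sd_age
  FROM out.agecat
  GROUP BY sex;
QUIT;
```

Dropping the GROUP BY clause would produce a single row summarizing the whole dataset.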
Creating Index
PROC SQL;
  CREATE UNIQUE INDEX id ON data.goldn(id);
QUIT;
♦ Indexes are auxiliary data structures that can be used to improve the performance of lookups on large data sources
♦ An index is stored in the same directory as the indexed table, in a separate file with the same name but a different extension
Why use Indexes?
♦ NO Index?
  ◘ Lookups must read the entire data portion of the table from start to finish to be certain of finding all matches
  ◘ This means a lot of CPU and I/O time used to read records that are never needed
♦ Index?
  ◘ SAS will automatically detect and exploit the index if it can improve performance
  ◘ The index file contains a list of key variable values and their location within the data table
  ◘ The index supplies a list of matching record positions, which is then used to interrogate the table itself
  ◘ Only the parts of the table that are needed are read, which means less CPU and I/O time
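The following sketch shows one way to check whether SAS actually uses an index, under stated assumptions: data.goldn is the table indexed earlier, id is assumed to be numeric, the sex column and the index name id_sex are hypothetical, and the lookup value 1001 is made up. Setting MSGLEVEL=I asks SAS to write informational notes, including index usage, to the log.

```sas
/* Ask SAS to write index-usage notes to the log */
OPTIONS MSGLEVEL=I;

PROC SQL;
  /* Composite index: its name must differ from every column name */
  CREATE INDEX id_sex ON data.goldn(id, sex);

  /* A WHERE clause on a key column is a candidate for index use */
  SELECT *
  FROM data.goldn
  WHERE id = 1001;  /* made-up lookup value */

  /* Remove the index when it is no longer needed */
  DROP INDEX id_sex FROM data.goldn;
QUIT;
```

Whether the index is chosen depends on SAS's own cost estimate, so the log note is the thing to check rather than an assumption that the index is always used.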
Merge without Sort
PROC SQL;
  CREATE TABLE goldndata AS
    SELECT *
    FROM goldn.gtriplet AS a, goldn.blood AS b
    WHERE a.id=b.id;
QUIT;
♦ No presorting required
♦ No requirement for common variable names to join on (should be same type, length)
PROC SQL;
  CREATE TABLE goldndata AS
    SELECT *
    FROM goldn.gtriplet AS a, goldn.blood AS b
    WHERE a.myid=b.id;
QUIT;
Combining Datasets: Joins
Join type     DATA step MERGE equivalent
Full Join     If a or b;
Inner Join    If a and b;
Left Join     If a;
Right Join    If b;
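The join types above can be written with explicit JOIN ... ON syntax. This is a sketch only: it reuses goldn.gtriplet and goldn.blood from the merge example, and the output table names left_joined and full_joined are hypothetical.

```sas
PROC SQL;
  /* Left join: every row of a is kept, matched or not (compare "If a;") */
  CREATE TABLE left_joined AS
    SELECT *
    FROM goldn.gtriplet AS a
         LEFT JOIN goldn.blood AS b
         ON a.id = b.id;

  /* Full join: rows from either table are kept (compare "If a or b;").
     COALESCE fills id from b when a row has no match in a. */
  CREATE TABLE full_joined AS
    SELECT COALESCE(a.id, b.id) AS id, a.*, b.*
    FROM goldn.gtriplet AS a
         FULL JOIN goldn.blood AS b
         ON a.id = b.id;
QUIT;
```

Because id appears in both tables, the log warns that the variable already exists and the new table keeps only its first occurrence in the SELECT list. Listing the wanted columns explicitly avoids the warning.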
Changing the Order of Variables
♦ Changing the order of variables in your data set: some genetics software requires id as the first column…
Table 1. Order of variables before changing (oldfile)
Age
Sex
Subject
Table 2. Order of variables after changing (newfile)
Subject
Sex
Age
Changing the order…
PROC SQL;
  CREATE TABLE newfile
    ( subject num,
      sex num,
      age num
    );
  INSERT INTO newfile
    SELECT subject, sex, age
    FROM me.c1data;
QUIT;
proc contents data=newfile; run;
Log Snippet…
Alphabetic List of Variables and Attributes
#   Variable   Type   Len
3   age        Num    8
2   sex        Num    8
1   subject    Num    8
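A possible one-step variant of the reordering above, offered as a sketch rather than the presenter's method: CREATE TABLE ... AS SELECT builds the table and fills it in a single statement, and the column order follows the SELECT list. The table name newfile2 is hypothetical; me.c1data is the source used above.

```sas
PROC SQL;
  /* Column order in the new table follows the SELECT list */
  CREATE TABLE newfile2 AS
    SELECT subject, sex, age
    FROM me.c1data;
QUIT;

/* VARNUM lists variables by position instead of alphabetically */
PROC CONTENTS DATA=newfile2 VARNUM;
RUN;
```

This version inherits column types, lengths and formats from the source, while the CREATE TABLE plus INSERT form lets you declare them yourself.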
Matching, Sounds-Like…
♦ Phonetic Matching: Sounds-Like Operator =*
  ◘ A technique for finding names that sound alike or have variations in spelling. The sounds-like operator "=*" searches and selects character data based on two expressions: the search value and the matched value.
♦ Pattern Matching: % Wildcard character
  ◘ The % acts as a wildcard character representing any number of characters, including any combination of uppercase or lowercase characters. Combining the LIKE predicate with the % (percent sign) permits case-sensitive searches.
PROC SQL;
  CREATE VIEW map AS
    SELECT *
    FROM map.map;
QUIT;
PROC SQL;
  SELECT *
  FROM map
  WHERE GeneSymbol LIKE 'CYP%';
  /* Sounds-like alternative: WHERE GeneSymbol =* "CYP19" */
QUIT;
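Since LIKE is case-sensitive, one common workaround is to normalize case before matching. The sketch below assumes the map.map table and GeneSymbol column from the example above; the search fragments are chosen only for illustration.

```sas
PROC SQL;
  /* Normalize case first so 'cyp19' and 'CYP19' both match */
  SELECT *
  FROM map.map
  WHERE UPCASE(GeneSymbol) LIKE 'CYP%';

  /* '%' may appear anywhere: symbols containing 'IPC' in any position */
  SELECT *
  FROM map.map
  WHERE GeneSymbol LIKE '%IPC%';
QUIT;
```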
Creating Macro Variables with Proc SQL
♦ Select ALL Unique Values Into a Macro Variable: Keyword DISTINCT eliminates duplicates.
PROC SQL NOPRINT;
  SELECT DISTINCT genesymbol
    INTO :gene SEPARATED BY ', '
  FROM map.map;
QUIT;
%put &gene;
List file Snippet
GIMAP4,GIMAP5,GIMAP6,GIMAP7,GIMAP8,GIOT-1,GIP,GIPC1,GIPC2,…
♦ Without the SEPARATED BY clause, a single macro variable in the INTO clause receives only one value (the value from the first row of the result), not the full list of values.
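A small side-by-side sketch of the two forms, so the difference can be checked in the log. It assumes the map.map table from the example above; the macro variable names first_gene and all_genes are made up.

```sas
PROC SQL NOPRINT;
  /* Single macro variable, no SEPARATED BY: only one value is stored */
  SELECT DISTINCT genesymbol
    INTO :first_gene
  FROM map.map;

  /* SEPARATED BY: all values are concatenated into one macro variable */
  SELECT DISTINCT genesymbol
    INTO :all_genes SEPARATED BY ', '
  FROM map.map;
QUIT;

%PUT first_gene=&first_gene;
%PUT all_genes=&all_genes;
```

A macro variable value is limited to 65,534 characters, so very long lists may need to be split across several macro variables.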
♦ Select ALL Unique Values Into a Macro Variable, but this time add double quotes using the QUOTE function and collapse consecutive blanks into a single blank using the COMPBL function.
PROC SQL NOPRINT;
SELECT DISTINCT quote(compbl(genesymbol))
INTO :gene SEPARATED BY ', '
FROM map.map;
QUIT;
%put &gene;
List file Snippet…
"GIMAP4 ","GIMAP5 ","GIMAP6 ","GIMAP7 ","GIMAP8 ","GIOT-1 ","GIP ","GIPC1 ","GIPC2
Macro Variables with Proc SQL contd…
♦ Select all variable names and create a macro array: the simplest way would include the output from proc contents:
PROC CONTENTS DATA=mydata(KEEP = diabetes -- asthma)
              OUT=vars(KEEP = name varNum) NOPRINT;
RUN;
PROC SQL NOPRINT;
  SELECT name
    INTO :row_1 - :row_&SysMaxLong
  FROM vars
  ORDER BY varnum;
QUIT;
CREATING MACRO ARRAYS USING PROC SQL
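One way such a macro array might be used afterwards is sketched below. It repeats the query above so the block stands on its own, and assumes the vars dataset produced by PROC CONTENTS; the names n_vars and list_vars are hypothetical. The automatic macro variable SQLOBS is read immediately after the step to record how many row_ variables were filled.

```sas
PROC SQL NOPRINT;
  SELECT name
    INTO :row_1 - :row_&SysMaxLong
  FROM vars
  ORDER BY varnum;
QUIT;

/* SQLOBS holds the number of rows processed by the statement above */
%LET n_vars = &SQLOBS;

%MACRO list_vars;
  %LOCAL i;
  %DO i = 1 %TO &n_vars;
    /* Double ampersand: resolves first to row_1, row_2, ... then to the name */
    %PUT Variable &i: &&row_&i;
  %END;
%MEND list_vars;

%list_vars
```

The same loop body could generate one statement per variable (for example a PROC FREQ per column) instead of writing to the log.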
Finale
♦ PROC SQL is an additional tool with its own strengths and challenges
♦ Many times it is just another way to do the same thing
♦ BUT other times it might be much more efficient and may cut down the number of sorts, data steps & procedures or lines of code required.
Suggested Readings
♦ Papers
  ◘ SQL for People Who Don’t Think They Need SQL: Erin M. Christen (PharmaSUG 2003)
  ◘ Ten Great Reasons to Learn SAS® Software's SQL Procedure: Kirk Paul Lafler (SUGI23)
♦ Books
  ◘ Proc SQL Beyond the Basics: Kirk Paul Lafler
  ◘ SAS Guide to the SQL Procedure
Thank you!