INTRODUCTION TO SAS

The Statistical Analysis System (SAS) offers: a powerful, high-level programming language; a large library of built-in mathematical and statistical functions (far more than Fortran provides); a state-of-the-art statistical package; an interactive matrix language; time series analysis; model fitting by OLS, GLS, and MLE for many problems, both linear and nonlinear; graphics; and report writing. In most IBM environments (SAS was initially written in PL/I, a language for IBM machines), SAS quickly led to the demise of SPSS and BMD because it was not simply a statistical package; it was also a high-level programming language like FORTRAN. Consequently, it was possible to manipulate and manage data prior to an analysis without writing several different programs to get data into the proper format for SPSS. SAS can also produce tables that are manuscript ready.

A SAS beginner, however, pays for the flexibility by confronting such a large series of options and operations that the language appears almost impossible to learn. This document is meant as an introduction that highlights the most important statements for those interested in using SAS more for number crunching than for fancy report writing. For the most part, the statements included here will get you through SAS with no problem. Go over this document to become familiar with the SAS statements that will be most useful to you, and then consult the SAS manuals for further information about them.

GENERAL SYNTAX

SAS statements may be in upper or lower case and may begin in any column. SAS statements may also extend across lines, and more than one SAS statement may appear on a single line. SAS statements always end with a semicolon (;). Thus, the following two sets of statements are equivalent:

IF SEX=3 THEN DO;
PUT 'BAD SEX FOR ID NUMBER' ID;
DELETE;
END;

/* the following statements are equivalent to those given above */

IF
SEX=3
THEN DO ;
PUT 'BAD SEX FOR ID NUMBER' ID; DELETE; END;

SAS STEPS

There are two general steps in SAS. The first is the DATA step, in which data are read in, manipulated, edited, etc. The second is the PROC or procedure step, in which some statistical procedure (e.g., MEANS, ANOVA) is performed on the data. Any number of DATA and PROC steps can occur in a single program. For example, one can read a set of data in the first DATA step, perform a regression (PROC REG) that outputs predicted values and standardized residuals to the data, use a second DATA step to remove outliers, do another PROC REG without the outliers, and merge the full data set with an existing SAS data file in a third DATA step. The two steps are discussed in turn.
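
A minimal sketch of such a multi-step program follows. The variable names (X, Y, YHAT, ZRESID), the data set names, the cutoff of 3, and the existing data set OLDFILE are all hypothetical and serve only to illustrate the flow. OLDFILE is assumed to be sorted by IDNUM already, and STUDENT= (studentized residuals) stands in here for the standardized residuals.
DATA RAW;
INFILE 'RAWDATA.DAT';
INPUT IDNUM X Y;
/* FIRST REGRESSION: SAVE PREDICTED VALUES AND STUDENTIZED RESIDUALS */
PROC REG DATA=RAW;
MODEL Y=X;
OUTPUT OUT=FITTED P=YHAT STUDENT=ZRESID;
/* SECOND DATA STEP: REMOVE OUTLIERS (HYPOTHETICAL CUTOFF OF 3) */
DATA TRIMMED;
SET FITTED;
IF ABS(ZRESID) GT 3 THEN DELETE;
/* SECOND REGRESSION WITHOUT THE OUTLIERS */
PROC REG DATA=TRIMMED;
MODEL Y=X;
/* THIRD DATA STEP: MERGE THE FULL DATA SET WITH AN EXISTING ONE */
PROC SORT DATA=FITTED; BY IDNUM;
DATA COMBINED;
MERGE FITTED OLDFILE;
BY IDNUM;
RUN;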

DATA STEP

There are four basic uses of the SAS DATA step: (1) getting data into a SAS data set; (2) manipulating the data; (3) managing the data set; (4) creating data. Each is discussed below.

(1) DATA STEP: Getting data into a SAS data set:
(1.a) There are three ways of getting data into a SAS data set. They are:

1. Including the data in the SAS command stream. In this sense the data are like a card deck placed into the stream of SAS commands. Use an INPUT command to list the variables and a CARDS statement right before the data to be read in. By convention, a line containing only a semicolon marks the end of the data.
DATA CARDSIN;
INPUT IDNUM SEX AGE;
CARDS;
1 1 25
2 2 33
4 1 55
;

2. Reading the data in from a disk file. Here one uses the INFILE command to name the disk file that holds the data and the INPUT command to list the variables.
DATA DISKIN;
INFILE 'RAWDATA.DAT';
INPUT IDNUM SEX AGE;

3. Create a new data set from an existing SAS data set. Here, the SET command is used to name the existing SAS data set. The following example creates two new SAS data sets from an existing SAS data set:
DATA FATHERS MOTHERS; SET PARENTS;
IF SEX=1 THEN OUTPUT FATHERS;
ELSE OUTPUT MOTHERS;

(1.b) INPUT statement. There are three forms of the INPUT statement used for reading in raw data: (1) free format input; (2) columnwise input; and (3) formatted input.

1. Free formatted input: Just list the variables. On the raw data file, each case must start on a separate record and each variable must be separated by at least one space. For example,
INPUT IDNUM SEX AGE;

2. Columnwise input: The input statement lists each variable and its inclusive column numbers. For example, INPUT IDNUM 1-4 SEX 6-6 AGE 8-9;

3. Formatted input: This lists the variables in parentheses, then the format, also in parentheses. For example, INPUT (IDNUM SEX AGE) (4. +1 1. +1 2.);

SAS does not distinguish between integers and real numbers. It treats all numeric variables as real. Thus, the "4." is equivalent to an "F4.0" in FORTRAN, a "4.3" to an "F4.3", etc. In formatted input, SAS uses a "+n" to skip n columns, a "/" to skip to the next record, an "@n" to move to column n, and a "#n" to move to record n. Thus the following two SAS statements are equivalent:

INPUT (ITEM SCORE X SEX) (2. 8.3 @20 10.6 / @10 8.);
or
INPUT (ITEM SCORE X SEX) (#1 2. 8.3 @20 10.6 #2 @10 8.);

By using the N=n option on the INFILE statement, which makes n records available to the input pointer at one time, one can also skip around on formatted input. For example, you can read one variable from columns 6-10 on the first record, read the second variable from columns 77-78 of the fourth record, go back to the first record and read columns 1-2, etc.
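
The skipping around described above might be sketched as follows. The variable names SCORE, AGE, and GROUP are hypothetical, and each case is assumed to occupy four records. The INPUT statement ends with a "#4" so that the last pointer position matches the N= value, which is the usual precaution for making sure all four records are consumed before the next case is read.
DATA SKIP;
INFILE 'RAWDATA.DAT' N=4;
/* RECORD 1, COLS 6-10; RECORD 4, COLS 77-78; BACK TO RECORD 1, COLS 1-2 */
INPUT #1 @6 SCORE 5. #4 @77 AGE 2. #1 @1 GROUP 2. #4;
RUN;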

NOTE WELL: In SAS formatted input, the format list is processed only until the variable list is exhausted. Thus, slashes ("/") at the end of a format are NOT processed, and SAS will not skip records the way some FORTRAN dialects will. In the following statement, for example, the trailing slash is not processed:

INPUT (ITEM SCORE) (2. 8.3 /);

(2) DATA STEP: Manipulating data
SAS is a very rich language for data manipulation. The most useful commands are given here. See the SAS manual for their syntax.

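ARRAY Works like a DIMENSION statement in Fortran, except that it groups existing variables under an array name. Subscripts start at 1 unless explicit bounds are given. For example,
ARRAY A [0:50] A0-A50;
ARRAY B [0:50] B0-B50;
/* COPY THE EVEN-NUMBERED ELEMENTS OF B INTO A */
DO I=0 TO 50 BY 2;
A[I] = B[I];
END;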

DELETE Deletes the observation from a data set.
IF SEX NE 'MALE' AND SEX NE 'FEMALE' THEN DELETE;

DO ... END Works like a DO loop in Fortran. Without an index variable, it simply groups statements, as after an IF ... THEN.
IF SEX NE 'MALE' AND SEX NE 'FEMALE' THEN DO;
PUT 'VALUE OF SEX IN ERROR FOR ID NUMBER ' ID;
DELETE;
END;

OUTPUT Outputs the observation to the current data set.

MISSING Declares characters (e.g., letters) that are to be read as special missing values in numeric input. By default, SAS uses a period (.) as a missing value.
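
As a small illustration, assuming a hypothetical survey in which R codes "refused" and N codes "not asked" for the numeric variable AGE, the MISSING statement lets those letters be read as special missing values rather than as invalid data:
DATA SURVEY;
MISSING R N;
INPUT IDNUM AGE;
CARDS;
1 25
2 R
3 N
;
PROC PRINT;
RUN;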

GO TO Like a GOTO in BASIC or FORTRAN, but SAS statements are given name labels that end with a colon. A RETURN before the label keeps other observations from falling through into the labeled statements.
IF RESID GT 3 THEN GOTO OUTLIER;
/* other SAS statements go here */
RETURN;
OUTLIER:
PUT 'ID NUMBER ' ID 'IS AN OUTLIER.';
RESID= .;
DELETE;
RETURN;

LINK...RETURN This is analogous but not identical to a subroutine call. Processing returns to the SAS statement after the LINK statement. Be careful about the RETURN statement. A RETURN statement that is not part of a LINK series begins processing of a new observation. A RETURN statement after a LINK returns to the statement following the LINK.
RETAIN SUMX 0 LASTID; /* KEEP THESE VALUES FROM ONE OBSERVATION TO THE NEXT */
IF IDNUM NE LASTID THEN LINK NEWID;
SUMX=SUMX+X;
RETURN;
NEWID:
OUTPUT;
SUMX=0;
LASTID=IDNUM;
RETURN;

SELECT An economical way to write a number of IF ... THEN statements.
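
A brief sketch of SELECT, assuming the numeric coding SEX=1 for males and SEX=2 for females used in the PARENTS example above. The new variable GENDER and the data set name RECODED are hypothetical.
DATA RECODED;
SET PARENTS;
LENGTH GENDER $ 6;
SELECT (SEX);
WHEN (1) GENDER='MALE';
WHEN (2) GENDER='FEMALE';
OTHERWISE DO;
PUT 'BAD SEX FOR ID NUMBER ' IDNUM;
DELETE;
END;
END;
RUN;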

PUT Equivalent of WRITE or PRINT in Fortran. If a FILE statement is used, the output goes to the named file; otherwise, it goes to the SAS log. The PUT statement has the same three syntax forms as the INPUT statement: free form, columnwise, and formatted.
FILE 'SASOUT.DAT';
PUT (ID SEX AGE) (4. 2. 3.);

(3) DATA STEP: Data Management
SAS data sets can exist only for the duration of a single job or they can be stored on a more permanent basis like an SPSS system file. Permanent SAS data sets are very easy to manipulate, update, etc. But when a SAS data set is first made permanent or later updated, it is a good idea to manage the data to reduce storage costs. The following statements are useful in this regard:

DROP Drop the variables from the data set. Useful for variables only used for temporary programming convenience.
DROP I J TEST1-TEST5 ABX--XXY;

KEEP Keeps the listed variables. Opposite of DROP.

LENGTH Sets the storage size for variables. If you are creating or analyzing large SAS data sets, then it is highly recommended that a length statement be used for integer type data because all SAS variables default to real variable type storage (and thus take up more room). The following statement sets the 566 Minnesota Multiphasic Personality Inventory items to three bytes of storage each:
LENGTH MMPI1-MMPI566 3;
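
The statements above might be combined as in the following sketch, which stores a permanent data set and later pulls a small subset from it. The library name MYLIB, the directory 'C:\SASDATA', the raw file 'MMPI.DAT', and the variables IDNUM and NTRUE are hypothetical. The LIBNAME statement assumes a version of SAS that supports it, and the items are assumed to be coded 1 for a "true" response.
LIBNAME MYLIB 'C:\SASDATA';
DATA MYLIB.MMPI;
LENGTH MMPI1-MMPI566 3;
INFILE 'MMPI.DAT';
INPUT IDNUM MMPI1-MMPI566;
ARRAY ITEM [566] MMPI1-MMPI566;
/* COUNT THE ITEMS ANSWERED TRUE */
NTRUE=0;
DO I=1 TO 566;
IF ITEM[I]=1 THEN NTRUE=NTRUE+1;
END;
DROP I;
/* LATER: A SMALL WORKING DATA SET WITH ONLY TWO VARIABLES */
DATA SHORT;
SET MYLIB.MMPI;
KEEP IDNUM NTRUE;
RUN;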

(4) DATA STEP: Data creation
The SAS DATA step can also be used for general programming when no "real" data are used. SAS has a wide variety of built-in functions that can generate Monte Carlo data or compute a statistic that is hard to calculate by hand. See the SAS manual for the functions.
DATA _NULL_;
/* THIS GIVES THE CUMULATIVE (LOWER TAIL) PROBABILITY OF A CHI SQUARE
OF 27.4275 WITH 36 DEGREES OF FREEDOM */
P1=PROBCHI(27.4275,36); PUT P1=;
/* THIS GIVES THE SAME, BUT WITH A NONCENTRALITY PARAMETER OF 10 */
P2=PROBCHI(27.4275,36,10); PUT P2=;
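
Monte Carlo data can be generated in the same way. The following sketch, with a hypothetical data set name, sample size, and seed, uses the RANNOR function to create 1000 pairs of standard normal scores whose population correlation is 0.5, and then checks the sample correlation:
DATA SIMUL;
DO I=1 TO 1000;
X=RANNOR(12345);
/* Y HAS UNIT VARIANCE AND CORRELATES 0.5 WITH X IN THE POPULATION */
Y=0.5*X+SQRT(0.75)*RANNOR(12345);
OUTPUT;
END;
DROP I;
PROC CORR; VAR X Y;
RUN;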

PROC STEP

The PROC step calls up a SAS procedure. Some SAS procedures are for data management; others are for statistical analysis. Several SAS specialty packages exist for specialized graphics, data management (along the lines of dBASE-III), communications, econometric and time series analysis, computer-based training, census tract and SMSA data bases and directories, etc. SAS can convert SPSS, BMD-P, OSIRIS, and other data sets into SAS format with a one-line command. It can call up a BMDP program to analyze data in a SAS data set. It will even draw you a map of the wine growing regions of Portugal, if you so desire. Only the basic PROCs are listed here. SAS often has several PROCs for a given type of analysis; one is a general purpose PROC, and the others are more efficient subtypes. PROC steps may also produce specialized data sets that can be input into other SAS procedures (e.g., using PROC FACTOR to produce a factor data set that can then be used to try different rotational strategies, different numbers of extracted factors, etc.).
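
The PROC FACTOR case mentioned above might look like the following sketch. The data set TESTS, the variables TEST1-TEST10, and the choice of two factors are hypothetical. The first step saves the factor solution with the OUTSTAT= option, and the later steps read that data set to try two different rotations without repeating the extraction.
PROC FACTOR DATA=TESTS METHOD=PRINCIPAL NFACTORS=2 OUTSTAT=FACT;
VAR TEST1-TEST10;
/* TRY AN ORTHOGONAL ROTATION, THEN AN OBLIQUE ONE */
PROC FACTOR DATA=FACT ROTATE=VARIMAX;
PROC FACTOR DATA=FACT ROTATE=PROMAX;
RUN;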

Data Management Procedures
PROC SORT Sorts a data set by one or more variables. PROC SORT; BY ID; will sort the data set by the values of the variable ID.
PROC CONTENTS Displays the contents of the data set.
PROC DATASETS Manages SAS data set libraries.
PROC RANK Rank orders one or more variables.
PROC STANDARD Rescales variables to a specified mean and/or standard deviation.
PROC SCORE Generates linear scores for certain procedures like factor analysis and discriminant analysis.
PROC TRANSPOSE Transposes a data set.

Low Resolution Graphics
PROC CHART Pie, bar, and star charts.
PROC PLOT Two dimensional plots.

Descriptive Statistics
PROC FREQ Simple frequencies and contingency tables for categorical variables.
PROC MEANS Number of observations, mean, standard deviation, and minimum and maximum values for continuous variables.
PROC UNIVARIATE More detailed descriptive statistics for continuous variables.
PROC TABULATE Produces tables of frequencies and/or descriptive statistics.
PROC SUMMARY Descriptive statistics broken down by groups; particularly useful for generating a data set of descriptive statistics for input into other procedures.
PROC CORR Parametric and nonparametric correlations.

Regression Procedures
PROC REG General purpose linear regression and multivariate regression.
PROC GLM General linear models, including regression, analysis of variance/covariance, and multivariate analysis of variance/covariance.
PROC RSQUARE All possible subsets of regression.
PROC RSREG Quadratic response surface regression.
PROC LOGISTIC Logistic regression.
PROC PROBIT Probit regression.

Analysis of Variance Procedures
PROC ANOVA Analysis of variance for orthogonal data.
PROC GLM General linear models, including regression, analysis of variance/covariance, and multivariate analysis of variance/covariance.
PROC NESTED Nested analysis of variance.
PROC VARCOMP Variance components.

Discriminant Procedures
PROC DISCRIM General purpose parametric and nonparametric discriminant analysis.
PROC CANDISC Canonical discriminant analysis.

Principal Components and Factor Analysis Procedures
PROC PRINCOMP Principal components.
PROC FACTOR Factor analysis.

Survival Analysis
PROC LIFETEST Nonparametric and life tables.
PROC LIFEREG Parametric survival analysis.

Cluster Analysis
PROC CLUSTER Clustering observations.
PROC FASTCLUS Disjoint clustering for large data sets.
PROC VARCLUS Clustering variables.
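
As one illustration of a descriptive procedure that writes its results to a data set, the following sketch runs PROC SUMMARY on the CARDSIN data set created earlier. The output data set name AGESTATS and the statistic names MEANAGE, SDAGE, and NAGE are hypothetical.
PROC SUMMARY DATA=CARDSIN NWAY;
CLASS SEX;
VAR AGE;
/* ONE OUTPUT OBSERVATION PER VALUE OF SEX */
OUTPUT OUT=AGESTATS MEAN=MEANAGE STD=SDAGE N=NAGE;
PROC PRINT DATA=AGESTATS;
RUN;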

Interleaving DATA and PROC Steps

To illustrate the utility of the SAS DATA step used in conjunction with various SAS procedures, consider the problem of getting the correlation matrices for a multivariate twin analysis of the National Merit Twin data on the National Merit test. SAS is used to "double enter" the twins, creating an intraclass matrix. The data are then sorted by sex and zygosity, and the correlation matrices are calculated and written to a raw data file so that model fitting can be done. (SAS can also do model fitting efficiently with IML, but that is too advanced for our purposes here.) Finally, pooled MZ and DZ matrices are computed by combining males with females, and these are also written out.
DATA NMTWINS;
INFILE 'NMTDIR\NMTNMT.DAT' N=2;
/* A PAIR OF TWINS IS READ IN. THE NATIONAL MERIT SCORES FOR TWIN A
HAVE MNEMONIC NAMES, THOSE FOR TWIN B ARE CALLED NMT1 THROUGH NMT5 */
INPUT (SEX ZYG ENGLISH MATH SOCSCI NATSCI VOCAB NMT1-NMT5)
(@6 2*2. @10 5*2. #2 @10 5*2.);
ARRAY TA [5] ENGLISH--VOCAB;
ARRAY TB [5] NMT1-NMT5;
OUTPUT;
/* REVERSE TWIN SCORES */
DO I=1 TO 5;
TEMP=TA[I];
TA[I]=TB[I];
TB[I]=TEMP;
END;
DROP TEMP I;
OUTPUT;

/* SORT THE DATA SET BY SEX AND ZYGOSITY */
PROC SORT; BY SEX ZYG;

/* GET THE CORRELATION MATRICES FOR EACH SEX & ZYGOSITY COMBINATION
AND SAVE THE MATRICES IN DATA SET TWINCORR (OUTP= NAMES THE DATA SET
THAT RECEIVES THE PEARSON CORRELATIONS) */
PROC CORR OUTP=TWINCORR;
BY SEX ZYG;
VAR ENGLISH--NMT5;

/* WRITE A VECTOR OF STANDARD DEVIATIONS FOLLOWED BY THE CORRELATION
MATRIX ONTO DISKFILE NMTNMT.COR */
DATA _NULL_;
SET TWINCORR;
IF _TYPE_='STD' OR _TYPE_='CORR';
FILE 'C:\NMTDIR\NMTNMT.COR';
PUT (ENGLISH--NMT5) (5*12.8/5*12.8);

/* POOL MALES & FEMALES--STANDARDIZE BY SEX TO REMOVE MEAN GENDER
DIFFERENCES */
PROC STANDARD DATA=NMTWINS MEAN=0 STD=1; /* WITHOUT MEAN= AND STD= NOTHING IS RESCALED */
BY SEX;
VAR ENGLISH--NMT5;
PROC SORT;
BY ZYG;

PROC CORR OUTP=TWINCOR2;
BY ZYG;
VAR ENGLISH--NMT5;
DATA _NULL_;
SET TWINCOR2;
IF _TYPE_='STD' OR _TYPE_='CORR';
FILE 'NMTDIR\NMTNMT2.COR';
PUT (ENGLISH--NMT5) (5*12.8/5*12.8);