/* --- File glim01.sas ---
   Example of generalized linear models using the detroit data set.
   The first model is a simple multiple regression, but instead of
   least squares, it uses maximum likelihood to obtain parameter
   estimates and tests of the hypotheses.
   The second model uses a log link function.
   Examine the variance of the residuals to see which model gives
   a better fit --- */

* - NOTE: You must assign the LIBNAME p7291 to the directory
    containing the detroit data set;
*LIBNAME p7291 '';

TITLE Example of GLIM: Generalized Linear Models;

DATA temp;
  SET p7291.detroit;
RUN;

* - The DIST= option specifies the distribution of the dependent
    variable (termed a "response variable" in GLIM parlance). Here it
    is assumed to be normal. The LINK= option gives the function of the
    expected value of the response variable that is a linear function
    of the predictor variables (got that?). In this case, "identity"
    implies that the expected value of the homicide rate is a linear
    function of the predictors--i.e., an ordinary multiple regression.
    The OUTPUT statement creates a new data set (named by OUT=)
    containing the predicted values and the deviance residuals;
TITLE2 'maximum likelihood multiple regression';
PROC GENMOD DATA=temp;
  MODEL hom = year uemp lic / DIST = normal
                              LINK = identity
                              TYPE3;
  OUTPUT OUT=temp2 PREDICTED=Pred_Reg RESDEV=Resid_Reg;
RUN;

* - Here the link function is a log. The log of the expected value of
    the homicide rate is a linear function of the predictors. The input
    is temp2 (not temp) so that temp3 carries the residuals of both
    models;
TITLE2 'LOG link function';
PROC GENMOD DATA=temp2;
  MODEL hom = year uemp lic / DIST = normal
                              LINK = log
                              TYPE3;
  OUTPUT OUT=temp3 PREDICTED=Pred_Log RESDEV=Resid_Log;
RUN;

* - Generally the model with the higher log likelihood is the one to
    be preferred.
    Let us also check the variance of the residuals to assess the two
    models. Note that inspection of MIN and MAX is also useful in case
    one of the models gives an outlier;
TITLE2 Comparison of the standard deviation of the residuals;
PROC MEANS DATA=temp3 MEAN STD VAR MIN MAX;
  VAR Resid_Reg Resid_Log;
RUN;

* - The log link function gave a variance of the residuals (1.41) that
    is roughly 60% of the variance using the identity link regression
    (2.33). Hence the log link function is preferable;
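
* - A minimal sketch, not part of the original example, of the log
    likelihood comparison mentioned above. It assumes that the ODS
    table holding the GENMOD fit criteria is named ModelFit and that
    it has the columns Criterion and Value. Check this with ODS TRACE
    ON if your SAS release differs. The data set names fit_reg,
    fit_log and fit_both and the variable model are made up for
    illustration;
TITLE2 'Log likelihood of the two models';

ODS OUTPUT ModelFit=fit_reg;
PROC GENMOD DATA=temp;
  MODEL hom = year uemp lic / DIST = normal
                              LINK = identity;
RUN;

ODS OUTPUT ModelFit=fit_log;
PROC GENMOD DATA=temp;
  MODEL hom = year uemp lic / DIST = normal
                              LINK = log;
RUN;

* - Stack the two fit tables, label each row with its link function
    and keep only the rows whose criterion starts with Log Likelihood;
DATA fit_both;
  LENGTH model $ 8;
  SET fit_reg (IN=in_reg) fit_log;
  IF in_reg THEN model = 'identity';
  ELSE model = 'log';
  IF UPCASE(Criterion) =: 'LOG LIKELIHOOD';
RUN;

PROC PRINT DATA=fit_both NOOBS;
  VAR model Criterion Value;
RUN;

* - Under these assumptions the row with the larger Value is the
    model to be preferred, which can be set beside the residual
    variances from PROC MEANS;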