Statistics, Data Analysis, and Data Mining

Paper 264-26

Model Fitting in PROC GENMOD

Jean G. Orelien, Analytical Sciences, Inc.

Abstract

There are several procedures in the SAS System for statistical modeling. Most statisticians who use the SAS System are familiar with procedures such as PROC REG and PROC GLM for fitting general linear models. However, PROC GENMOD can handle these general linear models as well as more complex ones such as logistic models, loglinear models or models for count data. In addition, the main advantage of PROC GENMOD is that it can accommodate the analysis of correlated data. In this paper, we discuss the use of PROC GENMOD to analyze simple as well as more complex statistical models. When other procedures are available to perform the same analysis, we highlight the options from these procedures that may be missing in PROC GENMOD but might be of interest to the user. An example is given showing how PROC GENMOD is used to analyze various types of endpoints (continuous and count data) from a toxicology experiment. The material in this paper should be accessible even to users with limited data analysis skills.

1. Introduction

Generalized linear models (GLMs) include the most common statistical models used in statistics. This class of models includes general linear models and logistic models. It should be noted that general linear models include ANOVA models as well as regression analysis. Complex models such as those arising from correlated data (repeated measures, clustered data) can also be fitted with GLMs. The form of a GLM is given by:

$$f(Y) = X\beta + \varepsilon \qquad (1)$$

The function f is known as the link function. For ANOVA, the link function is the identity. Other possible link functions include the logit for logistic regression or the log for count data. Whereas in general linear models it is necessary to assume that the errors are independent, have equal variances and are normally distributed, these assumptions can be relaxed in GLMs.

In the SAS System, GLMs can be fitted in PROC GENMOD. But separate procedures exist for certain subclasses of GLMs, such as logistic regression or general linear models. For example, PROC GLM can fit general linear models. Although regression analysis can be performed with PROC GLM, PROC REG is more specific to this type of analysis. Similarly, there is also PROC ANOVA, which as the name indicates is specific to ANOVA models. Another SAS procedure for analyzing a subclass of GLMs is PROC LOGISTIC.

In this paper, we provide an overview of some of the models that can be fitted with PROC GENMOD. When these models can be fitted by other SAS procedures, we outline some differences between these procedures and GENMOD that the user needs to be aware of. In section 2, we discuss the fitting of general linear models in GENMOD and other procedures. The fitting of logistic models is discussed in section 3. Other types of models, such as those involving link functions other than the logit and the identity, as well as models for correlated data, are discussed in section 4. In section 5, we provide an example showing how we have used PROC GENMOD to analyze different types of endpoints from the National Toxicology Program (NTP).

2. Fitting of General Linear Models in GENMOD and Other Procedures

There are many procedures besides PROC GENMOD in the SAS System for the fitting of general linear models. The three most commonly used are PROC ANOVA, PROC REG and PROC GLM. PROC GLM is the most comprehensive of the three procedures. Any analysis of general linear models can be performed in this procedure. However, the other two procedures are more efficient or offer more options for certain subclasses of general linear models. PROC ANOVA handles analysis of variance models with balanced designs. For these models, PROC ANOVA is faster than PROC GLM. For regression models, PROC REG provides many options that are not found in other procedures. Some of the advantages of PROC REG are that the user can fit several models with one call of the procedure, and that it offers more options for model selection and diagnostic tools to detect multicollinearity. Multicollinearity occurs when the independent variables in the model are correlated among themselves. This can lead to large variances for the estimated coefficients and affect our interpretation of these coefficients.

Although PROC GENMOD can fit any general linear model, there are many useful options that it does not provide. For example, in an ANOVA model with random effects, one may be interested in estimating the variance components. This would not be possible in PROC GENMOD. With ANOVA models, one may also want to compare the means of an independent variable at several levels. As of version 7.0, there is an LSMEANS statement in PROC GENMOD, but unlike in PROC GLM, only the least squares means themselves can be obtained with this statement; no statistical comparisons of the means can be made. It should be noted that while the other procedures mentioned in this section use least squares methods to estimate the coefficient parameters, PROC GENMOD uses maximum likelihood methods. For general linear models, the maximum likelihood and the least squares methods yield the same estimates.

There are instances where the data analyst is familiar with the data and may want to use PROC GENMOD to create outputs from the analysis of general linear models that are uniform with outputs from more complex models that can only be analyzed in PROC GENMOD. An example of that is given in section 5. In general, we suggest that GENMOD be used for the analysis of general linear models only in those instances where the analyst wishes to obtain coefficient estimates and their variances. In these cases, the analyst should know from prior experience with similar data that the data are well "behaved" and that no general linear model assumptions are violated.

We give below some examples of the use of PROC GENMOD for the analysis of general linear models.

2.1 Example of a Regression Analysis

Suppose sales for a company in a district can be predicted as a function of the target population and the per capita discretionary income. The data come from Applied Linear Statistical Models by Neter, Wasserman and Kutner (page 249). Regression coefficients can be obtained with either of the following two syntaxes:

    proc genmod data=salesdata;
       model sales = target_population discretionary_income;
    run;

or

    proc reg data=salesdata;
       model sales = target_population discretionary_income;
    run;

2.2 Example of an Analysis of Variance

Suppose a research laboratory develops a new compound for the relief of severe hay fever and wants to compare the effect of the ingredients on the outcome. The outcome being measured is the number of hours of relief. This hypothetical example is also taken from Neter, Wasserman and Kutner (page 722). We could perform this analysis in PROC GENMOD with the following syntax:

    proc genmod data=compounds_data;
       class ingredient1 ingredient2;
       model hours_of_relief = ingredient1 ingredient2;
    run;

The same analysis could also be performed in PROC GLM:

    proc glm data=compounds_data;
       class ingredient1 ingredient2;
       model hours_of_relief = ingredient1 ingredient2;
    run;

3. Fitting of Logistic Models in PROC GENMOD and PROC LOGISTIC

Logistic models are of the form:

$$\log\frac{p}{1-p} = X\beta + \varepsilon \qquad (2)$$

These models are appropriate for modeling proportions. Similar to a regular regression, a logistic model can be used to predict the proportion p that will be obtained for given values of the independent variables. But a logistic model can also be used to determine whether an independent variable significantly affects the variation of the dependent variable. In these cases, we are interested in knowing whether the odds of having the outcome are the same for all levels of the independent variable(s). For example, we may be interested in determining whether the odds of having a given disease are the same for smokers and nonsmokers. If one is interested in building a model to predict the variation of a proportion as a function of independent variables, then PROC LOGISTIC would seem to be the clear choice because of the options it provides for model selection. On the other hand, if one is only interested in finding out whether an independent variable has a significant effect on the variation of a proportion, then the analyst has the choice of using either PROC LOGISTIC or PROC GENMOD. In earlier versions of the SAS System (at least up to version 6.12), there were no CLASS or CONTRAST statements in PROC LOGISTIC. Thus, for the analysis of categorical variables one might have preferred PROC GENMOD over PROC LOGISTIC in earlier versions, since these categorical variables would have had to be recoded in a data step prior to the call of the LOGISTIC procedure.

We give here an example of the use of PROC GENMOD for the analysis of binary data.

Example of a logistic regression analysis with binary data

Bliss (1935) reports the proportion of beetles killed after 5 hours of exposure at various concentrations of gaseous carbon disulphide. To obtain the regression coefficients for modeling the proportion of beetles killed as a function of dosage, the following SAS code can be used:

    proc genmod data=beetle_data;
       model number_killed/number_of_beetles = dosage / link=logit dist=binomial;
    run;

PROC LOGISTIC could also be used:

    proc logistic data=beetle_data;
       /* the logit link is the default here; the MODEL statement has no DIST= option */
       model number_killed/number_of_beetles = dosage / link=logit;
    run;

Notice that in the PROC GENMOD call we needed to specify the LINK= and DIST= options, since the default values would not be appropriate for the analysis of binary data. In PROC LOGISTIC the logit link is the default and the MODEL statement has no DIST= option, so neither is required there. The outcome under study is expressed as a ratio of two variables. Alternatively, we could use a dichotomous variable taking values 0 or 1 to indicate whether or not an individual beetle was killed and model that variable as a function of the dosage of carbon disulphide.

4. Analysis of Count Data and Correlated Data

The main advantage of PROC GENMOD compared to other data analysis procedures is that it can fit complex models that cannot be fitted in other procedures for linear models such as PROC GLM or PROC LOGISTIC. In this section, we discuss the use of PROC GENMOD for the analysis of count data and correlated data. PROC GENMOD can fit data arising from a number of distributions. If the distribution is not available as an option, the user can even specify that distribution. One of the distributions available in PROC GENMOD is the Poisson distribution, which is generally used for count data. Other available distributions include the gamma, inverse Gaussian and negative binomial.

Correlated data can arise from taking repeated measurements on subjects or from subjects belonging to the same cluster. The cluster can be a geographical region, a clinical site in a multi-site study or a litter in a toxicity study. Failure to account for the correlation in the data can result in underestimating the variance, which leads to artificially low p-values. Several methods can be used to analyze correlated data, including general linear multivariate models (GLMMs) or linear mixed models. GLMMs can be fitted in PROC GLM and linear mixed models can be fitted using PROC MIXED. Generalized estimating equations (GEE) methods, which are used in GENMOD to account for correlated data, may in many situations be preferred over the other methods mentioned above for various reasons (such as missing data or non-normality).

For correlated data, the analyst must specify a "working correlation matrix". This working correlation matrix reflects the analyst's assumption about the correlation structure between observations from the same cluster. The correlation structure can take many forms. One of the most commonly made assumptions is that the correlation within a cluster is exchangeable; that is, the correlation is the same between any two elements of a cluster. Other correlation structures that are available include independent, autoregressive and m-dependent. The user can also specify a fixed correlation matrix.

The information that PROC GENMOD needs to model the correlation in the data is supplied through the REPEATED statement. There are options in the REPEATED statement to specify the form of the correlation structure as well as convergence criteria. The CORR= option (also written TYPE=, as in the examples in this paper) is probably the most important option in the REPEATED statement. It is used to specify the "working correlation matrix" described above, and the SUBJECT= option identifies the cluster.

Example

Paul (1982) reported an experiment in which pregnant rabbits were dosed with an unspecified toxic substance. The foetuses were observed for skeletal and visceral abnormalities. The cluster here is the litter. Typically, in these types of experiments, it is assumed that the correlation within a litter is exchangeable; that is, the correlation is the same between any two littermates. The data could be analyzed with the following syntax:

    proc genmod data=rabbit_toxicity;
       class litter dose;
       /* dist=binomial is stated explicitly so that the binary response
          is not fitted with the default normal distribution */
       model malformation = dose / dist=binomial link=logit;
       repeated subject=litter / sorted type=exch;
    run;

(The SORTED option tells SAS that the data are properly sorted by subject.)

5. Example of the use of PROC GENMOD to analyze various types of endpoints from a toxicity study

In analyzing data from toxicity studies, my preference has been to use PROC GENMOD over other procedures. In this section, I give a brief description of the data from these toxicology experiments. I also discuss why we prefer to use PROC GENMOD over other procedures and how we use it.

A toxicology study can best be seen as a number of independent experiments on the effect of the same explanatory variable. For example, to study the health effects of a given chemical, one experiment might be conducted to investigate the effect of this chemical on male reproductive organs and another experiment with the same chemical would investigate its effect on female reproductive organs. In most of these experiments, such as the example from the previous section, the data will be correlated; in some others, the data will be uncorrelated. For example, in some experiments in the reproductive toxicity studies conducted by the National Institute of Environmental Health Sciences (NIEHS), rodents selected from different litters are given the toxic substance and are then sacrificed. In these types of experiments, we would consider the data to be uncorrelated, and traditional ANOVA methods could be used. The endpoints collected can be binary (such as malformation), continuous (such as organ weights) or count (such as sperm count). Even within the same experiment, there may be different types of endpoints.

In analyzing these data, we have preferred to use PROC GENMOD over other procedures. The main reason is that with the versatility of PROC GENMOD, we can handle correlated and uncorrelated data regardless of the type of endpoint. This makes it easier to analyze all of the endpoints from an experiment using a single SAS macro. For some of these endpoints the use of PROC MIXED might have been preferable. We opted against it, since doing so would have required that we use other procedures for endpoints that do not follow the normal distribution (binary and count data). Another reason for not using PROC MIXED is that for some endpoints convergence can be difficult to achieve and would require some trial and error. Given that the number of endpoints to be analyzed can be more than 300, it is not practical to fit the endpoints one at a time.

To handle the large number of endpoints coming from these studies, we manipulated the data so that it could be sorted by endpoint, cluster and dose group. With the data sorted in this manner, endpoints of the same type (binary, continuous or count) were grouped together and analyzed in the same call to PROC GENMOD. For example, the continuous and count endpoints from an experiment where the data were correlated would be handled by the following SAS code:

    /* Correlated Data */

    /* Continuous Endpoints */
    proc genmod data=toxdata(where=(count=0));
       class dose litter;
       model outcome = dose / type3 link=identity covb;
       repeated subject=litter / type=exch maxiter=25000 covb corrb;
       by endpt;
    run;

    /* Count Endpoints */
    proc genmod data=toxdata(where=(count=1));
       class dose litter;
       model outcome = dose / type3 d=poisson covb;
       repeated subject=litter / type=exch maxiter=25000 covb corrb;
       by endpt;
    run;

In the SAS macro we have a macro variable to identify whether the observations from the experiment are correlated or not. For an experiment where the data were uncorrelated, the SAS code below would be executed by the macro:

    /* Uncorrelated Data */

    /* Continuous Endpoints */
    proc genmod data=toxdata(where=(count=0));
       class dose;
       model outcome = dose / type3 link=identity covb;
       by endpt;
    run;

    /* Count Endpoints */
    proc genmod data=toxdata(where=(count=1));
       class dose litter;
       model outcome = dose / type3 d=poisson covb;
       by endpt;
    run;

Conclusion

Data that can be analyzed in PROC ANOVA, PROC GLM, PROC REG, or PROC LOGISTIC can also be handled in PROC GENMOD. There are instances where the analyst is familiar enough with the data that the only outputs of interest are the parameter estimates, standard errors, p-values and confidence intervals. In these instances, the use of PROC GENMOD might be preferred. We have given an example from toxicology data where using PROC GENMOD is more efficient because of the different types of endpoints that have to be analyzed from these experiments.

References

Agresti, A. (1990). Categorical Data Analysis. New York: Wiley.

Neter J., Wasserman J. and Kutner M. (1990). Applied Linear Statistical Models. Boston: Irwin.

Orelien et al. (2000). Multiple Comparison with a Control in GEE Models Using the SAS System. Presented at the SAS User Group International Conference in Indianapolis.

Stokes M.E., Davis C.S. and Koch G.G. (1995). Categorical Data Analysis Using the SAS System. Cary: SAS Institute, Inc.

Contact Information

Your questions and comments are welcome. Please contact:

Jean G. Orelien
Analytical Sciences, Inc.
2605 Meridian Pkwy.
Durham, NC 27713
Work Phone: (919)544-8500 (ext. 125)
Fax: (919)544-7307
Email: firstname.lastname@example.org
Web: http://www.asciences.com
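
Supplementary note: the logit link of equation (2) and the exchangeable working correlation described in section 4 can be written out explicitly. The sketch below is not taken from the paper. The symbols η (the linear predictor Xβ), α (the common within-cluster correlation) and n (the cluster size) are notation assumed here for illustration, and the error term of equation (2) is dropped so that the first line describes the fitted proportion.

```latex
% Inverse of the logit link: from the linear predictor to the fitted proportion
\eta = X\beta, \qquad
\log\frac{p}{1-p} = \eta
\;\Longleftrightarrow\;
p = \frac{e^{\eta}}{1 + e^{\eta}}

% Exchangeable working correlation for a cluster of n observations
R(\alpha) =
\begin{pmatrix}
1      & \alpha & \cdots & \alpha \\
\alpha & 1      & \cdots & \alpha \\
\vdots & \vdots & \ddots & \vdots \\
\alpha & \alpha & \cdots & 1
\end{pmatrix}_{n \times n}
```

Under this structure any two littermates share the same correlation α, which is what TYPE=EXCH requests in the REPEATED statement. Setting α = 0 gives the independent structure.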