					                                                                        Statistics, Data Analysis, and Data Mining

                                             Paper 264-26

                            Model Fitting in PROC GENMOD
                                Jean G. Orelien, Analytical Sciences, Inc.

Abstract:

There are several procedures in the SAS System for statistical modeling.
Most statisticians who use the SAS System are familiar with procedures
such as PROC REG and PROC GLM for fitting general linear models. However,
PROC GENMOD can handle these general linear models as well as more
complex ones such as logistic models, loglinear models or models for
count data. In addition, the main advantage of PROC GENMOD is that it can
accommodate the analysis of correlated data. In this paper, we discuss
the use of PROC GENMOD to analyze simple as well as more complex
statistical models. When other procedures are available to perform the
same analysis, we highlight the options from these procedures that may be
missing in PROC GENMOD but might be of interest to the user. An example
is given showing how PROC GENMOD is used to analyze various types of
endpoints (continuous and count data) from a toxicology experiment. The
material in this paper should be accessible even to users with limited
data analysis skills.

1.     Introduction

Generalized linear models (GLMs) include the most common models used in
statistics. This class includes general linear models and logistic
models. It should be noted that general linear models include ANOVA
models as well as regression analysis. Complex models such as those
arising from correlated data (repeated measures, clustered data) can
also be fitted with GLMs. The form of a GLM is given by:

        $f(Y) = X\beta + \varepsilon$                     (1)

The function f is known as the link function. For ANOVA, the link
function is the identity. Other possible link functions include the
logit for logistic regression and the log for count data. Whereas in
general linear models it is necessary to assume that the errors are
independent, have equal variances and are normally distributed, none of
these assumptions are necessary in GLMs.

In the SAS System, GLMs can be fitted in PROC GENMOD, but separate
procedures exist for certain subclasses of GLMs such as logistic
regression or general linear models. For example, PROC GLM can fit
general linear models. Although regression models can be fitted with
PROC GLM, PROC REG is more specific to this type of analysis. Similarly,
PROC ANOVA is specific, as the name indicates, to ANOVA models. Another
SAS procedure for analyzing a subclass of GLMs is PROC LOGISTIC.

In this paper, we provide an overview of some of the models that can be
fitted with PROC GENMOD. When these models can be fitted by other SAS
procedures, we outline some differences between these procedures and
GENMOD that the user needs to be aware of. In section 2, we discuss the
fitting of general linear models in GENMOD and other procedures. The
fitting of logistic models is discussed in section 3. Other types of
models, such as those involving link functions other than the logit and
the identity, as well as models for correlated data, are discussed in
section 4. In section 5, we provide an example showing how we have used
PROC GENMOD to analyze different types of endpoints from the National
Toxicology Program (NTP).
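
The three link functions named in this section (identity, logit and log)
correspond to the DIST= and LINK= options of the MODEL statement in PROC
GENMOD. The following is only a minimal sketch: the data set toy, its
variable names and its values are hypothetical and were made up for
illustration; they do not come from the paper.

```sas
/* Hypothetical toy data: names and values are illustrative only */
data toy;
  input x y events trials count;
  datalines;
1 2.1 1 10 0
2 2.9 3 10 1
3 4.2 4 10 3
4 4.8 7 10 4
5 6.1 9 10 8
;
run;

/* Identity link with normal errors: an ordinary linear regression */
proc genmod data=toy;
  model y = x / dist=normal link=identity;
run;

/* Logit link with a binomial response given as events/trials */
proc genmod data=toy;
  model events/trials = x / dist=binomial link=logit;
run;

/* Log link with a Poisson response for count data */
proc genmod data=toy;
  model count = x / dist=poisson link=log;
run;
```

Only the response and the DIST= and LINK= options change between the
three calls; the linear predictor on the right-hand side of the MODEL
statement stays the same.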

2.     Fitting of General Linear Models in GENMOD and Other Procedures

There are many procedures besides PROC GENMOD in the SAS System for the
fitting of general linear models. The three most commonly used are PROC
ANOVA, PROC REG and PROC GLM. PROC GLM is the most comprehensive of the
three procedures. Any analysis of general linear models can be performed
in this procedure. However, the other two procedures are more efficient
or offer more options for certain subclasses of general linear models.
PROC ANOVA handles analysis of variance models with balanced designs.
For these models, PROC ANOVA is faster than PROC GLM. For regression
models, PROC REG provides many options that are not found in other
procedures. Among the advantages of PROC REG are that the user can fit
several models with one call of the procedure, and that there are more
options for model selection and more diagnostic tools to detect
multicollinearity. Multicollinearity occurs when the independent
variables in the model are correlated among themselves. This can lead to
large variances for the estimated coefficients and affect our
interpretation of these coefficients.

Although PROC GENMOD can fit any general linear model, there are many
useful options that it does not provide. For example, in an ANOVA model
with random effects, one may be interested in estimating the variance
components. This would not be possible in PROC GENMOD. With ANOVA
models, one may also want to compare the means of an independent
variable at several levels. As of version 7.0, there is an LSMEANS
statement in PROC GENMOD, but unlike in PROC GLM, only the least squares
means can be obtained with this statement; no statistical comparisons of
the means can be made. It should be noted that while the other
procedures mentioned in this section use least squares methods to
estimate the coefficient parameters, PROC GENMOD uses maximum likelihood
methods. For general linear models, the maximum likelihood and the least
squares methods yield the same estimates.

There are instances where the data analyst is familiar with the data and
may want to use PROC GENMOD so that the output from the analysis of
general linear models is uniform with the output from more complex
models that can only be analyzed in PROC GENMOD. An example of that will
be given in section 5. In general, we suggest that GENMOD be used for
the analysis of general linear models only in those instances where the
analyst wishes to obtain coefficient estimates and their variances. In
these cases, the analyst should know from prior experience with similar
data that the data are well “behaved” and that no assumptions of the
general linear model are violated.

We give below some examples of the use of PROC GENMOD for the analysis
of general linear models.

2.1    Example of a Regression Analysis

Suppose sales for a company in a district can be predicted as a function
of the target population and the per capita discretionary income. The
data come from Applied Linear Statistical Models by Neter, Wasserman and
Kutner (page 249). Regression coefficients can be obtained with either
of the following:

proc genmod data=salesdata;
  model sales = target_population discretionary_income;
run;

or

proc reg data=salesdata;
  model sales = target_population discretionary_income;
run;

2.2    Example of an Analysis of Variance

Suppose a research laboratory develops a new compound for the relief of
severe hay fever and wants to compare the effect of the ingredients on
the outcome. The outcome being measured is the number of hours of
relief. This hypothetical example is also taken from Neter, Wasserman
and Kutner (page 722). We could perform this analysis in PROC GENMOD
with the following syntax:

proc genmod data=compounds_data;
  class ingredient1 ingredient2;
  model hours_of_relief = ingredient1 ingredient2;
run;

The same analysis could also be performed in PROC GLM:

proc glm data=compounds_data;
  class ingredient1 ingredient2;
  model hours_of_relief = ingredient1 ingredient2;
run;

3.     Fitting of Logistic Models in PROC GENMOD and PROC LOGISTIC

Logistic models are of the form:

        $\log\left(\frac{p}{1-p}\right) = X\beta + \varepsilon$          (2)

These models are appropriate for modeling proportions. Similar to a
regular regression, a logistic model can be used to predict the
proportion p that will be obtained for given values of the independent
variables. But a logistic model can also be used to determine whether an
independent variable significantly affects the variation of the
dependent variable. In these cases, we are interested in knowing whether
the odds of having the outcome are the same for all levels of the
independent variable(s). For example, we may be interested in
determining whether the odds of having a given disease are the same for
smokers and nonsmokers. If one is interested in building a model to
predict the variation of a proportion as a function of independent
variables, then PROC LOGISTIC would seem to be the clear choice because
of the options it provides for model selection. On the other hand, if
one is only interested in finding out whether an independent variable
has a significant effect on the variation of a proportion, then the
analyst has the choice of using either PROC LOGISTIC or PROC GENMOD. In
earlier versions of the SAS System (at least up to version 6.12), there
were no CLASS or CONTRAST statements in PROC LOGISTIC. Thus, for the
analysis of categorical variables one might have preferred PROC GENMOD
over PROC LOGISTIC in earlier versions, since these categorical
variables would have had to be recoded in a data step prior to the call
of the LOGISTIC procedure.

We give here an example of the use of PROC GENMOD for the analysis of
binary data.

Example of a logistic regression analysis with binary data

Bliss (1935) reports the proportion of beetles killed after 5 hours of
exposure at various concentrations of gaseous carbon disulphide. To
obtain the regression coefficients that model the proportion of beetles
killed as a function of dosage, the following SAS code can be used:

proc genmod data=beetle_data;
  model number_killed/number_of_beetles = dosage / link=logit dist=binomial;
run;

PROC LOGISTIC could also be used:

proc logistic data=beetle_data;
  model number_killed/number_of_beetles = dosage / link=logit;
run;

Notice that in PROC GENMOD we needed to specify the LINK and DIST
options, since the default values (normal distribution with identity
link) would not be appropriate for the analysis of binary data. PROC
LOGISTIC assumes a binomial response and uses the logit link by default,
so no DIST option is given there. The outcome under study is expressed
as a ratio of two variables. Alternatively, we could use a dichotomous
variable taking values 0 or 1 to indicate whether
or not an individual beetle was killed, and model that variable as a
function of the dosage of carbon disulphide.

4.     Analysis of Count Data and Correlated Data

The main advantage of PROC GENMOD compared to other data analysis
procedures is that it can fit complex models that cannot be fitted in
other procedures for linear models such as GLM or LOGISTIC. In this
section, we discuss the use of PROC GENMOD for the analysis of count
data and correlated data. PROC GENMOD can fit data arising from a number
of distributions. If a distribution is not available as an option, the
user can even specify that distribution. One of the distributions
available in PROC GENMOD is the Poisson distribution, which is generally
used for count data. Other available distributions include the gamma,
inverse Gaussian and negative binomial.

Correlated data can arise from taking repeated measurements on subjects
or from subjects belonging to the same cluster. The cluster can be a
geographical region, a clinical site in a multi-site study or a litter
in a toxicity study. Failure to account for the correlation in the data
can result in underestimating the variance, which will lead to
artificially low p-values. Several methods can be used to analyze
correlated data, including general linear multivariate models (GLMMs)
and linear mixed models. GLMMs can be fitted in PROC GLM and linear
mixed models can be fitted using PROC MIXED. Generalized estimating
equations (GEE) methods, which are used in GENMOD to account for
correlated data, may in many situations be preferred over the other
methods mentioned above for various reasons (such as missing data or
non-normality).

For correlated data, the analyst must specify a “working correlation
matrix”. This working correlation matrix reflects the analyst's
assumption about the correlation structure between observations from the
same cluster. The correlation structure can take many forms. One of the
most commonly made assumptions is that the correlation within a cluster
is exchangeable, that is, the correlation is the same between any two
elements of a cluster. Other correlation structures that are available
include independent, autoregressive and m-dependent. The user can also
specify a fixed correlation matrix.

The information that PROC GENMOD needs to model the correlation in the
data is supplied through the REPEATED statement. There are options in
the REPEATED statement to specify the form of the correlation structure
as well as convergence criteria. The CORR option (also written TYPE, as
in the example below) is probably the most important option in the
REPEATED statement. It is used to specify the “working correlation
matrix” that was described above. The SUBJECT option identifies the
cluster.

Example

Paul (1982) reported an experiment in which pregnant rabbits were dosed
with an unspecified toxic substance. The foetuses were observed for
skeletal and visceral abnormalities. The cluster here is the litter.
Typically, in these types of experiments, it is assumed that the
correlation within a litter is exchangeable, that is, the correlation is
the same between any two littermates. The data could be analyzed with
the following syntax:

proc genmod data=rabbit_toxicity;
  class litter dose;
  model malformation = dose / dist=binomial link=logit;
  repeated subject=litter / sorted type=exch;

(The sorted option tells SAS that the data is
properly sorted by subject.)

5.     Example of the Use of PROC GENMOD to Analyze Various Types of
       Endpoints from a Toxicity Study

In analyzing data from toxicity studies, my preference has been to use
PROC GENMOD over other procedures. In this section, I give a brief
description of the data from these toxicology experiments. I also
discuss why we prefer to use PROC GENMOD over other procedures and how
we use it.

A toxicology study can best be seen as a number of independent
experiments on the effect of the same explanatory variable. For example,
to study the health effects of a given chemical, one experiment might be
conducted to investigate the effect of this chemical on male
reproductive organs, and another experiment with the same chemical would
investigate its effect on female reproductive organs. In most of these
experiments, such as the example from the previous section, the data
will be correlated; in some others, the data will be uncorrelated. For
example, in some of the reproductive toxicity experiments conducted by
the National Institute of Environmental Health Sciences (NIEHS), rodents
selected from different litters are given the toxic substance and are
then sacrificed. In these types of experiments, we would consider the
data to be uncorrelated, and traditional ANOVA methods could be used.
The endpoints collected can be binary (such as malformation), continuous
(such as organ weights) or count (such as sperm count). Even within the
same experiment, there may be different types of endpoints.

In analyzing these data, we have preferred to use PROC GENMOD over other
procedures. The main reason is that with the versatility of PROC GENMOD,
we can handle correlated and uncorrelated data regardless of the type of
endpoint. This makes it easier to analyze all of the endpoints from an
experiment using a single SAS macro. For some of these endpoints the use
of PROC MIXED might have been preferable. We opted against it since
doing so would have required that we use other procedures for endpoints
that do not follow the normal distribution (binary and count data).
Another reason for not using PROC MIXED is that for some endpoints
convergence can be difficult to achieve and would require some trial and
error. Given that the number of endpoints to be analyzed can be more
than 300, it is not practical to fit the endpoints one at a time.

To handle the large number of endpoints coming from these studies, we
manipulated the data so that it could be sorted by endpoint, cluster and
dose group. With the data sorted in this manner, the different types of
endpoints (binary, continuous and count) were grouped together, and all
endpoints of one type were analyzed in the same call to PROC GENMOD. For
example, the continuous and count endpoints from an experiment where the
data were correlated would be handled by the following SAS code:

/* Correlated Data */

/* Continuous Endpoints */
proc genmod data=toxdata(where=(count=0));
  class dose litter;
  model outcome = dose / type3 link=identity covb;
  repeated subject=litter / type=exch maxiter=25000 covb corrb;
  by endpt;
run;

/* Count Endpoints */
proc genmod data=toxdata(where=(count=1));
  class dose litter;
  model outcome = dose / type3 d=poisson covb;
  repeated subject=litter / type=exch maxiter=25000 covb corrb;
  by endpt;
run;

In the SAS macro we have a macro variable to identify whether the
observations from the experiments are correlated or not. For an
experiment where the data were uncorrelated, the SAS code below would be
executed by the macro:

/* Uncorrelated Data */

/* Continuous Endpoints */
proc genmod data=toxdata(where=(count=0));
  class dose;
  model outcome = dose / type3 link=identity covb;
  by endpt;
run;

/* Count Endpoints */
proc genmod data=toxdata(where=(count=1));
  class dose litter;
  model outcome = dose / type3 d=poisson covb;
  by endpt;
run;

6.     Conclusion

Data that can be analyzed in PROC ANOVA can also be handled in PROC
GENMOD. There are instances where the analyst is familiar enough with
the data and the only outputs of interest are the parameter estimates,
standard errors, p-values and confidence intervals. In these instances,
the use of PROC GENMOD might be preferred. We have given an example from
toxicology data where using PROC GENMOD is more efficient because of the
different types of endpoints that have to be analyzed from these
experiments.

References

Agresti, A. (1990). Categorical Data Analysis. New York: Wiley.

Neter, J., Wasserman, W. and Kutner, M. (1990). Applied Linear
Statistical Models. Boston: Irwin.

Orelien et al. (2000). Multiple Comparison with a Control in GEE Models
Using the SAS System. Presented at the SAS User Group International
Conference in Indianapolis.

Stokes, M.E., Davis, C.S. and Koch, G.G. (1995). Categorical Data
Analysis Using the SAS System. Cary, NC: SAS Institute, Inc.

Contact Information

Your questions and comments are welcome. Please contact:

Jean G. Orelien
Analytical Sciences, Inc.
2605 Meridian Pkwy.
Durham, NC 27713
Work Phone: (919)544-8500 (ext. 125)
Fax: (919)544-7307
