CONTRIBUTED RESEARCH ARTICLES

Party on!
A New, Conditional Variable-Importance Measure for Random Forests Available in the party Package

by Carolin Strobl, Torsten Hothorn and Achim Zeileis

  Abstract: Random forests are one of the most popular statistical learning algorithms, and a variety of methods for fitting random forests and related recursive partitioning approaches is available in R. This paper points out two important features of the random forest implementation cforest available in the party package: The resulting forests are unbiased and thus preferable to the randomForest implementation available in the randomForest package if predictor variables are of different types. Moreover, a conditional permutation importance measure has recently been added to the party package, which can help evaluate the importance of correlated predictor variables. The rationale of this new measure is illustrated and hands-on advice is given for the usage of recursive partitioning tools in R.

Recursive partitioning methods are amongst the most popular and widely used statistical learning tools for nonparametric regression and classification. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, are being applied successfully in many scientific fields (see, e.g., Lunetta et al., 2004; Strobl et al., 2009, and the references therein for applications in genetics and social sciences). Thus, it is not surprising that there is a variety of recursive partitioning tools available in R (see MachineLearning for an overview).

The scope of recursive partitioning methods in R ranges from the standard classification and regression trees available in rpart (Therneau et al., 2008) to the reference implementation of random forests (Breiman, 2001) available in randomForest (Liaw and Wiener, 2002, 2008). Both methods are popular in applied research, and several extensions and refinements have been suggested in the statistical literature in recent years.

One particularly important improvement was the introduction of unbiased tree algorithms, which overcome the major weak spot of the classical approaches available in rpart and randomForest: variable-selection bias. The term variable-selection bias refers to the fact that in standard tree algorithms variable selection is biased in favor of variables offering many potential cut-points, so that variables with many categories and continuous variables are artificially preferred (see, e.g., Kim and Loh, 2001; Shih, 2002; Hothorn et al., 2006; Strobl et al., 2007a, for details).

To overcome this weakness of the early tree algorithms, new algorithms have been developed that do not artificially favor splits in variables with many categories or continuous variables. In R such an unbiased tree algorithm is available in the ctree function for conditional inference trees in the party package (Hothorn et al., 2006). The package also provides a random forest implementation cforest based on unbiased trees, which enables learning unbiased forests (Strobl et al., 2007b).

Unbiased variable selection is the key to reliable prediction and interpretability in both individual trees and forests. However, while a single tree’s interpretation is straightforward, in random forests an extra effort is necessary to assess the importance of each predictor in the complex ensemble of trees.

This issue is typically addressed by means of variable-importance measures such as Gini importance and the “mean decrease in accuracy” or “permutation” importance, available in randomForest in the importance() function (with type = 2 and type = 1, respectively). Similarly, a permutation-importance measure for cforest is available via varimp() in party.

Unfortunately, variable-importance measures in random forests are subject to the same bias in favor of variables with many categories and continuous variables that affects variable selection in single trees, and also to a new source of bias induced by the resampling scheme (Strobl et al., 2007b). Both problems can be addressed in party to guarantee unbiased variable selection and variable importance for predictor variables of different types.

Even though this refined approach can provide reliable variable-importance measures in many applications, the original permutation importance can be misleading in the case of correlated predictors. Therefore, Strobl et al. (2008) suggested a solution for this problem in the form of a new, conditional permutation-importance measure. Starting from version 0.9-994, this new measure is available in the party package.

The rationale and usage of this new measure are outlined in the following sections and illustrated by means of a toy example.

Random forest variable-importance measures

Permutation importance, which is available in randomForest and party, is based on a random permutation of the predictor variables, as described in more detail below.

The alternative variable-importance measure

The R Journal Vol. 1/2, December 2009                                                             ISSN 2073-4859

available in randomForest, Gini importance, is based on the Gini gain criterion employed in most traditional classification tree algorithms. However, the Gini importance has been shown to carry forward the bias of the underlying Gini-gain splitting criterion (see, e.g., Kim and Loh, 2001; Strobl et al., 2007a; Hothorn et al., 2006) when predictor variables vary in their number of categories or scale of measurement (Strobl et al., 2007b). Therefore, it is not recommended in these situations.

Permutation importance, on the other hand, is a reliable measure of variable importance for uncorrelated predictors when sub-sampling without replacement (instead of bootstrap sampling) and unbiased trees are used in the construction of the forest (Strobl et al., 2007b). Accordingly, the default settings for the control parameters cforest_control have been pre-defined to the default version cforest_unbiased to guarantee sub-sampling without replacement and unbiased individual trees in fitting random forests with the party package.

The rationale of the original permutation-importance measure is the following: By randomly permuting the predictor variable X_j, its original association with the response Y is broken. When the permuted variable X_j, together with the remaining non-permuted predictor variables, is used to predict the response for the out-of-bag observations, the prediction accuracy (measured by the number of correctly classified observations in classification, or by the mean squared error in regression) deteriorates substantially if the original variable X_j was associated with the response. Thus, Breiman (2001) suggests the difference in prediction accuracy before and after permuting X_j, averaged over all trees, as a measure for variable importance.

In standard implementations of random forests, such as randomForest in R, an additional scaled version of permutation importance (often called the z-score), which is computed by dividing the raw importance by its standard error, is provided (for example by importance(obj, type = 1, scale = TRUE) in randomForest). Note, however, that the results of Strobl and Zeileis (2008) show that the z-score is not suitable for significance tests and that raw importance has better statistical properties.

Why conditional importance?

The original permutation-importance measure can, for reasons outlined below, be considered as a marginal measure of importance. In this sense, it has the same property as, e.g., a marginal correlation coefficient: A variable that has no effect of its own, but is correlated with a relevant predictor variable, can receive a high importance score. Accordingly, empirical results (see Strobl et al., 2008, and the references therein) suggest that the original permutation-importance measure often assigns higher scores to correlated predictors.

In contrast, partial correlation coefficients (like the coefficients in linear regression models) measure the importance of a variable given the other predictor variables in the model. The advantage of such a partial, or conditional, approach is illustrated by means of a toy example: The data set readingSkills is an artificial data set generated by means of a linear model. The response variable contains hypothetical scores on a test of reading skills for 200 school children. Potential predictor variables in the data set are the age of the child, whether the child is a native speaker of the test language and the shoe size of the child.

Obviously, the latter is not a sensible predictor of reading skills (and was actually simulated not to have any effect on the response), but with respect to marginal (as opposed to partial) correlations, shoe size is highly correlated with the test score. Of course this spurious correlation is only due to the fact that both shoe size and test score are associated with the underlying variable age.

In this simple problem, a linear model would be perfectly capable of identifying the original coefficients of the predictor variables (including the fact that shoe size has no effect on reading skills once the truly relevant predictor variable age is included in the model). However, the cforest permutation-importance measure is misled by the spurious correlation and assigns a rather high importance value to the nonsense-variable shoe size:

> library("party")
> set.seed(42)
> # forest object; the name readingSkills.cf is an assumption
> readingSkills.cf <- cforest(score ~ .,
+    data = readingSkills, controls =
+    cforest_unbiased(mtry = 2, ntree = 50))

> set.seed(42)
> varimp(readingSkills.cf)

nativeSpeaker           age      shoeSize
     12.62036      74.52034      17.97287

The reason for this odd behavior can be found in the way the predictor variables are permuted in the computation of the importance measure: Strobl et al. (2008) show that the original approach, where one predictor variable X_j is permuted against both the response Y and the remaining (one or more) predictor variables Z = X_1, ..., X_{j-1}, X_{j+1}, ..., X_p, as illustrated in Figure 1, corresponds to a pattern of independence between X_j and both Y and Z.

From a theoretical point of view, this means that a high value of the importance measure can be caused by a violation either of the independence between X_j and Y or of the independence between X_j and Z, even though the latter is not of interest here. For practical applications, this means that a variable X_j


that is correlated with an important predictor Z can appear more important than an uncorrelated variable, even if X_j has no effect of its own.

      Y      X_j               Z
      y_1    x_{π_j(1),j}      z_1
      ...    ...               ...
      y_i    x_{π_j(i),j}      z_i
      ...    ...               ...
      y_n    x_{π_j(n),j}      z_n

Figure 1: Permutation scheme for the original permutation-importance measure.

      Y       X_j                     Z
      y_1     x_{π_{j|Z=a}(1),j}      z_1 = a
      y_3     x_{π_{j|Z=a}(3),j}      z_3 = a
      y_27    x_{π_{j|Z=a}(27),j}     z_27 = a
      y_6     x_{π_{j|Z=b}(6),j}      z_6 = b
      y_14    x_{π_{j|Z=b}(14),j}     z_14 = b
      y_21    x_{π_{j|Z=b}(21),j}     z_21 = b
      ...     ...                     ...

Figure 2: Permutation scheme for the conditional permutation importance.

The aim to reflect only the impact of X_j itself in predicting the response Y, rather than its correlations with other predictor variables, can be better achieved by means of a conditional importance measure in the spirit of a partial correlation: We want to measure the association between X_j and Y given the correlation structure between X_j and the other predictor variables in the data set.

To meet this aim, Strobl et al. (2008) suggest a conditional permutation scheme, where X_j is permuted only within groups of observations with Z = z in order to preserve the correlation structure between X_j and the other predictor variables, as illustrated in Figure 2.

With this new, conditional permutation scheme, the importance measure is able to reveal the spurious correlation between shoe size and reading skills:

> # readingSkills.cf: the forest fitted above (object name assumed)
> set.seed(42)
> varimp(readingSkills.cf, conditional =
+    TRUE)

nativeSpeaker           age      shoeSize
    11.161887     44.388450      1.087162

Only by means of conditional importance does it become clear that the covariate native speaker is actually more relevant for predicting the test score than shoe size is, whose conditional effect is negligible. Thus, the conditional importance mimics the behavior of partial correlations or linear regression coefficients, with the interpretation of which many readers may be more familiar. However, whether a marginal or conditional importance measure is to be preferred depends on the actual research question.

Current research also investigates the impact that the choice of the tuning parameter mtry, which regulates the number of randomly preselected predictor variables that can be chosen in each split (cf. Strobl et al., 2008), and parameters regulating the depth of the trees have on variable importance.

How is the conditioning grid defined technically?

Conditioning is straightforward whenever the variables to be conditioned on, Z, are categorical (cf., e.g., Nason et al., 2004). However, conditioning on continuous variables, which may entail as many different values as observations in the sample, would produce cells with very sparse counts, which would make permuting the values of X_j within each cell rather pointless. Thus, in order to create cells of reasonable size for conditioning, continuous variables need to be discretized.

As a straightforward discretization strategy for random forests, Strobl et al. (2008) suggest defining the conditioning grid by means of the partition of the feature space induced by each individual tree. This grid can be used to conditionally permute the values of X_j within cells defined by combinations of Z, where Z can contain potentially large sets of covariates of different scales of measurement.

The main advantages of this approach are that this partition has already been learned from the data during model fitting, that it can contain splits in categorical, ordered and continuous predictor variables, and that it can thus serve as an internally available means for discretizing the feature space. For ease of computation, the conditioning grid employed in varimp uses all cut-points as bisectors of the sample space (the same approach is followed by Nason et al., 2004).

The set of variables Z to be conditioned on should contain all variables that are correlated with the current variable of interest X_j. In the varimp function, this is assured by the small default value 0.2 of the threshold argument: By default, all variables whose correlation with X_j meets the condition 1 − p-value > 0.2 are used for conditioning. A larger value of threshold would have the effect that only those variables that are strongly correlated with X_j would be used for conditioning, but would also lower the computational burden.

Note that the same permutation tests that are used for split selection in the tree building process (Hothorn et al., 2006) are used here to measure the association between X_j and the remaining covariates.
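The difference between the two permutation schemes of Figures 1 and 2 can be sketched in a few lines of base R. The sketch below is only an illustration under stated assumptions: the data are simulated here as a stand-in for readingSkills (with a continuous age), the cut-points are chosen by hand rather than taken from the trees of a forest as varimp does, and the helper permute_within is a hypothetical name. It shows only how X_j is permuted, not the importance computation itself.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the toy data (an assumption, not readingSkills):
# Z (age) drives both X_j (shoe size) and Y (score); X_j has no effect
# of its own on Y.
age      <- runif(n, min = 5, max = 11)
shoeSize <- 20 + 1.5 * age + rnorm(n, sd = 0.5)
score    <- 25 + 4 * age + rnorm(n, sd = 2)

# Marginal scheme (Figure 1): permute X_j over all observations.
shoe_marginal <- shoeSize[sample.int(n)]

# Conditional scheme (Figure 2): discretize Z by cut-points, then
# permute X_j only within each cell of the resulting grid.
cut_points <- 6:10   # hand-picked here; varimp derives them from the trees
cell <- cut(age, breaks = c(-Inf, cut_points, Inf))
permute_within <- function(x) x[sample.int(length(x))]  # hypothetical helper
shoe_conditional <- ave(shoeSize, cell, FUN = permute_within)

# Correlation of each version of X_j with Z (age) and with Y (score).
print(round(rbind(
  original    = c(age = cor(shoeSize, age),
                  score = cor(shoeSize, score)),
  marginal    = c(age = cor(shoe_marginal, age),
                  score = cor(shoe_marginal, score)),
  conditional = c(age = cor(shoe_conditional, age),
                  score = cor(shoe_conditional, score))
), 2))
```

In this setting the marginally permuted shoe size should be close to uncorrelated with both age and score, whereas the conditionally permuted version should remain strongly correlated with age. Under the conditional scheme the association that runs through Z is therefore left intact and does not count towards the importance of X_j.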


A short recipe for fitting random forests and computing variable importance measures with R

To conclude, we would like to summarize the application of conditional variable importance and general issues in fitting random forests with R. Depending on certain characteristics of your data set, we suggest the following approaches:

   • If all predictor variables are of the same type (for example: all continuous or all unordered categorical with the same number of categories), use either randomForest (randomForest) or cforest (party). While randomForest is computationally faster, cforest is safe even for variables of different types.
     For predictor variables of the same type, Gini importance, importance(obj, type = 2), or permutation importance, importance(obj, type = 1), available for randomForest, and permutation importance, varimp(obj), available for cforest, are all adequate importance measures.

   • If the predictor variables are of different types (for example: different scales of measurement, different numbers of categories), use cforest (party) with the default option controls = cforest_unbiased and permutation importance varimp(obj).

   • If the predictor variables are correlated, depending on your research question, conditional importance, available via varimp(obj, conditional = TRUE) for cforest (party), can add to the understanding of your data.

General remarks:

   • Note that the default settings for mtry differ in randomForest and cforest: In randomForest the default setting for classification, e.g., is floor(sqrt(ncol(x))), while in cforest it is fixed to the value 5 for technical reasons.

   • Always check whether you get the same results with a different random seed before interpreting the variable importance ranking!
     If the ranking of even the top-scoring predictor variables depends on the choice of the random seed, increase the number of trees (argument ntree in randomForest and cforest_control).

Bibliography

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.

H. Kim and W. Loh. Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96(454):589–604, 2001.

A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

A. Liaw and M. Wiener. randomForest: Breiman and Cutler’s Random Forests for Classification and Regression, 2008. URL package=randomForest. R package version 4.5-28.

K. L. Lunetta, L. B. Hayward, J. Segal, and P. V. Eerdewegh. Screening large-scale association study data: Exploiting interactions using random forests. BMC Genetics, 5:32, 2004.

M. Nason, S. Emerson, and M. Leblanc. CARTscans: A tool for visualizing complex models. Journal of Computational and Graphical Statistics, 13(4):1–19, 2004.

Y.-S. Shih. Regression trees with unbiased variable selection. Statistica Sinica, 12:361–386, 2002.

C. Strobl and A. Zeileis. Danger: High power! – Exploring the statistical properties of a test for random forest variable importance. In Proceedings of the 18th International Conference on Computational Statistics, Porto, Portugal, 2008.

C. Strobl, A.-L. Boulesteix, and T. Augustin. Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483–501, 2007a.

C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:25, 2007b.

C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9:307, 2008.

C. Strobl, J. Malley, and G. Tutz. An introduction to recursive partitioning: Rationale, application and characteristics of classification and regression trees, bagging and random forests. Psychological Methods, 2009. In press.

T. M. Therneau, B. Atkinson, and B. D. Ripley. rpart: Recursive partitioning. 2008. URL http://CRAN. R package version 3.1-41.

Carolin Strobl
Department of Statistics
Munich, Germany
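The recipe given in this article might be put into practice along the following lines. This is a minimal sketch rather than a definitive workflow: the helper fit_and_rank and the second seed are hypothetical choices made for illustration, mtry = 2 and ntree = 50 are simply taken over from the toy example, and the code is wrapped in a check so that it does nothing when the party package is not installed.

```r
# Sketch of the recipe for correlated predictors of different types:
# fit an unbiased forest, compute marginal and conditional permutation
# importance, and repeat with a second seed before interpreting ranks.
if (requireNamespace("party", quietly = TRUE)) {
  library("party")
  data("readingSkills", package = "party")

  # Hypothetical helper: one forest and both importance measures per seed.
  fit_and_rank <- function(seed, ntree = 50) {
    set.seed(seed)
    forest <- cforest(score ~ ., data = readingSkills,
                      controls = cforest_unbiased(mtry = 2, ntree = ntree))
    list(marginal    = varimp(forest),
         conditional = varimp(forest, conditional = TRUE))
  }

  run1 <- fit_and_rank(seed = 42)
  run2 <- fit_and_rank(seed = 4711)   # arbitrary second seed

  # Conditional importance side by side for the two seeds.
  print(rbind(seed_42 = run1$conditional, seed_4711 = run2$conditional))

  # Does the ranking agree across seeds? If not, increase ntree and refit.
  same_ranking <- identical(names(sort(run1$conditional)),
                            names(sort(run2$conditional)))
  print(same_ranking)
}
```

If same_ranking is FALSE, the general remarks above suggest increasing ntree until the ranking of at least the top-scoring predictors is stable across seeds.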
