
Contributed Research Articles

Party on! A New, Conditional Variable-Importance Measure for Random Forests Available in the party Package

by Carolin Strobl, Torsten Hothorn and Achim Zeileis

Abstract: Random forests are one of the most popular statistical learning algorithms, and a variety of methods for fitting random forests and related recursive partitioning approaches is available in R. This paper points out two important features of the random forest implementation cforest available in the party package: The resulting forests are unbiased and thus preferable to the randomForest implementation available in randomForest if predictor variables are of different types. Moreover, a conditional permutation-importance measure has recently been added to the party package, which can help evaluate the importance of correlated predictor variables. The rationale of this new measure is illustrated and hands-on advice is given for the usage of recursive partitioning tools in R.

Recursive partitioning methods are amongst the most popular and widely used statistical learning tools for nonparametric regression and classification. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, are being applied successfully in many scientific fields (see, e.g., Lunetta et al., 2004; Strobl et al., 2009, and the references therein for applications in genetics and social sciences). Thus, it is not surprising that there is a variety of recursive partitioning tools available in R (see http://CRAN.R-project.org/view=MachineLearning for an overview).

The scope of recursive partitioning methods in R ranges from the standard classification and regression trees available in rpart (Therneau et al., 2008) to the reference implementation of random forests (Breiman, 2001) available in randomForest (Liaw and Wiener, 2002, 2008). Both methods are popular in applied research, and several extensions and refinements have been suggested in the statistical literature in recent years.

One particularly important improvement was the introduction of unbiased tree algorithms, which overcome the major weak spot of the classical approaches available in rpart and randomForest: variable-selection bias. The term variable-selection bias refers to the fact that in standard tree algorithms variable selection is biased in favor of variables offering many potential cut-points, so that variables with many categories and continuous variables are artificially preferred (see, e.g., Kim and Loh, 2001; Shih, 2002; Hothorn et al., 2006; Strobl et al., 2007a, for details).

To overcome this weakness of the early tree algorithms, new algorithms have been developed that do not artificially favor splits in variables with many categories or continuous variables. In R such an unbiased tree algorithm is available in the ctree function for conditional inference trees in the party package (Hothorn et al., 2006). The package also provides a random forest implementation cforest based on unbiased trees, which enables learning unbiased forests (Strobl et al., 2007b).

Unbiased variable selection is the key to reliable prediction and interpretability in both individual trees and forests. However, while a single tree's interpretation is straightforward, in random forests an extra effort is necessary to assess the importance of each predictor in the complex ensemble of trees. This issue is typically addressed by means of variable-importance measures such as Gini importance and the "mean decrease in accuracy" or "permutation" importance, available in randomForest in the importance() function (with type = 2 and type = 1, respectively). Similarly, a permutation-importance measure for cforest is available via varimp() in party.

Unfortunately, variable-importance measures in random forests are subject to the same bias in favor of variables with many categories and continuous variables that affects variable selection in single trees, and also to a new source of bias induced by the resampling scheme (Strobl et al., 2007b). Both problems can be addressed in party to guarantee unbiased variable selection and variable importance for predictor variables of different types.

Even though this refined approach can provide reliable variable-importance measures in many applications, the original permutation importance can be misleading in the case of correlated predictors. Therefore, Strobl et al. (2008) suggested a solution for this problem in the form of a new, conditional permutation-importance measure. Starting from version 0.9-994, this new measure is available in the party package.

The rationale and usage of this new measure is outlined in the following sections and illustrated by means of a toy example.

Random forest variable-importance measures

Permutation importance, which is available in randomForest and party, is based on a random permutation of the predictor variables, as described in more detail below.
The alternative variable-importance measure available in randomForest, Gini importance, is based on the Gini gain criterion employed in most traditional classification tree algorithms. However, the Gini importance has been shown to carry forward the bias of the underlying Gini-gain splitting criterion (see, e.g., Kim and Loh, 2001; Strobl et al., 2007a; Hothorn et al., 2006) when predictor variables vary in their number of categories or scale of measurement (Strobl et al., 2007b). Therefore, it is not recommended in these situations.

Permutation importance, on the other hand, is a reliable measure of variable importance for uncorrelated predictors when sub-sampling without replacement (instead of bootstrap sampling) and unbiased trees are used in the construction of the forest (Strobl et al., 2007b). Accordingly, the default settings for the control parameters cforest_control have been pre-defined to the default version cforest_unbiased to guarantee sub-sampling without replacement and unbiased individual trees in fitting random forests with the party package.

The rationale of the original permutation-importance measure is the following: By randomly permuting the predictor variable X_j, its original association with the response Y is broken. When the permuted variable X_j, together with the remaining non-permuted predictor variables, is used to predict the response for the out-of-bag observations, the prediction accuracy (i.e., the number of correctly classified observations in classification, or respectively the mean squared error in regression) decreases substantially if the original variable X_j was associated with the response. Thus, Breiman (2001) suggests the difference in prediction accuracy before and after permuting X_j, averaged over all trees, as a measure of variable importance.
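This rationale can be sketched in a few lines of base R. The toy version below uses a single linear model and one train/test split instead of averaging the accuracy decrease over the out-of-bag samples of many trees; the variable names (x1, x2) and the data-generating model are made up for illustration only:

```r
## Sketch of the permutation-importance idea (not the forest internals):
## importance of a variable = increase in test error after permuting it.
set.seed(1)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n)          # x1 is relevant, x2 is noise

train <- dat[1:100, ]
test  <- dat[101:200, ]
fit   <- lm(y ~ x1 + x2, data = train)

mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)
base_mse <- mse(test)

perm_imp <- function(var) {
  d <- test
  d[[var]] <- sample(d[[var]])          # break the association with y
  mse(d) - base_mse                     # error increase = importance
}

perm_imp("x1")   # large: permuting x1 destroys real information
perm_imp("x2")   # near zero: x2 carried no information
```

A forest implementation repeats this permutation for every tree on that tree's out-of-bag observations and averages the differences, but the core permutation step is the same.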
In standard implementations of random forests, such as randomForest in R, an additional scaled version of permutation importance (often called the z-score), which is computed by dividing the raw importance by its standard error, is provided (for example by importance(obj, type = 1, scale = TRUE) in randomForest). Note, however, that the results of Strobl and Zeileis (2008) show that the z-score is not suitable for significance tests and that raw importance has better statistical properties.

Why conditional importance?

The original permutation-importance measure can, for reasons outlined below, be considered as a marginal measure of importance. In this sense, it has the same property as, e.g., a marginal correlation coefficient: A variable that has no effect of its own, but is correlated with a relevant predictor variable, can receive a high importance score. Accordingly, empirical results (see Strobl et al., 2008, and the references therein) suggest that the original permutation-importance measure often assigns higher scores to correlated predictors.

In contrast, partial correlation coefficients (like the coefficients in linear regression models) measure the importance of a variable given the other predictor variables in the model. The advantage of such a partial, or conditional, approach is illustrated by means of a toy example: The data set readingSkills is an artificial data set generated by means of a linear model. The response variable contains hypothetical scores on a test of reading skills for 200 school children. Potential predictor variables in the data set are the age of the child, whether the child is a native speaker of the test language, and the shoe size of the child.

Obviously, the latter is not a sensible predictor of reading skills (and was actually simulated not to have any effect on the response), but with respect to marginal (as opposed to partial) correlations, shoe size is highly correlated with the test score. Of course this spurious correlation is only due to the fact that both shoe size and test score are associated with the underlying variable age.

In this simple problem, a linear model would be perfectly capable of identifying the original coefficients of the predictor variables (including the fact that shoe size has no effect on reading skills once the truly relevant predictor variable age is included in the model). However, the cforest permutation-importance measure is misled by the spurious correlation and assigns a rather high importance value to the nonsense variable shoe size:

> library("party")
> set.seed(42)
> readingSkills.cf <- cforest(score ~ .,
+     data = readingSkills, control =
+     cforest_unbiased(mtry = 2, ntree = 50))
> set.seed(42)
> varimp(readingSkills.cf)
nativeSpeaker           age      shoeSize
     12.62036      74.52034      17.97287

The reason for this odd behavior can be found in the way the predictor variables are permuted in the computation of the importance measure: Strobl et al. (2008) show that the original approach, where one predictor variable X_j is permuted against both the response Y and the remaining (one or more) predictor variables Z = X_1, ..., X_{j-1}, X_{j+1}, ..., X_p, as illustrated in Figure 1, corresponds to a pattern of independence between X_j and both Y and Z.

From a theoretical point of view, this means that a high value of the importance measure can be caused by a violation either of the independence between X_j and Y or of the independence between X_j and Z, even though the latter is not of interest here. For practical applications, this means that a variable X_j that is correlated with an important predictor Z can appear more important than an uncorrelated variable, even if X_j has no effect of its own.

    Y      X_j               Z
    y_1    x_{π_j(1),j}      z_1
    ...    ...               ...
    y_i    x_{π_j(i),j}      z_i
    ...    ...               ...
    y_n    x_{π_j(n),j}      z_n

Figure 1: Permutation scheme for the original permutation-importance measure.

    Y      X_j                     Z
    y_1    x_{π_{j|Z=a}(1),j}      z_1  = a
    y_3    x_{π_{j|Z=a}(3),j}      z_3  = a
    y_27   x_{π_{j|Z=a}(27),j}     z_27 = a
    y_6    x_{π_{j|Z=b}(6),j}      z_6  = b
    y_14   x_{π_{j|Z=b}(14),j}     z_14 = b
    y_21   x_{π_{j|Z=b}(21),j}     z_21 = b
    ...    ...                     ...

Figure 2: Permutation scheme for the conditional permutation importance.

The aim to reflect only the impact of X_j itself in predicting the response Y, rather than its correlations with other predictor variables, can be better achieved by means of a conditional importance measure in the spirit of a partial correlation: We want to measure the association between X_j and Y given the correlation structure between X_j and the other predictor variables in the data set.

To meet this aim, Strobl et al. (2008) suggest a conditional permutation scheme, where X_j is permuted only within groups of observations with Z = z in order to preserve the correlation structure between X_j and the other predictor variables, as illustrated in Figure 2.
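The within-group permutation of Figure 2 can be sketched in base R. The quartile-based grouping used here is a made-up stand-in for the tree-based conditioning grid described in the next section; it only serves to show that conditional permutation preserves the correlation between X_j and Z while an unconditional permutation destroys it:

```r
## Sketch: permute x only within cells defined by a conditioning
## variable z (cells here: quartiles of z, an illustrative choice).
set.seed(2)
n <- 200
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.3)             # x is strongly correlated with z

cell <- cut(z, breaks = quantile(z, probs = 0:4 / 4),
            include.lowest = TRUE)       # 4 cells of ~50 observations

x_cond <- ave(x, cell, FUN = sample)     # permute x within each cell only

cor(x, z)            # original correlation (high)
cor(x_cond, z)       # conditional permutation: correlation largely kept
cor(sample(x), z)    # unconditional permutation: correlation destroyed
```

Because the x-z dependence survives the conditional shuffle, any remaining drop in prediction accuracy can be attributed to the association between X_j and Y itself, not to the correlation with Z.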
With this new, conditional permutation scheme, the importance measure is able to reveal the spurious correlation between shoe size and reading skills:

> set.seed(42)
> varimp(readingSkills.cf, conditional = TRUE)
nativeSpeaker           age      shoeSize
    11.161887     44.388450      1.087162

Only by means of conditional importance does it become clear that the covariate native speaker is actually more relevant for predicting the test score than shoe size, whose conditional effect is negligible. Thus, the conditional importance mimics the behavior of partial correlations or linear regression coefficients, with which many readers may be more familiar. However, whether a marginal or conditional importance measure is to be preferred depends on the actual research question.
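The linear-model analogue can be seen directly on simulated data. The data-generating model below is a made-up stand-in for readingSkills (the real data set ships with party), chosen only so that shoe size and test score are both driven by age:

```r
## Marginal vs. partial view of a spurious predictor (simulated data).
set.seed(3)
n        <- 200
age      <- runif(n, 6, 12)
shoeSize <- 2 * age + rnorm(n)           # grows with age, no own effect
score    <- 5 * age + rnorm(n)           # driven by age only

## Marginally, shoe size looks highly predictive of the score ...
cor(shoeSize, score)
coef(lm(score ~ shoeSize))

## ... but given age, its regression coefficient is (correctly) near zero:
coef(lm(score ~ age + shoeSize))
```

The conditional permutation importance behaves like the second fit: once age is accounted for, shoe size contributes essentially nothing.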
Current research also investigates the impact that the choice of the tuning parameter mtry, which regulates the number of randomly preselected predictor variables that can be chosen in each split (cf. Strobl et al., 2008), and parameters regulating the depth of the trees have on variable importance.

How is the conditioning grid defined technically?

Conditioning is straightforward whenever the variables to be conditioned on, Z, are categorical (cf., e.g., Nason et al., 2004). However, conditioning on continuous variables, which may entail as many different values as observations in the sample, would produce cells with very sparse counts, which would make permuting the values of X_j within each cell rather pointless. Thus, in order to create cells of reasonable size for conditioning, continuous variables need to be discretized.

As a straightforward discretization strategy for random forests, Strobl et al. (2008) suggest defining the conditioning grid by means of the partition of the feature space induced by each individual tree. This grid can be used to conditionally permute the values of X_j within cells defined by combinations of Z, where Z can contain potentially large sets of covariates of different scales of measurement.

The main advantages of this approach are that this partition has already been learned from the data during model fitting, that it can contain splits in categorical, ordered and continuous predictor variables, and that it can thus serve as an internally available means for discretizing the feature space. For ease of computation, the conditioning grid employed in varimp uses all cut-points as bisectors of the sample space (the same approach is followed by Nason et al., 2004).

The set of variables Z to be conditioned on should contain all variables that are correlated with the current variable of interest X_j. In the varimp function, this is assured by the small default value 0.2 of the threshold argument: By default, all variables whose correlation with X_j meets the condition 1 − p-value > 0.2 are used for conditioning. A larger value of threshold would have the effect that only those variables that are strongly correlated with X_j would be used for conditioning, but would also lower the computational burden.

Note that the same permutation tests that are used for split selection in the tree-building process (Hothorn et al., 2006) are used here to measure the association between X_j and the remaining covariates.

A short recipe for fitting random forests and computing variable-importance measures with R

To conclude, we would like to summarize the application of conditional variable importance and general issues in fitting random forests with R. Depending on certain characteristics of your data set, we suggest the following approaches:

• If all predictor variables are of the same type (for example: all continuous or all unordered categorical with the same number of categories), use either randomForest (randomForest) or cforest (party). While randomForest is computationally faster, cforest is safe even for variables of different types. For predictor variables of the same type, Gini importance, importance(obj, type = 2), or permutation importance, importance(obj, type = 1), available for randomForest, and permutation importance, varimp(obj), available for cforest, are all adequate importance measures.

• If the predictor variables are of different types (for example: different scales of measurement, different numbers of categories), use cforest (party) with the default option controls = cforest_unbiased and permutation importance varimp(obj).

• If the predictor variables are correlated, depending on your research question, conditional importance, available via varimp(obj, conditional = TRUE) for cforest (party), can add to the understanding of your data.

General remarks:

• Note that the default settings for mtry differ in randomForest and cforest: In randomForest the default setting for classification, e.g., is floor(sqrt(ncol(x))), while in cforest it is fixed to the value 5 for technical reasons.

• Always check whether you get the same results with a different random seed before interpreting the variable-importance ranking! If the ranking of even the top-scoring predictor variables depends on the choice of the random seed, increase the number of trees (argument ntree in randomForest and cforest_control).
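The recipe can be put together in one script (a sketch, assuming the party package is installed; readingSkills ships with party). The seed values and the choice of ntree = 500, larger than the 50 trees used in the illustration above, are ours, picked only to make the ranking more stable:

```r
## Fitting an unbiased forest and computing both importance measures.
library("party")

set.seed(290875)
rs.cf <- cforest(score ~ ., data = readingSkills,
                 controls = cforest_unbiased(mtry = 2, ntree = 500))

## unconditional (marginal) permutation importance
varimp(rs.cf)

## conditional importance; the threshold argument (default 0.2)
## selects the covariates Z to condition on
varimp(rs.cf, conditional = TRUE)

## re-fit with a different seed to check that the ranking is stable
set.seed(12345)
rs.cf2 <- cforest(score ~ ., data = readingSkills,
                  controls = cforest_unbiased(mtry = 2, ntree = 500))
varimp(rs.cf2, conditional = TRUE)
```

If the two conditional rankings disagree on the top variables, increase ntree further before drawing any substantive conclusions.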
Bibliography

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.

H. Kim and W. Loh. Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96(454):589–604, 2001.

A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

A. Liaw and M. Wiener. randomForest: Breiman and Cutler's Random Forests for Classification and Regression, 2008. URL http://CRAN.R-project.org/package=randomForest. R package version 4.5-28.

K. L. Lunetta, L. B. Hayward, J. Segal, and P. V. Eerdewegh. Screening large-scale association study data: Exploiting interactions using random forests. BMC Genetics, 5:32, 2004.

M. Nason, S. Emerson, and M. Leblanc. CARTscans: A tool for visualizing complex models. Journal of Computational and Graphical Statistics, 13(4):1–19, 2004.

Y.-S. Shih. Regression trees with unbiased variable selection. Statistica Sinica, 12:361–386, 2002.

C. Strobl and A. Zeileis. Danger: High power! Exploring the statistical properties of a test for random forest variable importance. In Proceedings of the 18th International Conference on Computational Statistics, Porto, Portugal, 2008.

C. Strobl, A.-L. Boulesteix, and T. Augustin. Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483–501, 2007a.

C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:25, 2007b.

C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9:307, 2008.

C. Strobl, J. Malley, and G. Tutz. An introduction to recursive partitioning: Rationale, application and characteristics of classification and regression trees, bagging and random forests. Psychological Methods, 2009. In press.

T. M. Therneau, B. Atkinson, and B. D. Ripley. rpart: Recursive Partitioning, 2008. URL http://CRAN.R-project.org/package=rpart. R package version 3.1-41.

Carolin Strobl
Department of Statistics
Ludwig-Maximilians-Universität
Munich, Germany
carolin.strobl@stat.uni-muenchen.de

The R Journal Vol. 1/2, December 2009. ISSN 2073-4859
