Abstract
We attempt to give a unifying view of the various recent attempts to (i) improve the interpretability of tree-based models and (ii) debias the default variable-importance measure in random forests, Gini importance. In particular, we demonstrate a common thread among the out-of-bag based bias correction methods and their connection to local explanation for trees. In addition, we point out a bias caused by the inclusion of inbag data in the newly developed SHAP values and suggest a remedy.
Notes
1. Appendix A.1 contains expanded definitions and more thorough notation.
2. A Python library is available: https://github.com/slundberg/shap.
3. In all random forest simulations, we choose \(mtry=2, ntrees=100\) and exclude rows with missing Age (see the sketch following these notes).
4. For easier notation we have (i) left out the multiplier 2 and (ii) omitted an index for the class membership.
References
Adler, A.I., Painsky, A.: Feature importance in gradient boosting trees with cross-validation feature selection. Entropy 24(5), 687 (2022)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Díaz-Uriarte, R., De Andres, S.A.: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7(1), 3 (2006)
Grömping, U.: Variable importance assessment in regression: linear regression versus random forest. Am. Stat. 63(4), 308–319 (2009)
Grömping, U.: Variable importance in regression models. Wiley Interdiscip. Rev. Comput. Stat. 7(2), 137–152 (2015)
Hothorn, T., Hornik, K., Zeileis, A.: Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Stat. 15(3), 651–674 (2006)
Kim, H., Loh, W.Y.: Classification trees with unbiased multiway splits. J. Am. Stat. Assoc. 96(454), 589–604 (2001)
Li, X., Wang, Y., Basu, S., Kumbier, K., Yu, B.: A debiased MDI feature importance measure for random forests. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8049–8059 (2019)
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002). https://CRAN.R-project.org/doc/Rnews/
Loecher, M.: Unbiased variable importance for random forests. Commun. Stat. Theory Methods 51, 1–13 (2020)
Loh, W.Y., Shih, Y.S.: Split selection methods for classification trees. Stat. Sin. 7, 815–840 (1997)
Lundberg, S.M., et al.: From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2(1), 56–67 (2020)
Menze, B.H., Kelm, B.M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., Hamprecht, F.A.: A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1), 213 (2009)
Nembrini, S., König, I.R., Wright, M.N.: The revival of the Gini importance? Bioinformatics 34(21), 3711–3718 (2018)
Olson, R.S., Cava, W.L., Mustahsan, Z., Varik, A., Moore, J.H.: Data-driven advice for applying machine learning to bioinformatics problems. In: Pacific Symposium on Biocomputing 2018: Proceedings of the Pacific Symposium, pp. 192–203. World Scientific (2018)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Saabas, A.: Interpreting random forests (2019). http://blog.datadive.net/interpreting-random-forests/
Saabas, A.: Treeinterpreter library (2019). https://github.com/andosa/treeinterpreter
Sandri, M., Zuccolotto, P.: A bias correction algorithm for the Gini variable importance measure in classification trees. J. Comput. Graph. Stat. 17(3), 611–628 (2008)
Shih, Y.S.: A note on split selection bias in classification trees. Comput. Stat. Data Anal. 45(3), 457–466 (2004)
Shih, Y.S., Tsai, H.W.: Variable selection bias in regression trees with constant fits. Comput. Stat. Data Anal. 45(3), 595–607 (2004)
Strobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T.: Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8, 1–21 (2007). https://doi.org/10.1186/1471-2105-8-25
Strobl, C., Boulesteix, A.L., Augustin, T.: Unbiased split selection for classification trees based on the Gini index. Comput. Stat. Data Anal. 52(1), 483–501 (2007)
Sun, Q.: tree.interpreter: Random Forest Prediction Decomposition and Feature Importance Measure (2020). https://CRAN.R-project.org/package=tree.interpreter. R package version 0.1.1
Wright, M.N., Ziegler, A.: ranger: a fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. 77(1), 1–17 (2017). https://doi.org/10.18637/jss.v077.i01
Zhou, Z., Hooker, G.: Unbiased measurement of feature importance in tree-based methods. ACM Trans. Knowl. Discov. Data (TKDD) 15(2), 1–21 (2021)
A Appendix
1.1 A.1 Background and Notations
Definitions needed to understand Eq. (4). (The following paragraph closely follows the definitions in [8].)
Random Forests (RFs) are an ensemble of classification and regression trees, where each tree T defines a mapping from the feature space to the response. Trees are constructed independently of one another on a bootstrapped or subsampled data set \(\mathcal {D}^{(T)}\) of the original data \(\mathcal {D}\). Any node t in a tree T represents a subset (usually a hyper-rectangle) \(R_{t}\) of the feature space. A split of the node t is a pair (k, z) which divides the hyper-rectangle \(R_{t}\) into two hyper-rectangles \(R_{t} \cap \mathbbm {1}\left( X_{k} \le z\right) \) and \(R_{t} \cap \mathbbm {1}\left( X_{k}>z\right) \), corresponding to the left child \(t_{left}\) and the right child \(t_{right}\) of node t, respectively. For a node t in a tree T, \(N_{n}(t)=\left| \left\{ i \in \mathcal {D}^{(T)}: \mathbf {x}_{i} \in R_{t}\right\} \right| \) denotes the number of samples falling into \(R_{t}\), and \(\mu _{n}(t)=\frac{1}{N_{n}(t)} \sum _{i \in \mathcal {D}^{(T)}:\, \mathbf {x}_{i} \in R_{t}} y_{i}\) denotes the corresponding mean response within \(R_{t}\).
We define the set of inner nodes of a tree T as I(T).
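As a concrete illustration of this notation, the following Python sketch, assuming a fitted scikit-learn decision tree (the paper's simulations use random forests, so this is illustrative only), enumerates the inner nodes \(I(T)\) together with their splits (k, z) and node sizes \(N_{n}(t)\):

```python
# Minimal sketch: list the inner nodes I(T), their splits (k, z) and sizes N_n(t)
# of a single fitted scikit-learn tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
t = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y).tree_

inner_nodes = [n for n in range(t.node_count) if t.children_left[n] != -1]  # I(T)
for n in inner_nodes:
    k, z = t.feature[n], t.threshold[n]          # split (k, z) at this node
    print(f"node {n}: X_{k} <= {z:.3f}, N_n(t) = {t.n_node_samples[n]}")
```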
1.2 A.2 Debiasing MDI via OOB Samples
In this section we give a short version of the proof that \(PG_{oob}^{(1,2)}\) is equivalent to the MDI-oob measure defined in [8]. For clarity we assume binary classification; Appendix A.3 contains an expanded version of the proof, including the multi-class case. As elegantly demonstrated by [8], the MDI of feature k in a tree T can be written as
where \(\mathcal {D}^{(T)}\) is the bootstrapped or subsampled data set of the original data \(\mathcal {D}\). Since \(\sum _{i \in \mathcal {D}^{(T)}}{f_{T, k}(x_i) } = 0\), we can view MDI essentially as the sample covariance between \(f_{T, k}(x_i)\) and \(y_i\) on the bootstrapped dataset \(\mathcal {D}^{(T)}\).
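For reference, the display referred to above has the following form in the notation of [8]; this is a reconstruction from the surrounding text, and the exact normalization in the original may differ:

$$ MDI_k(T) \;=\; \frac{1}{\left| \mathcal {D}^{(T)}\right| } \sum _{i \in \mathcal {D}^{(T)}} f_{T, k}(\mathbf {x}_i)\, y_i , $$

where \(f_{T, k}\) collects, for feature k, the node-wise contributions of the tree's local (Saabas-type) decomposition of the prediction.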
MDI-oob is based on the usual variance reduction per node as shown in Eq. (34) (proof of Proposition (1)), but with a “variance” defined as the mean squared deviations of \(y_{oob}\) from the inbag mean \(\mu _{in}\):
We can, of course, rewrite the variance as
where the last equality is for Bernoulli \(y_i\), in which case the means \(\mu _{in/oob}\) become proportions \(p_{in/oob}\) and the first sum is equal to the binomial variance \(p_{oob} \cdot (1- p_{oob})\). The final expression is effectively equal to \(PG_{oob}^{(1,2)}\).
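Written out, the decomposition described above reads (a reconstruction; node subscripts are suppressed and \(n_{oob}\) denotes the number of OOB samples in the node):

$$ \frac{1}{n_{oob}} \sum _{i \in oob} \left( y_i - \mu _{in}\right) ^2 \;=\; \frac{1}{n_{oob}} \sum _{i \in oob} \left( y_i - \mu _{oob}\right) ^2 \;+\; \left( \mu _{oob} - \mu _{in}\right) ^2 \;=\; p_{oob}\,(1-p_{oob}) \;+\; \left( p_{oob} - p_{in}\right) ^2 , $$

where the cross term vanishes because \(\sum _{i \in oob} (y_i - \mu _{oob}) = 0\).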
Lastly, we show that \(PG_{oob}^{(0.5,1)}\) is equivalent to the unbiased split-improvement measure defined in [26]. For the binary classification case, we can rewrite \(PG_{oob}^{(0.5,1)}\) as follows:
1.3 A.3 Variance Reduction View
Here, we provide a full version of the proof sketched in Sect. A.2, which leans heavily on the proof of Proposition (1) in [8].
We consider the usual variance reduction per node but with a “variance” defined as the mean squared deviations of \(y_{oob}\) from the inbag mean \(\mu _{in}\):
where the last equality is for Bernoulli \(y_i\), in which case the means \(\mu _{in/oob}\) become proportions \(p_{in/oob}\) and we replace the squared deviations with the binomial variance \(p_{oob} \cdot (1- p_{oob} )\). The final expression is then
which, of course, is exactly the impurity reduction due to \(PG_{oob}^{(1,2)}\):
Another, somewhat surprising view of MDI is given by Eqs. (6) and (4), which for binary classification reads as:
and for the oob version:
1.4 A.4 \(\mathbf {E(\varDelta \widehat{PG}_{oob}^{(0)}) = 0}\)
The decrease in impurity (\(\varDelta G\)) for a parent node m is the weighted difference between the Gini impurity \(G(m) = \hat{p}_m (1- \hat{p}_m )\) of node m (cf. Footnote 4) and those of its left and right children:
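In the usual notation this reads as follows (a reconstruction; \(N_m\), \(N_{m_l}\), \(N_{m_r}\) denote the node sample sizes):

$$ \varDelta G(m) \;=\; G(m) \;-\; \frac{N_{m_l}}{N_m}\, G(m_l) \;-\; \frac{N_{m_r}}{N_m}\, G(m_r) . $$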
We assume that the node m splits on an uninformative variable \(X_j\), i.e. \(X_j\) and Y are independent.
We will use the shorthand \(\sigma ^2_{m, \cdot } \equiv p_{m,\cdot } (1-p_{m,\cdot })\), where the dot stands for either oob or in, and rely on the following facts and notation:
1. \(E[\hat{p}_{m, oob}] = p_{m,oob}\) is the “population” proportion of the class label in the OOB data of node m.
2. \(E[\hat{p}_{m, in}] = p_{m,in}\) is the “population” proportion of the class label in the inbag data of node m.
3. \(E[\hat{p}_{m, oob}] = E[\hat{p}_{m_l, oob}] = E[\hat{p}_{m_r, oob}] = p_{m,oob}\).
4. \(E[\hat{p}_{m, oob}^2] = var(\hat{p}_{m, oob}) + E[\hat{p}_{m, oob}]^2 = \sigma ^2_{m, oob}/N_m + p_{m,oob}^2\)
\(\Rightarrow E[G_{oob}(m)] = E[\hat{p}_{m, oob}] - E[\hat{p}_{m, oob}^2] = \sigma ^2_{m, oob} \cdot \left( 1- \frac{1}{N_m}\right) \)
\(\Rightarrow E[\widehat{G}_{oob}(m)] = \sigma ^2_{m, oob}\)
5. \(E[\hat{p}_{m, oob} \cdot \hat{p}_{m, in}] = E[\hat{p}_{m, oob}] \cdot E[\hat{p}_{m, in}] = p_{m,oob} \cdot p_{m,in}\).
Equalities 3 and 5 hold because of the independence of the inbag and out-of-bag data as well as the independence of \(X_j\) and Y.
We now show that \(\mathbf {E(\varDelta PG_{oob}^{(0)}) \ne 0}\), using the shorter notation \(G_{oob} = PG_{oob}^{(0)}\):
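Using facts 3 and 4 above (and treating the child sample sizes \(N_{m_l}\), \(N_{m_r}\) as given), the omitted algebra can be reconstructed as follows, writing \(\sigma ^2 = \sigma ^2_{m, oob}\):

$$ E[\varDelta G_{oob}(m)] \;=\; \sigma ^2\Big (1-\tfrac{1}{N_m}\Big ) \;-\; \frac{N_{m_l}}{N_m}\,\sigma ^2\Big (1-\tfrac{1}{N_{m_l}}\Big ) \;-\; \frac{N_{m_r}}{N_m}\,\sigma ^2\Big (1-\tfrac{1}{N_{m_r}}\Big ) \;=\; \frac{\sigma ^2}{N_m} \;\ne \; 0 . $$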
We see that there is a bias if only OOB data are used, and it becomes more pronounced for nodes with smaller sample sizes. This is relevant because visualizations of random forests show that splits on uninformative variables occur most frequently at “deeper” nodes.
The above bias is due to the well-known bias of the plug-in variance estimator, which can be eliminated with the bias correction outlined in the main text. We now show that the bias of this modified Gini impurity is zero for OOB data. As before, \(\widehat{G}_{oob} = \widehat{PG}_{oob}^{(0)}\):
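Since, by fact 4, \(E[\widehat{G}_{oob}(\cdot )] = \sigma ^2_{m, oob}\) for the parent node and (under the null hypothesis) for both children, the weighted decrease has zero expectation when the child sample sizes are treated as given:

$$ E[\varDelta \widehat{G}_{oob}(m)] \;=\; \sigma ^2_{m, oob} \;-\; \frac{N_{m_l}}{N_m}\,\sigma ^2_{m, oob} \;-\; \frac{N_{m_r}}{N_m}\,\sigma ^2_{m, oob} \;=\; 0 . $$

The following minimal Python simulation illustrates both statements. It is a sketch under two assumptions that this appendix does not spell out: \(PG_{oob}^{(0)}\) is taken to mean the Gini decrease computed from OOB data alone, and the bias-corrected version is taken to be the \(N/(N-1)\) rescaling suggested by fact 4.

```python
# Hedged simulation sketch: OOB-only Gini decrease for an uninformative split.
# "Corrected" rescales the plug-in Gini by N/(N-1); this is an illustration,
# not necessarily the paper's exact estimator.
import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    p = y.mean()
    return p * (1.0 - p)                        # plug-in Gini p_hat (1 - p_hat)

def gini_hat(y):
    n = len(y)
    return n / (n - 1.0) * gini(y)              # N/(N-1) bias-corrected version

def delta(y, left, g):
    """Weighted decrease G(m) - (N_l/N_m) G(m_l) - (N_r/N_m) G(m_r)."""
    return g(y) - left.mean() * g(y[left]) - (~left).mean() * g(y[~left])

n_oob, p, reps = 20, 0.3, 200_000
plain, corrected = [], []
for _ in range(reps):
    y = rng.binomial(1, p, n_oob)               # OOB labels, independent of X_j
    left = rng.random(n_oob) < 0.5              # split on an uninformative variable
    if left.sum() < 2 or (~left).sum() < 2:     # skip degenerate children
        continue
    plain.append(delta(y, left, gini))
    corrected.append(delta(y, left, gini_hat))

print("mean Delta G_oob (plug-in)  :", np.mean(plain))      # ~ p(1-p)/n_oob > 0
print("mean Delta G_oob (corrected):", np.mean(corrected))  # ~ 0
```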
© 2022 IFIP International Federation for Information Processing