Abstract
We attempt to give a unifying view of the various recent attempts to (i) improve the interpretability of tree-based models and (ii) debias the default variable-importance measure in random forests, Gini importance. In particular, we demonstrate a common thread among the out-of-bag based bias correction methods and their connection to local explanation for trees. In addition, we point out a bias caused by the inclusion of inbag data in the newly developed SHAP values and suggest a remedy.
Notes
1. Appendix A.1 contains expanded definitions and more thorough notation.
2. A Python library is available: https://github.com/slundberg/shap.
3. In all random forest simulations, we choose \(mtry=2, ntrees=100\) and exclude rows with missing Age (see the sketch following these notes).
4. For easier notation we have (i) left out the multiplier 2 and (ii) omitted an index for the class membership.
References
Adler, A.I., Painsky, A.: Feature importance in gradient boosting trees with cross-validation feature selection. Entropy 24(5), 687 (2022)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Díaz-Uriarte, R., De Andres, S.A.: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7(1), 3 (2006)
Grömping, U.: Variable importance assessment in regression: linear regression versus random forest. Am. Stat. 63(4), 308–319 (2009)
Grömping, U.: Variable importance in regression models. Wiley Interdiscip. Rev. Comput. Stat. 7(2), 137–152 (2015)
Hothorn, T., Hornik, K., Zeileis, A.: Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Stat. 15(3), 651–674 (2006)
Kim, H., Loh, W.Y.: Classification trees with unbiased multiway splits. J. Am. Stat. Assoc. 96(454), 589–604 (2001)
Li, X., Wang, Y., Basu, S., Kumbier, K., Yu, B.: A debiased MDI feature importance measure for random forests. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8049–8059 (2019)
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002). https://CRAN.R-project.org/doc/Rnews/
Loecher, M.: Unbiased variable importance for random forests. Commun. Stat. Theory Methods 51, 1–13 (2020)
Loh, W.Y., Shih, Y.S.: Split selection methods for classification trees. Stat. Sin. 7, 815–840 (1997)
Lundberg, S.M., et al.: From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2(1), 56–67 (2020)
Menze, B.H., Kelm, B.M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., Hamprecht, F.A.: A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1), 213 (2009)
Nembrini, S., König, I.R., Wright, M.N.: The revival of the Gini importance? Bioinformatics 34(21), 3711–3718 (2018)
Olson, R.S., Cava, W.L., Mustahsan, Z., Varik, A., Moore, J.H.: Data-driven advice for applying machine learning to bioinformatics problems. In: Pacific Symposium on Biocomputing 2018: Proceedings of the Pacific Symposium, pp. 192–203. World Scientific (2018)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Saabas, A.: Interpreting random forests (2019). http://blog.datadive.net/interpreting-random-forests/
Saabas, A.: Treeinterpreter library (2019). https://github.com/andosa/treeinterpreter
Sandri, M., Zuccolotto, P.: A bias correction algorithm for the Gini variable importance measure in classification trees. J. Comput. Graph. Stat. 17(3), 611–628 (2008)
Shih, Y.S.: A note on split selection bias in classification trees. Comput. Stat. Data Anal. 45(3), 457–466 (2004)
Shih, Y.S., Tsai, H.W.: Variable selection bias in regression trees with constant fits. Comput. Stat. Data Anal. 45(3), 595–607 (2004)
Strobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T.: Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8, 1–21 (2007). https://doi.org/10.1186/1471-2105-8-25
Strobl, C., Boulesteix, A.L., Augustin, T.: Unbiased split selection for classification trees based on the Gini index. Comput. Stat. Data Anal. 52(1), 483–501 (2007)
Sun, Q.: tree.interpreter: Random Forest Prediction Decomposition and Feature Importance Measure (2020). https://CRAN.R-project.org/package=tree.interpreter. R package version 0.1.1
Wright, M.N., Ziegler, A.: ranger: a fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. 77(1), 1–17 (2017). https://doi.org/10.18637/jss.v077.i01
Zhou, Z., Hooker, G.: Unbiased measurement of feature importance in tree-based methods. ACM Trans. Knowl. Discov. Data (TKDD) 15(2), 1–21 (2021)
A Appendix
1.1 A.1 Background and Notations
Definitions needed to understand Eq. (4). (The following paragraph closely follows the definitions in [8].)
Random Forests (RFs) are an ensemble of classification and regression trees, where each tree T defines a mapping from the feature space to the response. Trees are constructed independently of one another on a bootstrapped or subsampled data set \(\mathcal {D}^{(T)}\) of the original data \(\mathcal {D}\). Any node t in a tree T represents a subset (usually a hyper-rectangle) \(R_{t}\) of the feature space. A split of the node t is a pair (k, z) which divides the hyper-rectangle \(R_{t}\) into two hyper-rectangles \(R_{t} \cap \mathbbm {1}\left( X_{k} \le z\right) \) and \(R_{t} \cap \mathbbm {1}\left( X_{k}>z\right) \), corresponding to the left child \(t_{left}\) and the right child \(t_{right}\) of node t, respectively. For a node t in a tree T, \(N_{n}(t)=\left| \left\{ i \in \mathcal {D}^{(T)}: \mathbf {x}_{i} \in R_{t}\right\} \right| \) denotes the number of samples falling into \(R_{t}\), and \(\mu _{n}(t)=\frac{1}{N_{n}(t)} \sum _{i \in \mathcal {D}^{(T)}:\, \mathbf {x}_{i} \in R_{t}} y_{i}\) denotes the corresponding mean response within \(R_{t}\).
We define the set of inner nodes of a tree T as I(T).
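As a concrete illustration of this notation, the following Python sketch, assuming a fitted scikit-learn decision tree (the paper's simulations use random forests, so this is illustrative only), enumerates the inner nodes \(I(T)\) together with their splits (k, z) and node sizes \(N_{n}(t)\):

```python
# Minimal sketch: list the inner nodes I(T), their splits (k, z) and sizes N_n(t)
# of a single fitted scikit-learn tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
t = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y).tree_

inner_nodes = [n for n in range(t.node_count) if t.children_left[n] != -1]  # I(T)
for n in inner_nodes:
    k, z = t.feature[n], t.threshold[n]          # split (k, z) at this node
    print(f"node {n}: X_{k} <= {z:.3f}, N_n(t) = {t.n_node_samples[n]}")
```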
1.2 A.2 Debiasing MDI via OOB Samples
In this section we give a short version of the proof that \(PG_{oob}^{(1,2)}\) is equivalent to the MDI-oob measure defined in [8]. For clarity we assume binary classification; Appendix A.3 contains an expanded version of the proof, including the multi-class case. As elegantly demonstrated by [8], the MDI of feature k in a tree T can be written as
where \(\mathcal {D}^{(T)}\) is the bootstrapped or subsampled data set of the original data \(\mathcal {D}\). Since \(\sum _{i \in \mathcal {D}^{(T)}}{f_{T, k}(x_i) } = 0\), we can view MDI essentially as the sample covariance between \(f_{T, k}(x_i)\) and \(y_i\) on the bootstrapped dataset \(\mathcal {D}^{(T)}\).
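For reference, the display referred to above has the following form in the notation of [8]; this is a reconstruction from the surrounding text, and the exact normalization in the original may differ:

$$ MDI_k(T) \;=\; \frac{1}{\left| \mathcal {D}^{(T)}\right| } \sum _{i \in \mathcal {D}^{(T)}} f_{T, k}(\mathbf {x}_i)\, y_i , $$

where \(f_{T, k}\) collects, for feature k, the node-wise contributions of the tree's local (Saabas-type) decomposition of the prediction.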
MDI-oob is based on the usual variance reduction per node as shown in Eq. (34) (proof of Proposition (1)), but with a “variance” defined as the mean squared deviations of \(y_{oob}\) from the inbag mean \(\mu _{in}\):
We can, of course, rewrite the variance as
where the last equality is for Bernoulli \(y_i\), in which case the means \(\mu _{in/oob}\) become proportions \(p_{in/oob}\) and the first sum is equal to the binomial variance \(p_{oob} \cdot (1- p_{oob})\). The final expression is effectively equal to \(PG_{oob}^{(1,2)}\).
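Written out, the decomposition described above reads (a reconstruction; node subscripts are suppressed and \(n_{oob}\) denotes the number of OOB samples in the node):

$$ \frac{1}{n_{oob}} \sum _{i \in oob} \left( y_i - \mu _{in}\right) ^2 \;=\; \frac{1}{n_{oob}} \sum _{i \in oob} \left( y_i - \mu _{oob}\right) ^2 \;+\; \left( \mu _{oob} - \mu _{in}\right) ^2 \;=\; p_{oob}\,(1-p_{oob}) \;+\; \left( p_{oob} - p_{in}\right) ^2 , $$

where the cross term vanishes because \(\sum _{i \in oob} (y_i - \mu _{oob}) = 0\).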
Lastly, we show that \(PG_{oob}^{(0.5,1)}\) is equivalent to the unbiased split-improvement measure defined in [26]. For the binary classification case, we can rewrite \(PG_{oob}^{(0.5,1)}\) as follows:
1.3 A.3 Variance Reduction View
Here, we provide a full version of the proof sketched in Sect. A.2, which leans heavily on the proof of Proposition (1) in [8].
We consider the usual variance reduction per node but with a “variance” defined as the mean squared deviations of \(y_{oob}\) from the inbag mean \(\mu _{in}\):
where the last equality is for Bernoulli \(y_i\), in which case the means \(\mu _{in/oob}\) become proportions \(p_{in/oob}\) and we replace the squared deviations with the binomial variance \(p_{oob} \cdot (1- p_{oob} )\). The final expression is then
which, of course, is exactly the impurity reduction due to \(PG_{oob}^{(1,2)}\):
Another, somewhat surprising view of MDI is given by Eqs. (6) and (4), which for binary classification reads as:
and for the oob version:
1.4 A.4 \(\mathbf {E(\varDelta \widehat{PG}_{oob}^{(0)}) = 0}\)
The decrease in impurity (\(\varDelta G\)) for a parent node m is the weighted difference between the Gini impurity \(G(m) = \hat{p}_m (1- \hat{p}_m )\) of node m (cf. Footnote 4) and those of its left and right children:
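In the usual notation this reads as follows (a reconstruction; \(N_m\), \(N_{m_l}\), \(N_{m_r}\) denote the node sample sizes):

$$ \varDelta G(m) \;=\; G(m) \;-\; \frac{N_{m_l}}{N_m}\, G(m_l) \;-\; \frac{N_{m_r}}{N_m}\, G(m_r) . $$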
We assume that the node m splits on an uninformative variable \(X_j\), i.e. \(X_j\) and Y are independent.
We will use the shorthand \(\sigma ^2_{m, \cdot } \equiv p_{m,\cdot } (1-p_{m,\cdot })\), where the dot stands for either oob or in, and rely on the following facts and notation:
1. \(E[\hat{p}_{m, oob}] = p_{m,oob}\) is the “population” proportion of the class label in the OOB data of node m.
2. \(E[\hat{p}_{m, in}] = p_{m,in}\) is the “population” proportion of the class label in the inbag data of node m.
3. \(E[\hat{p}_{m, oob}] = E[\hat{p}_{m_l, oob}] = E[\hat{p}_{m_r, oob}] = p_{m,oob}\).
4. \(E[\hat{p}_{m, oob}^2] = var(\hat{p}_{m, oob}) + E[\hat{p}_{m, oob}]^2 = \sigma ^2_{m, oob}/N_m + p_{m,oob}^2\)
\(\Rightarrow E[G_{oob}(m)] = E[\hat{p}_{m, oob}] - E[\hat{p}_{m, oob}^2] = \sigma ^2_{m, oob} \cdot \left( 1- \frac{1}{N_m}\right) \)
\(\Rightarrow E[\widehat{G}_{oob}(m)] = \sigma ^2_{m, oob}\)
5. \(E[\hat{p}_{m, oob} \cdot \hat{p}_{m, in}] = E[\hat{p}_{m, oob}] \cdot E[\hat{p}_{m, in}] = p_{m,oob} \cdot p_{m,in}\).
Equalities 3 and 5 hold because of the independence of the inbag and out-of-bag data as well as the independence of \(X_j\) and Y.
We now show that \(\mathbf {E(\varDelta PG_{oob}^{(0)}) \ne 0}\), using the shorter notation \(G_{oob} = PG_{oob}^{(0)}\):
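Using facts 3 and 4 above (and treating the child sample sizes \(N_{m_l}\), \(N_{m_r}\) as given), the omitted algebra can be reconstructed as follows, writing \(\sigma ^2 = \sigma ^2_{m, oob}\):

$$ E[\varDelta G_{oob}(m)] \;=\; \sigma ^2\Big (1-\tfrac{1}{N_m}\Big ) \;-\; \frac{N_{m_l}}{N_m}\,\sigma ^2\Big (1-\tfrac{1}{N_{m_l}}\Big ) \;-\; \frac{N_{m_r}}{N_m}\,\sigma ^2\Big (1-\tfrac{1}{N_{m_r}}\Big ) \;=\; \frac{\sigma ^2}{N_m} \;\ne \; 0 . $$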
We see that there is a bias if only OOB data are used, and it becomes more pronounced for nodes with smaller sample sizes. This is relevant because visualizations of random forests show that splits on uninformative variables occur most frequently at “deeper” nodes.
The above bias is due to the well-known bias of the plug-in variance estimator, which can be eliminated with the bias correction outlined in the main text. We now show that the bias of this modified Gini impurity is zero for OOB data. As before, \(\widehat{G}_{oob} = \widehat{PG}_{oob}^{(0)}\):
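Since, by fact 4, \(E[\widehat{G}_{oob}(\cdot )] = \sigma ^2_{m, oob}\) for the parent node and (under the null hypothesis) for both children, the weighted decrease has zero expectation when the child sample sizes are treated as given:

$$ E[\varDelta \widehat{G}_{oob}(m)] \;=\; \sigma ^2_{m, oob} \;-\; \frac{N_{m_l}}{N_m}\,\sigma ^2_{m, oob} \;-\; \frac{N_{m_r}}{N_m}\,\sigma ^2_{m, oob} \;=\; 0 . $$

The following minimal Python simulation illustrates both statements. It is a sketch under two assumptions that this appendix does not spell out: \(PG_{oob}^{(0)}\) is taken to mean the Gini decrease computed from OOB data alone, and the bias-corrected version is taken to be the \(N/(N-1)\) rescaling suggested by fact 4.

```python
# Hedged simulation sketch: OOB-only Gini decrease for an uninformative split.
# "Corrected" rescales the plug-in Gini by N/(N-1); this is an illustration,
# not necessarily the paper's exact estimator.
import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    p = y.mean()
    return p * (1.0 - p)                        # plug-in Gini p_hat (1 - p_hat)

def gini_hat(y):
    n = len(y)
    return n / (n - 1.0) * gini(y)              # N/(N-1) bias-corrected version

def delta(y, left, g):
    """Weighted decrease G(m) - (N_l/N_m) G(m_l) - (N_r/N_m) G(m_r)."""
    return g(y) - left.mean() * g(y[left]) - (~left).mean() * g(y[~left])

n_oob, p, reps = 20, 0.3, 200_000
plain, corrected = [], []
for _ in range(reps):
    y = rng.binomial(1, p, n_oob)               # OOB labels, independent of X_j
    left = rng.random(n_oob) < 0.5              # split on an uninformative variable
    if left.sum() < 2 or (~left).sum() < 2:     # skip degenerate children
        continue
    plain.append(delta(y, left, gini))
    corrected.append(delta(y, left, gini_hat))

print("mean Delta G_oob (plug-in)  :", np.mean(plain))      # ~ p(1-p)/n_oob > 0
print("mean Delta G_oob (corrected):", np.mean(corrected))  # ~ 0
```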
© 2022 IFIP International Federation for Information Processing