We attempt to give a unifying view of the various recent attempts to (i) improve the interpretability of tree-based models and (ii) debias the default variable-importance measure in random forests, Gini importance. In particular, we demonstrate a common thread among the out-of-bag based bias correction methods and their connection to local explanation for trees. In addition, we point out a bias caused by the inclusion of inbag data in the newly developed SHAP values and suggest a remedy.
1. Appendix A.1 contains expanded definitions and more thorough notation.
2. A Python library is available: https://github.com/slundberg/shap.
3. In all random forest simulations, we choose \(mtry=2, ntrees=100\) and exclude rows with missing Age.
4. For easier notation we have (i) left out the multiplier 2 and (ii) omitted an index for the class membership.
A Appendix
A.1 Background and Notations
Definitions needed to understand Eq. (4). (The following paragraph closely follows the definitions in [8].)
Random Forests (RFs) are an ensemble of classification and regression trees, where each tree T defines a mapping from the feature space to the response. Trees are constructed independently of one another on a bootstrapped or subsampled data set \(\mathcal {D}^{(T)}\) of the original data \(\mathcal {D}\). Any node t in a tree T represents a subset (usually a hyper-rectangle) \(R_{t}\) of the feature space. A split of the node t is a pair (k, z) which divides the hyper-rectangle \(R_{t}\) into two hyper-rectangles \(R_{t} \cap \mathbbm {1}\left( X_{k} \le z\right) \) and \(R_{t} \cap \mathbbm {1}\left( X_{k}>z\right) \) corresponding to the left child \(t^{\text{left}}\) and right child \(t^{\text{right}}\) of node t, respectively. For a node t in a tree T, \(N_{n}(t)=\left| \left\{ i \in \mathcal {D}^{(T)}: \mathbf {x}_{i} \in R_{t}\right\} \right| \) denotes the number of samples falling into \(R_{t}\) and
\[\mu _{n}(t) := \frac{1}{N_{n}(t)} \sum _{i \in \mathcal {D}^{(T)}:\, \mathbf {x}_{i} \in R_{t}} y_{i}\]
denotes their average response.
We define the set of inner nodes of a tree T as I(T).
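The node and split definitions above can be made concrete with a small sketch. The helper names `in_rectangle` and `split_rectangle`, the toy data, and the chosen split (k=1, z=0.4) are illustrative assumptions, not part of any library or of the original analysis.

```python
import numpy as np

def in_rectangle(X, lower, upper):
    """Boolean mask of the rows of X lying in the hyper-rectangle (lower, upper]."""
    return np.all((X > lower) & (X <= upper), axis=1)

def split_rectangle(lower, upper, k, z):
    """Split a hyper-rectangle on feature k at threshold z.

    Returns the bounds of the left child (X_k <= z) and the right child (X_k > z).
    """
    left_upper = upper.copy()
    left_upper[k] = min(upper[k], z)
    right_lower = lower.copy()
    right_lower[k] = max(lower[k], z)
    return (lower, left_upper), (right_lower, upper)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))       # toy stand-in for the sample D^(T)
y = rng.integers(0, 2, size=200)     # toy binary response

lower = np.full(3, -np.inf)          # root node: R_t is the whole feature space
upper = np.full(3, np.inf)
(left_lo, left_hi), (right_lo, right_hi) = split_rectangle(lower, upper, k=1, z=0.4)

mask_t = in_rectangle(X, lower, upper)
mask_left = in_rectangle(X, left_lo, left_hi)
mask_right = in_rectangle(X, right_lo, right_hi)

# Sample counts per node: the two children partition the parent.
N_t, N_left, N_right = mask_t.sum(), mask_left.sum(), mask_right.sum()
print(N_t, N_left, N_right)
print(y[mask_left].mean(), y[mask_right].mean())  # mean response in each child
```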
A.2 Debiasing MDI via OOB Samples
In this section we give a short version of the proof that \(PG_{oob}^{(1,2)}\) is equivalent to the MDI-oob measure defined in [8]. For clarity we assume binary classification; Appendix A.3 contains an expanded version of the proof including the multi-class case. As elegantly demonstrated by [8], the MDI of feature k in a tree T can be written as
\[\mathrm {MDI}(k, T) = \frac{1}{\left| \mathcal {D}^{(T)}\right| } \sum _{i \in \mathcal {D}^{(T)}} f_{T, k}(x_i) \cdot y_i ,\]
where \(\mathcal {D}^{(T)}\) is the bootstrapped or subsampled data set of the original data \(\mathcal {D}\). Since \(\sum _{i \in \mathcal {D}^{(T)}}{f_{T, k}(x_i) } = 0\), we can view MDI essentially as the sample covariance between \(f_{T, k}(x_i)\) and \(y_i\) on the bootstrapped dataset \(\mathcal {D}^{(T)}\).
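MDI-oob is based on the usual variance reduction per node as shown in Eq. (34) (proof of Proposition (1)), but with a “variance” defined as the mean squared deviation of \(y_{oob}\) from the inbag mean \(\mu _{in}\), where \(N_{oob}\) denotes the number of OOB samples in the node:
\[\frac{1}{N_{oob}} \sum _{i \in oob} \left( y_i - \mu _{in}\right) ^2 .\]
We can, of course, rewrite the variance as
\[\frac{1}{N_{oob}} \sum _{i \in oob} \left( y_i - \mu _{in}\right) ^2 = \frac{1}{N_{oob}} \sum _{i \in oob} \left( y_i - \mu _{oob}\right) ^2 + \left( \mu _{oob} - \mu _{in}\right) ^2 = p_{oob} \cdot (1- p_{oob}) + \left( p_{oob} - p_{in}\right) ^2 ,\]
where the last equality is for Bernoulli \(y_i\), in which case the means \(\mu _{in/oob}\) become proportions \(p_{in/oob}\) and the first sum is equal to the binomial variance \(p_{oob} \cdot (1- p_{oob})\). The final expression is effectively equal to \(PG_{oob}^{(1,2)}\).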
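The identity used in this step can be checked numerically: for binary labels, the mean squared deviation of the OOB labels from the inbag proportion equals the binomial variance of the OOB labels plus the squared difference of the two proportions. The following is a minimal sketch on simulated labels; the sample sizes and the class probability 0.3 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
y_in = rng.binomial(1, 0.3, size=60)    # inbag labels in a node (toy data)
y_oob = rng.binomial(1, 0.3, size=25)   # OOB labels in the same node (toy data)

p_in, p_oob = y_in.mean(), y_oob.mean()

# "Variance" of the OOB labels measured around the inbag mean ...
lhs = np.mean((y_oob - p_in) ** 2)
# ... equals binomial variance plus the squared gap between the proportions.
rhs = p_oob * (1 - p_oob) + (p_oob - p_in) ** 2

print(lhs, rhs)
```

The two printed values agree up to floating-point error for any choice of labels, since the decomposition is an algebraic identity rather than a statistical approximation.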
Lastly, we now show that \(PG_{oob}^{(0.5,1)}\) is equivalent to the unbiased split-improvement measure defined in [26]. For the binary classification case, we can rewrite \(PG_{oob}^{(0.5,1)}\) as follows:
A.3 Variance Reduction View
Here, we provide a full version of the proof sketched in Sect. A.2 which leans heavily on the proof of Proposition (1) in [8].
We consider the usual variance reduction per node but with a “variance” defined as the mean squared deviations of \(y_{oob}\) from the inbag mean \(\mu _{in}\):
where the last equality is for Bernoulli \(y_i\), in which case the means \(\mu _{in/oob}\) become proportions \(p_{in/oob}\) and we replace the squared deviations with the binomial variance \(p_{oob} \cdot (1- p_{oob} )\). The final expression is then
which, of course, is exactly the impurity reduction due to \(PG_{oob}^{(1,2)}\):
Another, somewhat surprising view of MDI is given by Eqs. (6) and (4), which for binary classification reads as:
and for the oob version:
A.4 \(\mathbf {E(\varDelta \widehat{PG}_{oob}^{(0)}) = 0}\)
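The decrease in impurity (\(\varDelta G\)) for a parent node m is the weighted difference between the Gini impurity (see footnote 4) \(G(m) = \hat{p}_m (1- \hat{p}_m )\) and those of its left and right children, \(m_l\) and \(m_r\):
\[\varDelta G(m) = G(m) - \frac{N_{m_l}}{N_m}\, G(m_l) - \frac{N_{m_r}}{N_m}\, G(m_r) ,\]
where \(N_m\), \(N_{m_l}\) and \(N_{m_r}\) denote the sample sizes of the parent and its two children.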
We assume that the node m splits on an uninformative variable \(X_j\), i.e. \(X_j\) and Y are independent.
We will use the short notation \(\sigma ^2_{m, \cdot } \equiv p_{m,\cdot } (1-p_{m,\cdot })\), where the dot stands for either oob or in, and rely on the following facts and notation:
1. \(E[\hat{p}_{m, oob}] = p_{m,oob}\) is the “population” proportion of the class label in the OOB test data (of node m).
2. \(E[\hat{p}_{m, in}] = p_{m,in}\) is the “population” proportion of the class label in the inbag test data (of node m).
3. \(E[\hat{p}_{m, oob}] = E[\hat{p}_{m_l, oob}] = E[\hat{p}_{m_r, oob}] =p_{m,oob}\)
4. \(E[\hat{p}_{m, oob}^2] = var(\hat{p}_{m, oob}) + E[\hat{p}_{m, oob}]^2 = \sigma ^2_{m, oob}/N_m + p_{m,oob}^2\)
   \(\Rightarrow E[G_{oob}(m)] = E[\hat{p}_{m, oob}] - E[\hat{p}_{m, oob}^2] = \sigma ^2_{m, oob} \cdot \left( 1- \frac{1}{N_m}\right) \)
   \(\Rightarrow E[\widehat{G}_{oob}(m)] = \sigma ^2_{m, oob}\)
5. \(E[\hat{p}_{m, oob} \cdot \hat{p}_{m, in}] = E[\hat{p}_{m, oob}] \cdot E[\hat{p}_{m, in}] = p_{m,oob} \cdot p_{m,in}\)
Equalities 3 and 5 hold because of the independence of the inbag and out-of-bag data as well as the independence of \(X_j\) and Y.
We now show that \(\mathbf {E(\varDelta PG_{oob}^{(0)}) \ne 0}\). We use the shorter notation \(G_{oob} = PG_{oob}^{(0)}\):
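Using the expectation of \(G_{oob}(m)\) stated above, the fact that parent and children share the same population proportion under an uninformative split, and assuming that the child sample sizes \(N_{m_l}\) and \(N_{m_r}\) (symbols introduced here for illustration) add up to \(N_m\), the expectation can be worked out as a sketch:

\[\begin{aligned} E[\varDelta G_{oob}(m)] &= E[G_{oob}(m)] - \frac{N_{m_l}}{N_m} E[G_{oob}(m_l)] - \frac{N_{m_r}}{N_m} E[G_{oob}(m_r)] \\ &= \sigma ^2_{m, oob} \left[ \left( 1-\frac{1}{N_m}\right) - \frac{N_{m_l}}{N_m}\left( 1-\frac{1}{N_{m_l}}\right) - \frac{N_{m_r}}{N_m}\left( 1-\frac{1}{N_{m_r}}\right) \right] \\ &= \sigma ^2_{m, oob} \left[ 1 - \frac{1}{N_m} - \frac{N_{m_l}+N_{m_r}}{N_m} + \frac{2}{N_m}\right] = \frac{\sigma ^2_{m, oob}}{N_m} \end{aligned}\]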
We see that a bias remains if we use only OOB data, and that it becomes more pronounced for nodes with smaller sample sizes. This is relevant because visualizations of random forests show that splitting on uninformative variables happens most frequently for “deeper” nodes.
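A small Monte Carlo experiment can illustrate the size of this bias. The sketch below is an illustration under simplifying assumptions, not the simulation setup of the paper: labels in a node are i.i.d. Bernoulli(p), the split is made independently of the labels with fixed child sizes, and the function names `gini` and `mean_decrease` are hypothetical. It also evaluates the impurity scaled by \(N/(N-1)\), which is the correction implied by the expectations listed above.

```python
import numpy as np

def gini(y, corrected=False):
    """Row-wise p_hat * (1 - p_hat) (multiplier 2 omitted).

    With corrected=True the value is scaled by n / (n - 1).
    """
    n = y.shape[1]
    p_hat = y.mean(axis=1)
    g = p_hat * (1 - p_hat)
    return g * n / (n - 1) if corrected else g

def mean_decrease(p=0.3, n=10, n_left=4, reps=200_000, corrected=False, seed=0):
    """Average impurity decrease over many uninformative splits of a node."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, p, size=(reps, n))       # labels falling into node m
    left, right = y[:, :n_left], y[:, n_left:]   # split is independent of y
    w_left, w_right = n_left / n, (n - n_left) / n
    delta = (gini(y, corrected)
             - w_left * gini(left, corrected)
             - w_right * gini(right, corrected))
    return delta.mean()

bias_plain = mean_decrease(corrected=False)      # theory: p * (1 - p) / n = 0.021
bias_corrected = mean_decrease(corrected=True)   # theory: 0
print(bias_plain, bias_corrected)
```

Reducing `n` in this sketch increases the average decrease of the uncorrected impurity, in line with the observation that small nodes are affected most.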
The above bias is due to the well known bias in variance estimation, which can be eliminated with the bias correction, as outlined in the main text. We now show that the bias for this modified Gini impurity is zero for OOB data. As before, \(\widehat{G}_{oob} = \widehat{PG}_{oob}^{(0)}\):
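Since the corrected impurity has expectation \(\sigma ^2_{m, oob}\) in the parent and, under an uninformative split, in both children, the calculation reduces to a few terms (a sketch, again assuming \(N_{m_l}+N_{m_r}=N_m\)):

\[\begin{aligned} E[\varDelta \widehat{G}_{oob}(m)] &= E[\widehat{G}_{oob}(m)] - \frac{N_{m_l}}{N_m} E[\widehat{G}_{oob}(m_l)] - \frac{N_{m_r}}{N_m} E[\widehat{G}_{oob}(m_r)] \\ &= \sigma ^2_{m, oob} \left[ 1 - \frac{N_{m_l}+N_{m_r}}{N_m}\right] = 0 \end{aligned}\]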
