Abstract
This paper examines the reproducibility of learned explanations for black-box predictions via model distillation using classification trees. We find that common tree distillation methods fail to reproduce a single stable explanation when applied to the same teacher model, due to the randomness of the distillation process. We study this issue of reliable interpretation and propose a standardized framework for tree distillation that achieves reproducibility. The proposed framework consists of (1) a statistical test to stabilize tree splits, and (2) a stopping rule for tree building when using a teacher that provides an estimate of the uncertainty of its predictions, e.g., random forests. We demonstrate the empirical performance of the proposed distillation method on a variety of synthetic and real-world datasets.

Data Availability Statement
See Appendix B.1.
Code Availability
All empirical study code files are available at https://github.com/siriuz42/approximation_trees.git.
References
Angelino E, Larus-Stone N, Alabi D, Seltzer M, Rudin C (2018) Learning certifiably optimal rule lists for categorical data. J Mach Learn Res 18(234):1–78
Athey S, Tibshirani J, Wager S (2016) Generalized random forests. arXiv preprint arXiv:1610.01271
Augasta MG, Kathirvalavakumar T (2012) Reverse engineering the neural networks for rule extraction in classification problems. Neural Process Lett 35(2):131–150
Banerjee M, McKeague IW et al (2007) Confidence sets for split points in decision trees. Ann Stat 35(2):543–574
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol 57(1):289–300
Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Breiman L, Shang N (1996) Born again trees. Technical report, University of California, Berkeley, Berkeley, CA
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Boca Raton
Bucila C, Caruana R, Niculescu-Mizil A (2006) Model compression: making big, slow models practical. In: Proceedings of the 12th international conference on knowledge discovery and data mining (KDD’06)
Buciluǎ C, Caruana R, Niculescu-Mizil A (2006) Model compression. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 535–541
Chipman HA, George EI, McCulloch RE et al (2010) Bart: Bayesian additive regression trees. Ann Appl Stat 4(1):266–298
Chouldechova A (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2):153–163
Craven MW, Shavlik JW (1995) Extracting tree-structured representations of trained networks. In: NIPS
Du Croz JJ, Higham NJ (1992) Stability of methods for matrix inversion. IMA J Numer Anal 12(1):1–19
Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 50(272):1096–1121
Ghosal I, Hooker G (2018) Boosting random forests to reduce bias; one-step boosted forest and its variance estimate. arXiv preprint arXiv:1803.08000
Gibbons RD, Hooker G, Finkelman MD, Weiss DJ, Pilkonis PA, Frank E, Moore T, Kupfer DJ (2013) The computerized adaptive diagnostic test for major depressive disorder (cad-mdd): a screening tool for depression. J Clin Psychiatry 74(7):1–478
Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: a survey. Int J Comput Vis 129(6):1789–1819
He H, Eisner J, Daume H (2012) Imitation learning by coaching. In: Advances in neural information processing systems, pp 3149–3157
Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
Hu L, Chen J, Nair VN, Sudjianto A (2018) Locally interpretable models and effects based on supervised partitioning (LIME-SUP). arXiv preprint arXiv:1806.00663
Johansson U, Niklasson L (2009) Evolving decision trees using oracle guides. In: IEEE symposium on computational intelligence and data mining, 2009 (CIDM’09). IEEE, pp 238–244
Johansson U, Sönströd C, Löfström T (2010) Oracle coached decision trees and lists. In: International symposium on intelligent data analysis. Springer, pp 67–78
Johansson U, Sönströd C, Löfström T (2011) One tree to explain them all. In: 2011 IEEE congress on evolutionary computation (CEC). IEEE, pp 1444–1451
Krishnan R, Sivakumar G, Bhattacharya P (1999) Extracting decision trees from trained neural networks. Pattern Recognit 32(12):66
Last M, Maimon O, Minkov E (2002) Improving stability of decision trees. Int J Pattern Recognit Artif Intell 16(02):145–159
Li RH, Belford GG (2002) Instability of decision tree classification algorithms. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, pp 570–575
Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml
Lucas D, Klein R, Tannahill J, Ivanova D, Brandon S, Domyancic D, Zhang Y (2013) Failure analysis of parameter-induced simulation crashes in climate models. Geosci Model Dev 6(4):1157–1171
Maddox WJ, Izmailov P, Garipov T, Vetrov DP, Wilson AG (2019) A simple baseline for Bayesian uncertainty in deep learning. In: Advances in neural information processing systems, pp 13153–13164
Mangasarian OL, Street WN, Wolberg WH (1995) Breast cancer diagnosis and prognosis via linear programming. Oper Res 43(4):570–577
Mentch L, Hooker G (2016) Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res 17(1):841–881
Mentch L, Hooker G (2017) Formal hypothesis tests for additive structure in random forests. J Comput Graph Stat 26(3):589–597
Mobahi H, Farajtabar M, Bartlett PL (2020) Self-distillation amplifies regularization in Hilbert space. arXiv preprint arXiv:2002.05715
Peng W, Coleman T, Mentch L (2019) Asymptotic distributions and rates of convergence for random forests via generalized u-statistics. arXiv preprint arXiv:1905.10651
Pham H, Dai Z, Xie Q, Le QV (2021) Meta pseudo labels. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11557–11568
Quinlan JR (1987) Generating production rules from decision trees. In: IJCAI, vol 87, pp 304–307. Citeseer
Quinlan JR (2014) C4.5: programs for machine learning. Elsevier
Ribeiro MT, Singh S, Guestrin C (2016) Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1135–1144
Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1(5):206–215
Sexton J, Laake P (2009) Standard errors for bagged and random forest estimators. Comput Stat Data Anal 53(3):801–811
Strecht P (2015) A survey of merging decision trees data mining approaches. In: Proceedings of the 10th doctoral symposium in informatics engineering, pp 36–47
Wachter S, Mittelstadt B, Russell C (2017) Counterfactual explanations without opening the black box: automated decisions and the gdpr. Harv J Law Technol 31:841
Wager S, Athey S (2017) Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc 6:66
Wager S, Hastie T, Efron B (2014) Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 15(1):1625–1651
Wang F, Rudin C (2015) Falling rule lists. In: Artificial intelligence and statistics, pp 1013–1022
Wolberg WH, Mangasarian OL (1990) Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci 87(23):9193–9196
Yuan L, Tay FE, Li G, Wang T, Feng J (2020) Revisiting knowledge distillation via label smoothing regularization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3903–3911
Zhang J (1992) Selecting typical instances in instance-based learning. In: Machine learning proceedings. Elsevier, pp 470–479
Zhou ZH (2019) Ensemble methods: foundations and algorithms. Chapman and Hall/CRC
Zhou Y, Hooker G (2018) Boulevard: regularized stochastic gradient boosted trees and their limiting distribution. ArXiv e-prints
Zhou Z, Mentch L, Hooker G (2021) V-statistics and variance estimation. J Mach Learn Res 22(287):1–48
Acknowledgements
The authors gratefully acknowledge National Institutes of Health (NIH) grant R03DA036683 and National Science Foundation (NSF) grants DMS-1053252 and DEB-1353039. The authors would like to thank Robert Gibbons for providing the CAD-MDD data.
Funding
NIH R03DA036683, NSF DMS-1053252 and NSF DEB-1353039.
Ethics declarations
Conflicts of interest
The authors of the paper have affiliations with Cornell University, the University of California, Berkeley, Google and Meta.
Additional information
Responsible editor: Martin Atzmueller, Johannes Fürnkranz, Tomáš Kliegr and Ute Schmid.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Limiting distribution of Gini index
Following the discussion, we now move to analyze a pseudo sample. Using similar notation, we denote the numbers of sample points and the probabilities over class labels in each child by, for \(p \in \{1,2\}, j \in \{1,\dots ,k\},\)
For simplicity we write, for \(p \in \{1,2\},q \in \{0,1\},\) \( N^p_q = ( n^p_q{\hat{\theta }}^p_{q,1}, \dots , n^p_q{\hat{\theta }}^p_{q,k} )^T. \) Employing a multivariate CLT we obtain
To relate this limiting distribution to the difference of Gini indices we shall employ the \(\delta \)-method. Consider the analytic function \(f: {\mathbb {R}}^{4k} \rightarrow {\mathbb {R}}\) s.t.
The \(\delta \)-method implies that
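In generic notation, writing \(Z_n\) for a suitably normalized version of the stacked \(4k\)-dimensional vector of the \(N^p_q\) above, with limiting mean \(\mu \) and covariance \(\Sigma \) (placeholders rather than the symbols used elsewhere), the \(\delta \)-method step takes the standard form: if \(\sqrt{n}\,(Z_n - \mu ) \overset{d}{\rightarrow } {\mathcal {N}}(0, \Sigma )\) and \(f\) is differentiable at \(\mu \), then
$$\begin{aligned} \sqrt{n}\,\big (f(Z_n) - f(\mu )\big ) \overset{d}{\rightarrow } {\mathcal {N}}\big (0,\; \nabla f(\mu )^T \Sigma \, \nabla f(\mu )\big ). \end{aligned}$$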
We should point out that, while (9) provides us with the CLT, we need to assess the difference between two Gini indices. After expanding (9),
or approximately,
Appendix B: Additional results on AppTree
1.1 Real datasets
In this section we show the results of our method on eight datasets. Seven of them are available from the UCI repository (Lichman 2013) and one is the CAD-MDD data used by Gibbons et al. (2013). We manually split each dataset into training and test sets for validation. Table 1 in the main body shows the number of covariates, the training and test sample sizes, and the response levels for each dataset.
To set the generative distribution of the covariates before running our algorithm, we perturb the empirical distribution by Gaussian noise whose variance is approximately 1/50 of the range of the corresponding covariate. For discrete covariates, the probability of jumping to a neighboring category is set to 1/7.
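As a concrete illustration, the following is a minimal sketch of such a covariate generator (the function and variable names are ours, not the released code; discrete covariates are assumed to be numerically encoded with ordered levels):

```python
import numpy as np

def make_covariate_sampler(X_train, discrete_cols, levels, rng=None):
    """Sampler for F_X: resample training rows, add Gaussian noise with
    variance = range/50 to continuous covariates, and move discrete
    covariates to a neighboring category with probability 1/7."""
    rng = rng if rng is not None else np.random.default_rng()
    X_train = np.asarray(X_train, dtype=float)
    ranges = X_train.max(axis=0) - X_train.min(axis=0)
    noise_sd = np.sqrt(ranges / 50.0)  # variance ~ range / 50

    def sample(n):
        X = X_train[rng.integers(0, len(X_train), size=n)].copy()
        for j in range(X.shape[1]):
            if j in discrete_cols:
                jump = rng.random(n) < 1.0 / 7.0
                lv = np.asarray(levels[j])                 # sorted category values
                idx = np.searchsorted(lv, X[jump, j])
                step = rng.choice([-1, 1], size=int(jump.sum()))
                X[jump, j] = lv[np.clip(idx + step, 0, len(lv) - 1)]
            else:
                X[:, j] += rng.normal(0.0, noise_sd[j], size=n)
        return X

    return sample
```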
We compare four methods here: classification trees (CART), random forests (RF), our proposed approximation tree (AppTree), and a baseline method (BASE). Previous work (Johansson and Niklasson 2009; Johansson et al. 2010) fixes the number of pseudo samples drawn from the oracle during the coaching procedure. Analogously, we set BASE to be a non-adaptive version of AppTree that requests a pseudo sample from the RF only once, at the root node, and uses it all the way down. The pseudo sample size is set to 9 times the size of the training data; this is larger than the samples used by Johansson and Niklasson (2009) and Johansson et al. (2010), and is intended as a reasonable blind choice made without any prior information.
We use the same settings for all datasets for consistency. For each dataset, we train an RF containing 200 trees and a CART tree, then 100 AppTrees and 100 BASE trees approximating the RF. We set \(N_{ps} = 500,000\), meaning each node of AppTree can generate at most \(5\times 10^5\) pseudo examples to decide its split. CART, BASE and AppTree all grow to the 6th layer, including the root. The confidence level \(\alpha \) is set to 0.1.
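A schematic of this protocol is sketched below; the AppTree and BaseTree constructors are hypothetical stand-ins for our implementation, not a released API:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def run_protocol(X_train, y_train, n_reps=100):
    # Teacher: a 200-tree random forest.
    rf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    # Plain CART grown to the 6th layer (root = layer 1, so max_depth = 5).
    cart = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
    # 100 AppTrees (adaptive pseudo sampling, cap N_ps, confidence level 0.1)
    # and 100 BASE trees (one pseudo sample of size 9 * n drawn at the root only).
    # AppTree and BaseTree are hypothetical stand-ins for our implementation.
    app = [AppTree(teacher=rf, max_depth=5, n_ps=500_000, alpha=0.1).fit(X_train)
           for _ in range(n_reps)]
    base = [BaseTree(teacher=rf, max_depth=5, n_pseudo=9 * len(X_train)).fit(X_train)
            for _ in range(n_reps)]
    return rf, cart, app, base
```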
1.2 Binary classification
Figure 12 shows the evaluation of the methods on binary classification datasets. Johansson et al. (2011) pointed out that a single decision tree is already capable of mimicking an oracle predictor to make highly accurate predictions, provided the oracle is not overly complicated. Our simulation shows similar results: all three methods, CART, BASE and AppTree, tightly follow the ROC curves of the RF, with no significant difference among them. Fidelity is measured as the frequency with which a model agrees with the RF when predicting on the same input covariates. We use a moving threshold as the classification boundary, and evaluate the fidelity on both the testing data and the new data (marked “test” and “new” in the plot). First, all three 6-layer trees agree with the RF with over 80% probability for almost any given classification threshold, and there is no significant difference among them. Further, the “test” and “new” curves behave similarly, which lends support to our generative covariate distribution. While an overall fidelity of 80% may not seem strong enough to align these trees with the oracle RF, the trees can be built deeper to better “overfit” the RF.
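For concreteness, a minimal sketch of this fidelity measure for a binary task, assuming sklearn-style `predict_proba` methods on both models (the function name is ours):

```python
import numpy as np

def fidelity(student, teacher, X, threshold=0.5):
    """Fraction of points where the student and the teacher make the same
    class call when both threshold P(y = 1 | x) at `threshold`."""
    s = student.predict_proba(X)[:, 1] >= threshold
    t = teacher.predict_proba(X)[:, 1] >= threshold
    return float(np.mean(s == t))
```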
1.3 Multiclass classification
Figures 13 and 14 show performance on the three multiclass classification datasets. We observe patterns similar to the binary case, in that all three tree methods work similarly. The ROC curves of the RF show poorer performance than in binary classification, and the ROC curves of the three tree methods are further from those of the random forests, suggesting that deeper trees might be necessary. Nonetheless, all tree methods again agree with the RF on about 80% of the predictions made by the RF on the test data. We have therefore shown that the reproducibility requirement of AppTree does not undermine its predictive power or its fidelity to the black box when compared with other tree methods.
1.4 Impact of \(N_{ps}\) on distillation accuracy and fidelity
Table 3 shows that the student’s accuracy and fidelity remain essentially unchanged as \(N_{ps}\) varies. This behavior is expected from the distillation procedure: in this setting, \(N_{ps}\) plays solely the role of a reproducibility hyperparameter.
1.5 Smoothing the sampler
Throughout we have assumed that \(F_X\)—the distribution for generating covariates during approximation tree construction—is fixed by the practitioner. This allows differing modeling goals to influence its choice: we might want to ensure coverage of the whole space of possible feature values, or instead to localize to a particular patient of interest, an approach not dissimilar to LIME (Ribeiro et al. 2016).
In most of our empirical studies, we have employed a Gaussian kernel smoother of the training data as a means of focusing our approximation near the feature values of interest while providing broader coverage of the teacher’s behavior. This choice is subject to two potential critiques: that the training data we use to produce \(F_X\) is itself a source of uncertainty, and that by simply specifying \(F_X\) to be the empirical distribution of the training data we could remove all of the variability associated with it.
Here we conduct an experiment showing that the Gaussian kernel smoothing both produces greater distillation fidelity to the teacher and reduces the variability inherited from the training data distribution, relative to simply using the training data. In particular, we use the following response functions:
-
Setup 1: \(X \sim \text{ Unif }[-1, 1]^3\),
$$\begin{aligned} \text{ logit }P(y=1|x) = -0.5\cdot \mathbb {1}_{\{x_1 < 0\}} + 0.5 \cdot \sin (\pi x_2 / 2)\cdot \mathbb {1}_{\{x_1 > 0\}}. \end{aligned}$$
-
Setup 2: \(X \sim \text{ Unif }[-1, 1]^5\),
$$\begin{aligned} \text{ logit }P(y=1|x) = -0.5(1+x_3x_4)\cdot \mathbb {1}_{\{x_1 < 0\}} + 0.5 \cdot \big (\sin (\pi x_2 / 2) + \sin (\pi x_5 / 2)\big )\cdot \mathbb {1}_{\{x_1 > 0\}}. \end{aligned}$$
We repeat 100 experiments in each of the two settings, where for each experiment we bootstrap a new sample of 1000 data points from a uniform distribution and either build a standard CART using the new sample and the teacher’s responses, or apply the approximation tree algorithm with a kernel smoother using bandwidth \(\sigma = 0.2\). Figure 15 confirms that employing the smoother contributes to both reproducibility and faithfulness to the teacher model.
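A minimal sketch of the two data-generating setups above (function names are ours, not from the released code):

```python
import numpy as np

def logit_setup1(X):
    return (-0.5 * (X[:, 0] < 0)
            + 0.5 * np.sin(np.pi * X[:, 1] / 2) * (X[:, 0] > 0))

def logit_setup2(X):
    return (-0.5 * (1 + X[:, 2] * X[:, 3]) * (X[:, 0] < 0)
            + 0.5 * (np.sin(np.pi * X[:, 1] / 2)
                     + np.sin(np.pi * X[:, 4] / 2)) * (X[:, 0] > 0))

def simulate(n, d, logit_fn, rng=None):
    """Draw X ~ Unif[-1, 1]^d and y ~ Bernoulli(sigmoid(logit_fn(X)))."""
    rng = rng if rng is not None else np.random.default_rng()
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    p = 1.0 / (1.0 + np.exp(-logit_fn(X)))
    y = (rng.random(n) < p).astype(int)
    return X, y
```

Each repetition then draws a fresh sample of 1000 points with simulate and fits either a standard CART or the approximation tree to the teacher’s responses, as described above.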
Fig. 15 Comparison between the bootstrap sampler and the Gaussian kernel smoother in approximation trees when the distillation is given a bootstrap of the original sample distribution. LHS: unique tree structures in two simple synthetic settings; the kernel smoother outputs are much more stable. In each column, a single black bar represents a unique tree structure, and the height of the bar gives the number of occurrences of that structure out of 100 repetitions. RHS: prediction accuracy with respect to the teacher model
1.6 Reproducibility of approximation tree for neural networks
Most of our experiments are conducted with random forests, which come with statistical estimates of sampling variability to help guide the stopping rule. However, the approximation tree can be used with any model type if we focus only on the first objective of ensuring that the distillation process reliably produces the same splits. In this part we present some results when the teacher model is a neural network.
We use the Breast Cancer Data for demonstration purposes; see Sect. 5.4.2 for details. Here we build a neural network with one hidden layer of size 5. Figure 16 shows the reproducibility of the top 4 layers of AppTree for different values of \(N_{ps}\), the cap on the maximal number of pseudo examples AppTree can generate.
We observe a similar pattern as in Fig. 3: increasing \(N_{ps}\) can improve reproducibility, but values larger than those often employed may be needed to make this satisfactory. Compared to random forests as teacher models (Fig. 9), the neural network produces a slightly more stable BASE. This is because the Breast Cancer Data is a relatively simple classification task for a neural network, which is able to produce class probabilities close to 0 or 1. Our approach can still improve the reproducibility of the approximation, resulting in fewer distinct structures.
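As an illustration, a minimal sketch of treating a small neural network as the teacher, assuming Breast Cancer training arrays X_train, y_train; MLPClassifier stands in for the one-hidden-layer network of size 5, while AppTree and the pseudo-label sampler are hypothetical stand-ins rather than our released API:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Teacher: one hidden layer of size 5.
nn_teacher = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000).fit(X_train, y_train)

def pseudo_labels(teacher, X_pseudo, rng):
    """One plausible way to label pseudo examples: sample hard labels from the
    teacher's predicted class probabilities."""
    p = teacher.predict_proba(X_pseudo)
    return np.array([rng.choice(p.shape[1], p=row) for row in p])

# Only predict_proba is required of the teacher, so the same (hypothetical)
# AppTree construction applies with the neural network in place of an RF;
# max_depth=3 corresponds to the top 4 layers including the root.
app_tree = AppTree(teacher=nn_teacher, max_depth=3, n_ps=50_000, alpha=0.1).fit(X_train)
```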
Fig. 16 Reproducibility of AppTree with different \(N_{ps}\) values and a neural network teacher. The top 4 layers of the trees are summarized for \(N_{ps}=5{,}000\) and \(50{,}000\), respectively. In each column, a single black bar represents a unique tree structure, and the height of the bar gives the number of occurrences of that structure out of 100 repetitions
Appendix C: Approximation tree for CAD-MDD data
See Table 4 for full details of the approximation tree for CAD-MDD data.
Appendix D: Approximation tree for Breast Cancer Data
See Tables 5 and 6 for full details of the approximation tree for Breast Cancer Data. The data set used has 8 attributes and 699 samples.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.