Approximation trees: statistical reproducibility in model distillation

Abstract

This paper examines the reproducibility of learned explanations for black-box predictions via model distillation using classification trees. We find that common tree distillation methods fail to reproduce a single stable explanation when applied to the same teacher model, due to the randomness of the distillation process. We study this issue of reliable interpretation and propose a standardized framework for tree distillation that achieves reproducibility. The proposed framework consists of (1) a statistical test to stabilize tree splits, and (2) a stopping rule for tree building when using a teacher that provides an estimate of the uncertainty of its predictions, e.g., random forests. We demonstrate the empirical performance of the proposed distillation method on a variety of synthetic and real-world datasets.

Data Availability Statement

See Appendix B.1.

Code Availability

All empirical study code files are available at https://github.com/siriuz42/approximation_trees.git.

Notes

  1. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

  2. https://www.rdocumentation.org/packages/mlbench/versions/2.1-3/topics/BreastCancer.

References

  • Angelino E, Larus-Stone N, Alabi D, Seltzer M, Rudin C (2018) Learning certifiably optimal rule lists for categorical data. J Mach Learn Res 18(234):1–78

  • Athey S, Tibshirani J, Wager S (2016) Generalized random forests. arXiv preprint arXiv:1610.01271

  • Augasta MG, Kathirvalavakumar T (2012) Reverse engineering the neural networks for rule extraction in classification problems. Neural Process Lett 35(2):131–150

  • Banerjee M, McKeague IW et al (2007) Confidence sets for split points in decision trees. Ann Stat 35(2):543–574

  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57(1):289–300

  • Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Breiman L, Shang N (1996) Born again trees. University of California, Berkeley, Berkeley, CA, Technical Report 1(2):4

  • Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Boca Raton

  • Bucila C, Caruana R, Niculescu-Mizil A (2006) Model compression: making big, slow models practical. In: Proceedings of the 12th international conference on knowledge discovery and data mining (KDD’06)

  • Buciluǎ C, Caruana R, Niculescu-Mizil A (2006) Model compression. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 535–541

  • Chipman HA, George EI, McCulloch RE et al (2010) BART: Bayesian additive regression trees. Ann Appl Stat 4(1):266–298

  • Chouldechova A (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2):153–163

  • Craven MW, Shavlik JW (1995) Extracting tree-structured representations of trained networks. In: NIPS

  • Croz JJD, Higham NJ (1992) Stability of methods for matrix inversion. IMA J Numer Anal 12(1):1–19

  • Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 50(272):1096–1121

  • Ghosal I, Hooker G (2018) Boosting random forests to reduce bias; one-step boosted forest and its variance estimate. arXiv preprint arXiv:1803.08000

  • Gibbons RD, Hooker G, Finkelman MD, Weiss DJ, Pilkonis PA, Frank E, Moore T, Kupfer DJ (2013) The computerized adaptive diagnostic test for major depressive disorder (cad-mdd): a screening tool for depression. J Clin Psychiatry 74(7):1–478

  • Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: a survey. Int J Comput Vis 129(6):1789–1819

  • He H, Eisner J, Daume H (2012) Imitation learning by coaching. In: Advances in neural information processing systems, pp 3149–3157

  • Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  • Hu L, Chen J, Nair VN, Sudjianto A (2018) Locally interpretable models and effects based on supervised partitioning (LIME-SUP). ArXiv e-prints arXiv:1806.00663

  • Johansson U, Niklasson L (2009) Evolving decision trees using oracle guides. In: IEEE symposium on computational intelligence and data mining, 2009 (CIDM’09). IEEE, pp 238–244

  • Johansson U, Sönströd C, Löfström T (2010) Oracle coached decision trees and lists. In: International symposium on intelligent data analysis. Springer, pp 67–78

  • Johansson U, Sönströd C, Löfström T (2011) One tree to explain them all. In: 2011 IEEE congress on evolutionary computation (CEC). IEEE, pp 1444–1451

  • Krishnan R, Sivakumar G, Bhattacharya P (1999) Extracting decision trees from trained neural networks. Pattern Recognit 32(12):66

  • Last M, Maimon O, Minkov E (2002) Improving stability of decision trees. Int J Pattern Recognit Artif Intell 16(02):145–159

  • Li RH, Belford GG (2002) Instability of decision tree classification algorithms. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, pp 570–575

  • Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Lucas D, Klein R, Tannahill J, Ivanova D, Brandon S, Domyancic D, Zhang Y (2013) Failure analysis of parameter-induced simulation crashes in climate models. Geosci Model Dev 6(4):1157–1171

  • Maddox WJ, Izmailov P, Garipov T, Vetrov DP, Wilson AG (2019) A simple baseline for Bayesian uncertainty in deep learning. In: Advances in neural information processing systems, pp 13153–13164

  • Mangasarian OL, Street WN, Wolberg WH (1995) Breast cancer diagnosis and prognosis via linear programming. Oper Res 43(4):570–577

  • Mentch L, Hooker G (2016) Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res 17(1):841–881

  • Mentch L, Hooker G (2017) Formal hypothesis tests for additive structure in random forests. J Comput Graph Stat 26(3):589–597

  • Mobahi H, Farajtabar M, Bartlett PL (2020) Self-distillation amplifies regularization in Hilbert space. arXiv preprint arXiv:2002.05715

  • Peng W, Coleman T, Mentch L (2019) Asymptotic distributions and rates of convergence for random forests via generalized u-statistics. arXiv preprint arXiv:1905.10651

  • Pham H, Dai Z, Xie Q, Le QV (2021) Meta pseudo labels. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11557–11568

  • Quinlan JR (1987) Generating production rules from decision trees. In: IJCAI, vol 87, pp 304–307. Citeseer

  • Quinlan JR (2014) C4.5: programs for machine learning. Elsevier

  • Ribeiro MT, Singh S, Guestrin C (2016) Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1135–1144

  • Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1(5):206–215

  • Sexton J, Laake P (2009) Standard errors for bagged and random forest estimators. Comput Stat Data Anal 53(3):801–811

  • Strecht P (2015) A survey of merging decision trees data mining approaches. In: Proceedings of the 10th doctoral symposium in informatics engineering, pp 36–47

  • Wachter S, Mittelstadt B, Russell C (2017) Counterfactual explanations without opening the black box: automated decisions and the gdpr. Harv J Law Technol 31:841

  • Wager S, Athey S (2017) Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc 6:66

  • Wager S, Hastie T, Efron B (2014) Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 15(1):1625–1651

  • Wang F, Rudin C (2015) Falling rule lists. In: Artificial intelligence and statistics, pp 1013–1022

  • Wolberg WH, Mangasarian OL (1990) Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci 87(23):9193–9196

  • Yuan L, Tay FE, Li G, Wang T, Feng J (2020) Revisiting knowledge distillation via label smoothing regularization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3903–3911

  • Zhang J (1992) Selecting typical instances in instance-based learning. In: Machine learning proceedings. Elsevier, pp 470–479

  • Zhou ZH (2019) Ensemble methods: foundations and algorithms. Chapman and Hall/CRC

  • Zhou Y, Hooker G (2018) Boulevard: regularized stochastic gradient boosted trees and their limiting distribution. ArXiv e-prints

  • Zhou Z, Mentch L, Hooker G (2021) V-statistics and variance estimation. J Mach Learn Res 22(287):1–48

Acknowledgements

The authors gratefully acknowledge National Institutes of Health (NIH) grant R03DA036683 and National Science Foundation (NSF) grants DMS-1053252 and DEB-1353039. The authors would like to thank Robert Gibbons for providing the CAD-MDD data.

Funding

NIH R03DA036683, NSF DMS-1053252 and NSF DEB-1353039.

Author information

Corresponding author

Correspondence to Zhengze Zhou.

Ethics declarations

Conflicts of interest

The authors of the paper have affiliations with Cornell University, the University of California, Berkeley, Google and Meta.

Additional information

Responsible editor: Martin Atzmueller, Johannes Fürnkranz, Tomáš Kliegr and Ute Schmid.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Limiting distribution of Gini index

Following the discussion above, we now analyze a pseudo sample. Using similar notation, we denote the numbers of sample points and the class-label probabilities in each child by, for \(p \in \{1,2\}, j \in \{1,\dots ,k\},\)

$$\begin{aligned} n^p_0 = \sum _{i=1}^n \mathbb {1}_{\{G_p(X_i)=0\}}, \quad {\hat{\theta }}^p_{0,j} = \frac{1}{n^p_0} \sum _{i=1}^n Y_i^j\cdot \mathbb {1}_{\{G_p(X_i)=0\}}, \\ n^p_1 = \sum _{i=1}^n \mathbb {1}_{\{G_p(X_i)=1\}}, \quad {\hat{\theta }}^p_{1,j} = \frac{1}{n^p_1} \sum _{i=1}^n Y_i^j\cdot \mathbb {1}_{\{G_p(X_i)=1\}}. \end{aligned}$$

For simplicity we write, for \(p \in \{1,2\},q \in \{0,1\},\) \( N^p_q = ( n^p_q{\hat{\theta }}^p_{q,1}, \dots , n^p_q{\hat{\theta }}^p_{q,k} )^T. \) Employing a multivariate CLT we obtain

$$\begin{aligned} \sqrt{n} \left( \frac{1}{n}\begin{bmatrix} N^1_0 \\ N^1_1 \\ N^2_0 \\ N^2_1 \end{bmatrix} - \begin{bmatrix} \varTheta ^1_0 \\ \varTheta ^1_1 \\ \varTheta ^2_0 \\ \varTheta ^2_1 \end{bmatrix} \right) \longrightarrow N(0, \varSigma ). \end{aligned}$$

To relate this limiting distribution to the difference of Gini indices we shall employ the \(\delta \)-method. Consider the analytic function \(f: {\mathbb {R}}^{4k} \rightarrow {\mathbb {R}}\) s.t.

$$\begin{aligned} f(x_1, \dots , x_{4k}) =&-\frac{1}{\pi ^1_0}\sum _{i=1}^k x_i^2 -\frac{1}{\pi ^1_1} \sum _{i=k+1}^{2k} x_i^2 \\&+\frac{1}{\pi ^2_0} \sum _{i=2k+1}^{3k} x_i^2 +\frac{1}{\pi ^2_1} \sum _{i=3k+1}^{4k} x_i^2. \end{aligned}$$

The \(\delta \)-method implies that

$$\begin{aligned} \sqrt{n} \left( f\left( \frac{1}{n}\begin{bmatrix} N^1_0 \\ N^1_1 \\ N^2_0 \\ N^2_1 \end{bmatrix}\right) - f\left( \begin{bmatrix} \varTheta ^1_0 \\ \varTheta ^1_1 \\ \varTheta ^2_0 \\ \varTheta ^2_1 \end{bmatrix}\right) \right) \longrightarrow N(0, \varTheta ^T\varSigma \varTheta ). \end{aligned}$$
(9)

We point out that (9) provides exactly the CLT we need to assess the difference between the two Gini indices. Expanding (9),

$$\begin{aligned} \sqrt{n} \left( ({\hat{g}}_{1,n}-{\hat{g}}_{2,n}) - (g_1 - g_2)\right) \longrightarrow N(0, \varTheta ^T \varSigma \varTheta ). \end{aligned}$$

or approximately,

$$\begin{aligned} ({\hat{g}}_{1,n} - {\hat{g}}_{2,n}) - (g_1 - g_2) \sim N \left( 0, \frac{\varTheta ^T \varSigma \varTheta }{n}\right) . \end{aligned}$$
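
For concreteness, the sketch below shows how this limiting result can be turned into a two-sided test comparing two candidate splits on a pseudo sample. It is a minimal illustration, not the implementation used in the paper: the asymptotic variance is estimated by a simple nonparametric bootstrap rather than the plug-in \(\varTheta ^T \varSigma \varTheta /n\) expression above, and a standard weighted child Gini impurity is used in place of the paper's exact normalization.

```python
import numpy as np
from scipy.stats import norm

def weighted_child_gini(left_mask, y, classes):
    """Sample-weighted Gini impurity of the two children induced by a split."""
    n = len(y)
    total = 0.0
    for mask in (left_mask, ~left_mask):
        if mask.sum() == 0:
            continue
        p = np.array([(y[mask] == c).mean() for c in classes])
        total += mask.sum() / n * (1.0 - np.sum(p ** 2))
    return total

def gini_difference_test(split1, split2, X, y, n_boot=500, rng=None):
    """Asymptotic z-test for g_1 - g_2 = 0 on a pseudo sample (X, y).
    split1/split2 are callables returning boolean 'left child' masks.
    The variance is estimated by bootstrap instead of the closed-form
    delta-method expression Theta' Sigma Theta / n from the text."""
    rng = np.random.default_rng(rng)
    classes = np.unique(y)
    obs = (weighted_child_gini(split1(X), y, classes)
           - weighted_child_gini(split2(X), y, classes))
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        Xb, yb = X[idx], y[idx]
        boots.append(weighted_child_gini(split1(Xb), yb, classes)
                     - weighted_child_gini(split2(Xb), yb, classes))
    se = np.std(boots, ddof=1)
    z = obs / se
    return obs, se, 2 * (1 - norm.cdf(abs(z)))  # estimate, std. error, two-sided p-value
```

For example, `split1 = lambda X: X[:, 0] < 0.3` and `split2 = lambda X: X[:, 1] < 0.7` compare a threshold split on the first covariate with one on the second.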

Appendix B: Additional results on AppTree

1.1 Real datasets

In this section we show the results of our method on eight datasets. Seven are available from the UCI repository (Lichman 2013) and one is the CAD-MDD data used by Gibbons et al. (2013). We manually split each dataset into training and test sets for validation. In the main body, Table 1 shows the number of covariates, the training and testing sample sizes, and the number of response levels for each dataset.

To specify the generative distribution of covariates before running our algorithm, we perturb the empirical distribution with Gaussian noise whose variance is approximately 1/50 of the range of the corresponding covariate. For discrete covariates, the probability of jumping to a neighboring category is set to 1/7.
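
A minimal sketch of a sampler of this kind is shown below, assuming continuous covariates and ordinal-encoded discrete covariates stored in a numeric array; the function name and interface are illustrative rather than taken from the released code.

```python
import numpy as np

def make_covariate_sampler(X_train, discrete_cols, jump_prob=1/7, var_frac=1/50, rng=None):
    """Return a sampler for pseudo covariates: resample training rows and
    perturb them -- Gaussian noise on continuous columns (variance ~ range/50)
    and a small probability of jumping to a neighboring category on discrete,
    ordinal-encoded columns."""
    rng = np.random.default_rng(rng)
    X_train = np.asarray(X_train, dtype=float)
    ranges = X_train.max(axis=0) - X_train.min(axis=0)
    cont_cols = [j for j in range(X_train.shape[1]) if j not in discrete_cols]

    def sample(n):
        rows = X_train[rng.integers(0, len(X_train), n)].copy()
        for j in cont_cols:
            rows[:, j] += rng.normal(0.0, np.sqrt(var_frac * ranges[j]), n)
        for j in discrete_cols:
            jump = rng.random(n) < jump_prob           # jump with probability 1/7
            step = rng.choice([-1, 1], size=n)          # move to an adjacent category
            rows[jump, j] = np.clip(rows[jump, j] + step[jump],
                                    X_train[:, j].min(), X_train[:, j].max())
        return rows

    return sample
```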

We compare four methods: classification trees (CART), random forests (RF), our proposed approximation tree (AppTree), and a baseline method (BASE). Previous work (Johansson and Niklasson 2009; Johansson et al. 2010) fixes the number of pseudo samples drawn from the oracle during the coaching procedure. Analogously, we set BASE to be a non-adaptive version of AppTree that requests a pseudo sample from the RF only once, at the root node, and uses it all the way down; a sketch of this baseline is given below. The pseudo sample size is set to 9 times the size of the training data; this is larger than the samples used by Johansson and Niklasson (2009) and Johansson et al. (2010), and is intended as a reasonable blind choice made without any prior information.
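
Assuming a fitted random forest teacher and a pseudo-covariate sampler such as the one sketched above, the BASE comparator can be written compactly as follows; this is an illustrative stand-in, not the authors' implementation.

```python
from sklearn.tree import DecisionTreeClassifier

def fit_base_tree(teacher, sampler, n_train, depth=6, ratio=9):
    """BASE: draw a single pseudo sample at the root (ratio x training size),
    label it with the teacher, and grow an ordinary CART on it."""
    X_ps = sampler(ratio * n_train)
    y_ps = teacher.predict(X_ps)                 # hard labels from the teacher
    # depth counts layers including the root, so at most depth - 1 levels of splits
    tree = DecisionTreeClassifier(max_depth=depth - 1)
    tree.fit(X_ps, y_ps)
    return tree
```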

We use the same settings for all datasets for consistency. For each dataset, we train an RF containing 200 trees and a CART tree, and then 100 AppTrees and 100 BASE trees approximating the RF. We set \(N_{ps} = 500{,}000\), meaning that each node of AppTree can generate at most \(5\times 10^5\) pseudo examples to decide its split. CART, BASE and AppTree all grow to the 6th layer, including the root. The confidence level \(\alpha \) is set to 0.1.

1.2 Binary classification

Fig. 12 Performance evaluation on binary classification datasets. From top to bottom: CAD-MDD, BreastCancer, Car, ClimateModel. From left to right: ROC curves, fidelity to RF on the testing set, fidelity to RF on new data points

Figure 12 shows the evaluation of the methods on the binary classification datasets. Johansson et al. (2011) pointed out that a single decision tree is already capable of mimicking an oracle predictor and making highly accurate predictions, provided the oracle is not overly complicated. Our simulations show similar results: CART, BASE and AppTree all tightly follow the ROC curves of the RF, with no significant difference among them. Fidelity is measured as the frequency with which a model agrees with the RF when predicting on the same input covariates. We use a moving threshold as the classification boundary and evaluate fidelity on both the testing data and the new data (marked "test" and "new" in the plot). First, all three 6-layer trees approximate the RF with over 80% agreement for almost any classification threshold, and there is no significant difference among them. Further, the "test" and "new" plots behave similarly, which lends support to our generative covariate distribution. While 80% fidelity alone may not be enough to consider these trees fully aligned with the oracle RF, we can build the trees deeper to better "overfit" the RF.
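
The fidelity curve used in these plots can be computed as in the sketch below, which assumes a binary teacher and student that both expose class-probability outputs; names are illustrative.

```python
import numpy as np

def fidelity_curve(teacher, student, X_eval, thresholds=np.linspace(0.05, 0.95, 19)):
    """Fraction of points on which the student agrees with the teacher when
    both are thresholded at the same moving classification boundary."""
    p_teacher = teacher.predict_proba(X_eval)[:, 1]
    p_student = student.predict_proba(X_eval)[:, 1]
    return [((p_teacher >= t) == (p_student >= t)).mean() for t in thresholds]
```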

1.3 Multiclass classification

Fig. 13 Performance evaluation on the multiclass (3-class) classification dataset Cardiotocography. ROC curves are plotted in a one-versus-all fashion. Fidelity is only checked on testing data

Fig. 14 Performance evaluation on the multiclass (3-class) classification datasets WineRed (top two rows) and WineWhite (bottom two rows). ROC curves are plotted in a one-versus-all fashion. Fidelity is only checked on testing data

Figures 13 and 14 evaluate performance on the three multiclass classification datasets. We observe patterns similar to the binary case: all three tree methods behave similarly. The ROC curves of the RF show poorer performance than in binary classification, and the ROC curves of the three tree methods are further from the random forest's, suggesting that deeper trees might be necessary. Nonetheless, all tree methods again agree with the RF on about 80% of the predictions made by the RF on the test data. We have therefore shown that the reproducibility requirement of AppTree does not undermine its predictive power or its fidelity to the black box when compared with other tree methods.

1.4 Impact of \(N_{ps}\) on distillation accuracy and fidelity

Table 3 Student accuracy and fidelity metrics on the real world datasets with varying \(N_{ps}\)

Table 3 shows that the student's accuracy and fidelity remain essentially unchanged as \(N_{ps}\) varies. This behavior is expected and is guaranteed by the distillation procedure: in this setting, \(N_{ps}\) plays the role of a reproducibility hyperparameter only.

1.5 Smoothing the sampler

Throughout we have assumed that \(F_X\), the distribution for generating covariates during approximation tree construction, is fixed by the practitioner. This allows differing modeling goals to influence its choice: we might want to ensure coverage of the whole space of possible feature values, or instead to localize to a particular patient of interest, an approach not dissimilar to LIME (Ribeiro et al. 2016).
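
As an illustration of the second, localized choice of \(F_X\), the sketch below draws pseudo covariates in a small neighborhood of a single query point; the bandwidths and the function itself are hypothetical and not part of the paper's experiments.

```python
import numpy as np

def make_local_sampler(x_query, scales, rng=None):
    """Localized F_X: pseudo covariates drawn near a single query point,
    in the spirit of LIME-style local explanations. `scales` gives the
    per-covariate noise standard deviations (an illustrative choice)."""
    rng = np.random.default_rng(rng)
    x_query = np.asarray(x_query, dtype=float)

    def sample(n):
        return x_query + rng.normal(0.0, scales, size=(n, len(x_query)))

    return sample
```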

In most of our empirical studies, we have employed a Gaussian kernel smoother of the training data as a means of focusing the approximation near the feature values of interest while providing broader coverage of the teacher's behavior. This choice is subject to two potential critiques: that the training data used to produce \(F_X\) is itself a source of uncertainty, and that simply specifying \(F_X\) to be the empirical distribution of the training data would remove all variability associated with it.

Here we conduct an experiment showing that, relative to simply using the training data, the Gaussian kernel smoother both produces greater distillation fidelity to the teacher and reduces the variability inherited from the training data distribution. In particular, we use the following response functions:

  • Setup 1: \(X \sim \text{Unif}[-1, 1]^3\),

    $$\begin{aligned} \text{logit}\,P(y=1|x) = -0.5\cdot \mathbb {1}_{\{x_1 < 0\}} + 0.5 \cdot \sin (\pi x_2 / 2)\cdot \mathbb {1}_{\{x_1 > 0\}}. \end{aligned}$$
  • Setup 2: \(X \sim \text{Unif}[-1, 1]^5\),

    $$\begin{aligned} \text{logit}\,P(y=1|x) = -0.5(1+x_3x_4)\cdot \mathbb {1}_{\{x_1 < 0\}} + 0.5 \left( \sin (\pi x_2 / 2) + \sin (\pi x_5 / 2)\right) \cdot \mathbb {1}_{\{x_1 > 0\}}. \end{aligned}$$

We repeat 100 experiments in each of the two settings. For each experiment we bootstrap a new sample of 1000 data points from a uniform distribution and either build a standard CART using the new sample and the teacher's responses, or apply the approximation tree algorithm with a kernel smoother using bandwidth \(\sigma = 0.2\). Figure 15 confirms that employing the smoother contributes to both reproducibility and faithfulness to the teacher model.
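
The sketch below reproduces the shape of this experiment for Setup 1 under simplifying assumptions: a plain CART of fixed depth stands in for the full AppTree procedure, tree structures are summarized only by their sequence of split features, and the depth and seeds are arbitrary. Setup 2 is analogous with the second logit function.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def logit_setup1(X):
    return -0.5 * (X[:, 0] < 0) + 0.5 * np.sin(np.pi * X[:, 1] / 2) * (X[:, 0] > 0)

def draw_data(logit_fn, n, d, rng):
    X = rng.uniform(-1, 1, size=(n, d))
    p = 1.0 / (1.0 + np.exp(-logit_fn(X)))
    return X, (rng.random(n) < p).astype(int)

def structure_signature(tree):
    # crude reproducibility summary: sequence of split features (leaves are -2 in sklearn)
    return tuple(tree.tree_.feature)

rng = np.random.default_rng(0)
X0, y0 = draw_data(logit_setup1, 1000, 3, rng)
teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X0, y0)

signatures = {"bootstrap": set(), "smoother": set()}
for _ in range(100):
    idx = rng.integers(0, 1000, 1000)                 # bootstrap of the original sample
    Xb = X0[idx]
    t_boot = DecisionTreeClassifier(max_depth=3).fit(Xb, teacher.predict(Xb))
    Xs = Xb + rng.normal(0.0, 0.2, Xb.shape)          # Gaussian kernel smoother, sigma = 0.2
    t_smooth = DecisionTreeClassifier(max_depth=3).fit(Xs, teacher.predict(Xs))
    signatures["bootstrap"].add(structure_signature(t_boot))
    signatures["smoother"].add(structure_signature(t_smooth))

# fewer unique structures indicates a more reproducible distillation
print({name: len(structs) for name, structs in signatures.items()})
```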

Fig. 15 Comparison between the bootstrap sampler and the Gaussian kernel smoother in approximation trees when the distillation is given a bootstrap of the original sample distribution. LHS: unique tree structures on two simple synthetic settings; kernel smoother outputs are much more stable. In each column, a single black bar represents a unique tree structure, and the height of the bar gives the number of occurrences of that structure out of 100 repetitions. RHS: prediction accuracy with respect to the teacher model

1.6 Reproducibility of approximation tree for neural networks

Most of our experiments are conducted with random forests, which come with statistical estimates of sampling variability to guide the stopping rule. However, the approximation tree can be used with any type of teacher model if we focus only on the first objective: ensuring that the distillation process reliably produces the same splits. In this part we present results obtained when the teacher model is a neural network.

We use the Breast Cancer data for demonstration purposes; see Sect. 5.4.2 for details. Here we build a neural network with one hidden layer of size 5. Figure 16 shows the reproducibility of the top 4 layers of AppTree for different values of \(N_{ps}\), the cap on the number of pseudo examples AppTree can generate.

We observe a pattern similar to Fig. 3: increasing \(N_{ps}\) improves reproducibility, but values larger than those often employed may be needed to make this satisfactory. Compared to random forests as teacher models (Fig. 9), the neural network yields a slightly more stable BASE. This is because the Breast Cancer data is a relatively simple classification task for a neural network, which is able to produce class probabilities close to 0 or 1. Our approach still improves the reproducibility of the approximation, resulting in fewer distinct structures.
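
A minimal sketch of the neural-network-teacher setting is given below, with a one-hidden-layer network of size 5 as in the text; the distillation step is again represented by a plain CART stand-in rather than the full AppTree procedure, and all parameter choices other than the hidden layer size are illustrative.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def distill_from_nn(X_train, y_train, sampler, n_ps=50_000, depth=4):
    """Train a small neural-network teacher, then fit a student tree on a
    pseudo sample labeled by the teacher."""
    teacher = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000).fit(X_train, y_train)
    X_ps = sampler(n_ps)
    proba = teacher.predict_proba(X_ps)      # soft labels, often close to 0 or 1 here
    y_ps = proba.argmax(axis=1)              # hard labels for a standard CART student
    student = DecisionTreeClassifier(max_depth=depth - 1).fit(X_ps, y_ps)
    return teacher, student
```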

Fig. 16 Reproducibility of AppTree with different \(N_{ps}\) values and a neural network teacher. The top 4 layers of the trees are summarized for \(N_{ps}=5{,}000\) and \(N_{ps}=50{,}000\), respectively. In each column, a single black bar represents a unique tree structure, and the height of the bar gives the number of occurrences of that structure out of 100 repetitions

Appendix C: Approximation tree for CAD-MDD data

See Table 4 for full details of the approximation tree for CAD-MDD data.

Table 4 Details of the approximation tree for the CAD-MDD data

Appendix D: Approximation tree for Breast Cancer Data

See Tables 5 and 6 for full details of the approximation tree for the Breast Cancer data. The dataset used has 8 attributes and 699 samples (see footnote 2).

Table 5 Feature names correspondence for Breast Cancer Data
Table 6 Details of the approximation tree for Breast Cancer Data

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Zhou, Y., Zhou, Z. & Hooker, G. Approximation trees: statistical reproducibility in model distillation. Data Min Knowl Disc 38, 3308–3346 (2024). https://doi.org/10.1007/s10618-022-00907-3
