Approximation trees: statistical reproducibility in model distillation

Abstract

This paper examines the reproducibility of learned explanations for black-box predictions via model distillation using classification trees. We find that common tree distillation methods fail to reproduce a single stable explanation when applied to the same teacher model, due to the randomness of the distillation process. We study this issue of reliable interpretation and propose a standardized framework for tree distillation that achieves reproducibility. The proposed framework consists of (1) a statistical test to stabilize tree splits, and (2) a stopping rule for tree building when using a teacher that provides an estimate of the uncertainty of its predictions, e.g., random forests. We demonstrate the empirical performance of the proposed distillation method on a variety of synthetic and real-world datasets.

Data Availability Statement

See Appendix B.1.

Code Availability

All empirical study code files are available at https://github.com/siriuz42/approximation_trees.git.

Notes

  1. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

  2. https://www.rdocumentation.org/packages/mlbench/versions/2.1-3/topics/BreastCancer.

References

  • Angelino E, Larus-Stone N, Alabi D, Seltzer M, Rudin C (2018) Learning certifiably optimal rule lists for categorical data. J Mach Learn Res 18(234):1–78

  • Athey S, Tibshirani J, Wager S (2016) Generalized random forests. arXiv preprint arXiv:1610.01271

  • Augasta MG, Kathirvalavakumar T (2012) Reverse engineering the neural networks for rule extraction in classification problems. Neural Process Lett 35(2):131–150

  • Banerjee M, McKeague IW et al (2007) Confidence sets for split points in decision trees. Ann Stat 35(2):543–574

  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57(1):289–300

  • Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Breiman L, Shang N (1996) Born again trees. University of California, Berkeley, Berkeley, CA, Technical Report 1(2):4

  • Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Boca Raton

  • Bucila C, Caruana R, Niculescu-Mizil A (2006) Model compression: making big, slow models practical. In: Proceedings of the 12th international conference on knowledge discovery and data mining (KDD’06)

  • Buciluǎ C, Caruana R, Niculescu-Mizil A (2006) Model compression. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 535–541

  • Chipman HA, George EI, McCulloch RE et al (2010) BART: Bayesian additive regression trees. Ann Appl Stat 4(1):266–298

  • Chouldechova A (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2):153–163

  • Craven MW, Shavlik JW (1995) Extracting tree-structured representations of trained networks. In: NIPS

  • Croz JJD, Higham NJ (1992) Stability of methods for matrix inversion. IMA J Numer Anal 12(1):1–19

  • Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 50(272):1096–1121

  • Ghosal I, Hooker G (2018) Boosting random forests to reduce bias; one-step boosted forest and its variance estimate. arXiv preprint arXiv:1803.08000

  • Gibbons RD, Hooker G, Finkelman MD, Weiss DJ, Pilkonis PA, Frank E, Moore T, Kupfer DJ (2013) The computerized adaptive diagnostic test for major depressive disorder (cad-mdd): a screening tool for depression. J Clin Psychiatry 74(7):1–478

  • Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: a survey. Int J Comput Vis 129(6):1789–1819

  • He H, Eisner J, Daume H (2012) Imitation learning by coaching. In: Advances in neural information processing systems, pp 3149–3157

  • Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  • Hu L, Chen J, Nair VN, Sudjianto A (2018) Locally interpretable models and effects based on supervised partitioning (LIME-SUP). ArXiv e-prints arXiv:1806.00663

  • Johansson U, Niklasson L (2009) Evolving decision trees using oracle guides. In: IEEE symposium on computational intelligence and data mining, 2009 (CIDM’09). IEEE, pp 238–244

  • Johansson U, Sönströd C, Löfström T (2010) Oracle coached decision trees and lists. In: International symposium on intelligent data analysis. Springer, pp 67–78

  • Johansson U, Sönströd C, Löfström T (2011) One tree to explain them all. In: 2011 IEEE congress on evolutionary computation (CEC). IEEE, pp 1444–1451

  • Krishnan R, Sivakumar G, Bhattacharya P (1999) Extracting decision trees from trained neural networks. Pattern Recognit 32(12):66

  • Last M, Maimon O, Minkov E (2002) Improving stability of decision trees. Int J Pattern Recognit Artif Intell 16(02):145–159

  • Li RH, Belford GG (2002) Instability of decision tree classification algorithms. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, pp 570–575

  • Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Lucas D, Klein R, Tannahill J, Ivanova D, Brandon S, Domyancic D, Zhang Y (2013) Failure analysis of parameter-induced simulation crashes in climate models. Geosci Model Dev 6(4):1157–1171

  • Maddox WJ, Izmailov P, Garipov T, Vetrov DP, Wilson AG (2019) A simple baseline for Bayesian uncertainty in deep learning. In: Advances in neural information processing systems, pp 13153–13164

  • Mangasarian OL, Street WN, Wolberg WH (1995) Breast cancer diagnosis and prognosis via linear programming. Oper Res 43(4):570–577

  • Mentch L, Hooker G (2016) Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res 17(1):841–881

  • Mentch L, Hooker G (2017) Formal hypothesis tests for additive structure in random forests. J Comput Graph Stat 26(3):589–597

  • Mobahi H, Farajtabar M, Bartlett PL (2020) Self-distillation amplifies regularization in Hilbert space. arXiv preprint arXiv:2002.05715

  • Peng W, Coleman T, Mentch L (2019) Asymptotic distributions and rates of convergence for random forests via generalized u-statistics. arXiv preprint arXiv:1905.10651

  • Pham H, Dai Z, Xie Q, Le QV (2021) Meta pseudo labels. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11557–11568

  • Quinlan JR (1987) Generating production rules from decision trees. In: IJCAI, vol 87, pp 304–307. Citeseer

  • Quinlan JR (2014) C4.5: programs for machine learning. Elsevier

  • Ribeiro MT, Singh S, Guestrin C (2016) Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1135–1144

  • Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1(5):206–215

  • Sexton J, Laake P (2009) Standard errors for bagged and random forest estimators. Comput Stat Data Anal 53(3):801–811

  • Strecht P (2015) A survey of merging decision trees data mining approaches. In: Proceedings of the 10th doctoral symposium in informatics engineering, pp 36–47

  • Wachter S, Mittelstadt B, Russell C (2017) Counterfactual explanations without opening the black box: automated decisions and the gdpr. Harv J Law Technol 31:841

  • Wager S, Athey S (2017) Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc 6:66

  • Wager S, Hastie T, Efron B (2014) Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 15(1):1625–1651

  • Wang F, Rudin C (2015) Falling rule lists. In: Artificial intelligence and statistics, pp 1013–1022

  • Wolberg WH, Mangasarian OL (1990) Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci 87(23):9193–9196

  • Yuan L, Tay FE, Li G, Wang T, Feng J (2020) Revisiting knowledge distillation via label smoothing regularization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3903–3911

  • Zhang J (1992) Selecting typical instances in instance-based learning. In: Machine learning proceedings. Elsevier, pp 470–479

  • Zhou ZH (2019) Ensemble methods: foundations and algorithms. Chapman and Hall/CRC

  • Zhou Y, Hooker G (2018) Boulevard: regularized stochastic gradient boosted trees and their limiting distribution. ArXiv e-prints

  • Zhou Z, Mentch L, Hooker G (2021) V-statistics and variance estimation. J Mach Learn Res 22(287):1–48

Acknowledgements

The authors gratefully acknowledge National Institutes of Health (NIH) grant R03DA036683 and National Science Foundation (NSF) grants DMS-1053252 and DEB-1353039. The authors would like to thank Robert Gibbons for providing the CAD-MDD data.

Funding

NIH R03DA036683, NSF DMS-1053252 and NSF DEB-1353039.

Author information

Corresponding author

Correspondence to Zhengze Zhou.

Ethics declarations

Conflicts of interest

The authors of the paper have affiliations with Cornell University, the University of California, Berkeley, Google and Meta.

Additional information

Responsible editor: Martin Atzmueller, Johannes Fürnkranz, Tomáš Kliegr and Ute Schmid.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Limiting distribution of Gini index

Following the discussion above, we now analyze a pseudo sample. Using similar notation, we denote the numbers of sample points and the class-label probabilities in each child by, for \(p \in \{1,2\}, j \in \{1,\dots ,k\},\)

$$\begin{aligned} n^p_0 = \sum _{i=1}^n \mathbb {1}_{\{G_p(X_i)=0\}}, \quad {\hat{\theta }}^p_{0,j} = \frac{1}{n^p_0} \sum _{i=1}^n Y_i^j\cdot \mathbb {1}_{\{G_p(X_i)=0\}}, \\ n^p_1 = \sum _{i=1}^n \mathbb {1}_{\{G_p(X_i)=1\}}, \quad {\hat{\theta }}^p_{1,j} = \frac{1}{n^p_1} \sum _{i=1}^n Y_i^j\cdot \mathbb {1}_{\{G_p(X_i)=1\}}. \end{aligned}$$

For simplicity we write, for \(p \in \{1,2\},q \in \{0,1\},\) \( N^p_q = ( n^p_q{\hat{\theta }}^p_{q,1}, \dots , n^p_q{\hat{\theta }}^p_{q,k} )^T. \) Employing a multivariate CLT we obtain

$$\begin{aligned} \sqrt{n} \left( \frac{1}{n}\begin{bmatrix} N^1_0 \\ N^1_1 \\ N^2_0 \\ N^2_1 \end{bmatrix} - \begin{bmatrix} \varTheta ^1_0 \\ \varTheta ^1_1 \\ \varTheta ^2_0 \\ \varTheta ^2_1 \end{bmatrix} \right) \longrightarrow N(0, \varSigma ). \end{aligned}$$

To relate this limiting distribution to the difference of Gini indices we shall employ the \(\delta \)-method. Consider the analytic function \(f: {\mathbb {R}}^{4k} \rightarrow {\mathbb {R}}\) s.t.

$$\begin{aligned} f(x_1, \dots , x_{4k}) =&-\frac{1}{\pi ^1_0}\sum _{i=1}^k x_i^2 -\frac{1}{\pi ^1_1} \sum _{i=k+1}^{2k} x_i^2 \\&+\frac{1}{\pi ^2_0} \sum _{i=2k+1}^{3k} x_i^2 +\frac{1}{\pi ^2_1} \sum _{i=3k+1}^{4k} x_i^2. \end{aligned}$$

The \(\delta \)-method implies that

$$\begin{aligned} \sqrt{n} \left( f\left( \frac{1}{n}\begin{bmatrix} N^1_0 \\ N^1_1 \\ N^2_0 \\ N^2_1 \end{bmatrix}\right) - f\left( \begin{bmatrix} \varTheta ^1_0 \\ \varTheta ^1_1 \\ \varTheta ^2_0 \\ \varTheta ^2_1 \end{bmatrix}\right) \right) \longrightarrow N(0, \varTheta ^T\varSigma \varTheta ). \end{aligned}$$
(9)

We point out that (9) provides exactly the CLT we need to assess the difference between the two Gini indices. Expanding (9),

$$\begin{aligned} \sqrt{n} \left( ({\hat{g}}_{1,n}-{\hat{g}}_{2,n}) - (g_1 - g_2)\right) \longrightarrow N(0, \varTheta ^T \varSigma \varTheta ). \end{aligned}$$

or approximately,

$$\begin{aligned} ({\hat{g}}_{1,n} - {\hat{g}}_{2,n}) - (g_1 - g_2) \sim N \left( 0, \frac{\varTheta ^T \varSigma \varTheta }{n}\right) . \end{aligned}$$
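
For concreteness, the sketch below shows how this limiting result can be turned into a two-sided test comparing two candidate splits on a pseudo sample. It is a minimal illustration, not the implementation used in the paper: the asymptotic variance is estimated by a simple nonparametric bootstrap rather than the plug-in \(\varTheta ^T \varSigma \varTheta /n\) expression above, and a standard weighted child Gini impurity is used in place of the paper's exact normalization.

```python
import numpy as np
from scipy.stats import norm

def weighted_child_gini(left_mask, y, classes):
    """Sample-weighted Gini impurity of the two children induced by a split."""
    n = len(y)
    total = 0.0
    for mask in (left_mask, ~left_mask):
        if mask.sum() == 0:
            continue
        p = np.array([(y[mask] == c).mean() for c in classes])
        total += mask.sum() / n * (1.0 - np.sum(p ** 2))
    return total

def gini_difference_test(split1, split2, X, y, n_boot=500, rng=None):
    """Asymptotic z-test for g_1 - g_2 = 0 on a pseudo sample (X, y).
    split1/split2 are callables returning boolean 'left child' masks.
    The variance is estimated by bootstrap instead of the closed-form
    delta-method expression Theta' Sigma Theta / n from the text."""
    rng = np.random.default_rng(rng)
    classes = np.unique(y)
    obs = (weighted_child_gini(split1(X), y, classes)
           - weighted_child_gini(split2(X), y, classes))
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        Xb, yb = X[idx], y[idx]
        boots.append(weighted_child_gini(split1(Xb), yb, classes)
                     - weighted_child_gini(split2(Xb), yb, classes))
    se = np.std(boots, ddof=1)
    z = obs / se
    return obs, se, 2 * (1 - norm.cdf(abs(z)))  # estimate, std. error, two-sided p-value
```

For example, `split1 = lambda X: X[:, 0] < 0.3` and `split2 = lambda X: X[:, 1] < 0.7` compare a threshold split on the first covariate with one on the second.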

Appendix B: Additional results on AppTree

1.1 Real datasets

In this section we show the results of our method on eight datasets. Seven are available from the UCI repository (Lichman 2013) and one is the CAD-MDD data used by Gibbons et al. (2013). We manually split each dataset into training and test sets for validation. In the main body, Table 1 shows the number of covariates, the training and testing sample sizes, and the number of response levels for each dataset.

To specify the generative distribution of covariates before running our algorithm, we perturb the empirical distribution with Gaussian noise whose variance is approximately 1/50 of the range of the corresponding covariate. For discrete covariates, the probability of jumping to a neighboring category is set to 1/7.
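
A minimal sketch of a sampler of this kind is shown below, assuming continuous covariates and ordinal-encoded discrete covariates stored in a numeric array; the function name and interface are illustrative rather than taken from the released code.

```python
import numpy as np

def make_covariate_sampler(X_train, discrete_cols, jump_prob=1/7, var_frac=1/50, rng=None):
    """Return a sampler for pseudo covariates: resample training rows and
    perturb them -- Gaussian noise on continuous columns (variance ~ range/50)
    and a small probability of jumping to a neighboring category on discrete,
    ordinal-encoded columns."""
    rng = np.random.default_rng(rng)
    X_train = np.asarray(X_train, dtype=float)
    ranges = X_train.max(axis=0) - X_train.min(axis=0)
    cont_cols = [j for j in range(X_train.shape[1]) if j not in discrete_cols]

    def sample(n):
        rows = X_train[rng.integers(0, len(X_train), n)].copy()
        for j in cont_cols:
            rows[:, j] += rng.normal(0.0, np.sqrt(var_frac * ranges[j]), n)
        for j in discrete_cols:
            jump = rng.random(n) < jump_prob           # jump with probability 1/7
            step = rng.choice([-1, 1], size=n)          # move to an adjacent category
            rows[jump, j] = np.clip(rows[jump, j] + step[jump],
                                    X_train[:, j].min(), X_train[:, j].max())
        return rows

    return sample
```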

We compare four methods: classification trees (CART), random forests (RF), our proposed approximation tree (AppTree), and a baseline method (BASE). Previous work (Johansson and Niklasson 2009; Johansson et al. 2010) fixes the number of pseudo samples drawn from the oracle during the coaching procedure. Analogously, we set BASE to be a non-adaptive version of AppTree that requests a pseudo sample from the RF only once, at the root node, and uses it all the way down; a sketch of this baseline is given below. The pseudo sample size is set to 9 times the size of the training data; this is larger than the samples used by Johansson and Niklasson (2009) and Johansson et al. (2010), and is intended as a reasonable blind choice made without any prior information.
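
Assuming a fitted random forest teacher and a pseudo-covariate sampler such as the one sketched above, the BASE comparator can be written compactly as follows; this is an illustrative stand-in, not the authors' implementation.

```python
from sklearn.tree import DecisionTreeClassifier

def fit_base_tree(teacher, sampler, n_train, depth=6, ratio=9):
    """BASE: draw a single pseudo sample at the root (ratio x training size),
    label it with the teacher, and grow an ordinary CART on it."""
    X_ps = sampler(ratio * n_train)
    y_ps = teacher.predict(X_ps)                 # hard labels from the teacher
    # depth counts layers including the root, so at most depth - 1 levels of splits
    tree = DecisionTreeClassifier(max_depth=depth - 1)
    tree.fit(X_ps, y_ps)
    return tree
```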

We use the same settings for all datasets for consistency. For each dataset, we train an RF containing 200 trees and a CART tree, and then 100 AppTrees and 100 BASE trees approximating the RF. We set \(N_{ps} = 500{,}000\), meaning that each node of AppTree can generate at most \(5\times 10^5\) pseudo examples to decide its split. CART, BASE and AppTree all grow to the 6th layer, including the root. The confidence level \(\alpha \) is set to 0.1.

1.2 Binary classification

Fig. 12 Performance evaluation on binary classification datasets. From top to bottom: CAD-MDD, BreastCancer, Car, ClimateModel. From left to right: ROC curves, fidelity to RF on the testing set, fidelity to RF on new data points

Figure 12 shows the evaluation of the methods on the binary classification datasets. Johansson et al. (2011) pointed out that a single decision tree is already capable of mimicking an oracle predictor and making highly accurate predictions, provided the oracle is not overly complicated. Our simulations show similar results: CART, BASE and AppTree all tightly follow the ROC curves of the RF, with no significant difference among them. Fidelity is measured as the frequency with which a model agrees with the RF when predicting on the same input covariates. We use a moving threshold as the classification boundary and evaluate fidelity on both the testing data and the new data (marked "test" and "new" in the plot). First, all three 6-layer trees approximate the RF with over 80% agreement for almost any classification threshold, and there is no significant difference among them. Further, the "test" and "new" plots behave similarly, which lends support to our generative covariate distribution. While 80% fidelity alone may not be enough to consider these trees fully aligned with the oracle RF, we can build the trees deeper to better "overfit" the RF.
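
The fidelity curve used in these plots can be computed as in the sketch below, which assumes a binary teacher and student that both expose class-probability outputs; names are illustrative.

```python
import numpy as np

def fidelity_curve(teacher, student, X_eval, thresholds=np.linspace(0.05, 0.95, 19)):
    """Fraction of points on which the student agrees with the teacher when
    both are thresholded at the same moving classification boundary."""
    p_teacher = teacher.predict_proba(X_eval)[:, 1]
    p_student = student.predict_proba(X_eval)[:, 1]
    return [((p_teacher >= t) == (p_student >= t)).mean() for t in thresholds]
```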

1.3 Multiclass classification

Fig. 13 Performance evaluation on the multiclass (3-class) classification dataset Cardiotocography. ROC curves are plotted in a one-versus-all fashion. Fidelity is only checked on testing data

Fig. 14 Performance evaluation on the multiclass (3-class) classification datasets WineRed (top two rows) and WineWhite (bottom two rows). ROC curves are plotted in a one-versus-all fashion. Fidelity is only checked on testing data

Figures 13 and 14 evaluate performance on the three multiclass classification datasets. We observe patterns similar to the binary case: all three tree methods behave similarly. The ROC curves of the RF show poorer performance than in binary classification, and the ROC curves of the three tree methods are further from the random forest's, suggesting that deeper trees might be necessary. Nonetheless, all tree methods again agree with the RF on about 80% of the predictions made by the RF on the test data. We have therefore shown that the reproducibility requirement of AppTree does not undermine its predictive power or its fidelity to the black box when compared with other tree methods.

1.4 Impact of \(N_{ps}\) on distillation accuracy and fidelity

Table 3 Student accuracy and fidelity metrics on the real world datasets with varying \(N_{ps}\)

Table 3 shows that the student's accuracy and fidelity remain essentially unchanged as \(N_{ps}\) varies. This behavior is expected and is guaranteed by the distillation procedure: in this setting, \(N_{ps}\) plays the role of a reproducibility hyperparameter only.

1.5 Smoothing the sampler

Throughout we have assumed that \(F_X\), the distribution for generating covariates during approximation tree construction, is fixed by the practitioner. This allows differing modeling goals to influence its choice: we might want to ensure coverage of the whole space of possible feature values, or instead to localize to a particular patient of interest, an approach not dissimilar to LIME (Ribeiro et al. 2016).
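
As an illustration of the second, localized choice of \(F_X\), the sketch below draws pseudo covariates in a small neighborhood of a single query point; the bandwidths and the function itself are hypothetical and not part of the paper's experiments.

```python
import numpy as np

def make_local_sampler(x_query, scales, rng=None):
    """Localized F_X: pseudo covariates drawn near a single query point,
    in the spirit of LIME-style local explanations. `scales` gives the
    per-covariate noise standard deviations (an illustrative choice)."""
    rng = np.random.default_rng(rng)
    x_query = np.asarray(x_query, dtype=float)

    def sample(n):
        return x_query + rng.normal(0.0, scales, size=(n, len(x_query)))

    return sample
```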

In most of our empirical studies, we have employed a Gaussian kernel smoother of the training data as a means of focusing the approximation near the feature values of interest while providing broader coverage of the teacher's behavior. This choice is subject to two potential critiques: that the training data used to produce \(F_X\) is itself a source of uncertainty, and that simply specifying \(F_X\) to be the empirical distribution of the training data would remove all variability associated with it.

Here we conduct an experiment showing that, relative to simply using the training data, the Gaussian kernel smoother both produces greater distillation fidelity to the teacher and reduces the variability inherited from the training data distribution. In particular, we use the following response functions:

  • Setup 1: \(X \sim \text{Unif}[-1, 1]^3\),

    $$\begin{aligned} \text{logit}\,P(y=1|x) = -0.5\cdot \mathbb {1}_{\{x_1 < 0\}} + 0.5 \cdot \sin (\pi x_2 / 2)\cdot \mathbb {1}_{\{x_1 > 0\}}. \end{aligned}$$
  • Setup 2: \(X \sim \text{Unif}[-1, 1]^5\),

    $$\begin{aligned} \text{logit}\,P(y=1|x) = -0.5(1+x_3x_4)\cdot \mathbb {1}_{\{x_1 < 0\}} + 0.5 \left( \sin (\pi x_2 / 2) + \sin (\pi x_5 / 2)\right) \cdot \mathbb {1}_{\{x_1 > 0\}}. \end{aligned}$$

We repeat 100 experiments in each of the two settings. For each experiment we bootstrap a new sample of 1000 data points from a uniform distribution and either build a standard CART using the new sample and the teacher's responses, or apply the approximation tree algorithm with a kernel smoother using bandwidth \(\sigma = 0.2\). Figure 15 confirms that employing the smoother contributes to both reproducibility and faithfulness to the teacher model.
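
The sketch below reproduces the shape of this experiment for Setup 1 under simplifying assumptions: a plain CART of fixed depth stands in for the full AppTree procedure, tree structures are summarized only by their sequence of split features, and the depth and seeds are arbitrary. Setup 2 is analogous with the second logit function.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def logit_setup1(X):
    return -0.5 * (X[:, 0] < 0) + 0.5 * np.sin(np.pi * X[:, 1] / 2) * (X[:, 0] > 0)

def draw_data(logit_fn, n, d, rng):
    X = rng.uniform(-1, 1, size=(n, d))
    p = 1.0 / (1.0 + np.exp(-logit_fn(X)))
    return X, (rng.random(n) < p).astype(int)

def structure_signature(tree):
    # crude reproducibility summary: sequence of split features (leaves are -2 in sklearn)
    return tuple(tree.tree_.feature)

rng = np.random.default_rng(0)
X0, y0 = draw_data(logit_setup1, 1000, 3, rng)
teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X0, y0)

signatures = {"bootstrap": set(), "smoother": set()}
for _ in range(100):
    idx = rng.integers(0, 1000, 1000)                 # bootstrap of the original sample
    Xb = X0[idx]
    t_boot = DecisionTreeClassifier(max_depth=3).fit(Xb, teacher.predict(Xb))
    Xs = Xb + rng.normal(0.0, 0.2, Xb.shape)          # Gaussian kernel smoother, sigma = 0.2
    t_smooth = DecisionTreeClassifier(max_depth=3).fit(Xs, teacher.predict(Xs))
    signatures["bootstrap"].add(structure_signature(t_boot))
    signatures["smoother"].add(structure_signature(t_smooth))

# fewer unique structures indicates a more reproducible distillation
print({name: len(structs) for name, structs in signatures.items()})
```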

Fig. 15 Comparison between the bootstrap sampler and the Gaussian kernel smoother in approximation trees when the distillation is given a bootstrap of the original sample distribution. LHS: unique tree structures on two simple synthetic settings; kernel smoother outputs are much more stable. In each column, a single black bar represents a unique tree structure, and the height of the bar gives the number of occurrences of that structure out of 100 repetitions. RHS: prediction accuracy with respect to the teacher model

1.6 Reproducibility of approximation tree for neural networks

Most of our experiments are conducted with random forests, which come with statistical estimates of sampling variability to guide the stopping rule. However, the approximation tree can be used with any type of teacher model if we focus only on the first objective: ensuring that the distillation process reliably produces the same splits. In this part we present results obtained when the teacher model is a neural network.

We use the Breast Cancer data for demonstration purposes; see Sect. 5.4.2 for details. Here we build a neural network with one hidden layer of size 5. Figure 16 shows the reproducibility of the top 4 layers of AppTree for different values of \(N_{ps}\), the cap on the number of pseudo examples AppTree can generate.

We observe a pattern similar to Fig. 3: increasing \(N_{ps}\) improves reproducibility, but values larger than those often employed may be needed to make this satisfactory. Compared to random forests as teacher models (Fig. 9), the neural network yields a slightly more stable BASE. This is because the Breast Cancer data is a relatively simple classification task for a neural network, which is able to produce class probabilities close to 0 or 1. Our approach still improves the reproducibility of the approximation, resulting in fewer distinct structures.
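
A minimal sketch of the neural-network-teacher setting is given below, with a one-hidden-layer network of size 5 as in the text; the distillation step is again represented by a plain CART stand-in rather than the full AppTree procedure, and all parameter choices other than the hidden layer size are illustrative.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def distill_from_nn(X_train, y_train, sampler, n_ps=50_000, depth=4):
    """Train a small neural-network teacher, then fit a student tree on a
    pseudo sample labeled by the teacher."""
    teacher = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000).fit(X_train, y_train)
    X_ps = sampler(n_ps)
    proba = teacher.predict_proba(X_ps)      # soft labels, often close to 0 or 1 here
    y_ps = proba.argmax(axis=1)              # hard labels for a standard CART student
    student = DecisionTreeClassifier(max_depth=depth - 1).fit(X_ps, y_ps)
    return teacher, student
```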

Fig. 16 Reproducibility of AppTree with different \(N_{ps}\) values and a neural network teacher. The top 4 layers of the trees are summarized for \(N_{ps}=5{,}000\) and \(N_{ps}=50{,}000\), respectively. In each column, a single black bar represents a unique tree structure, and the height of the bar gives the number of occurrences of that structure out of 100 repetitions

Appendix C: Approximation tree for CAD-MDD data

See Table 4 for full details of the approximation tree for CAD-MDD data.

Table 4 Details of the approximation tree for the CAD-MDD data

Appendix D: Approximation tree for Breast Cancer Data

See Tables 5 and 6 for full details of the approximation tree for the Breast Cancer data. The dataset used has 8 attributes and 699 samples (see footnote 2).

Table 5 Feature names correspondence for Breast Cancer Data
Table 6 Details of the approximation tree for Breast Cancer Data

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Zhou, Y., Zhou, Z. & Hooker, G. Approximation trees: statistical reproducibility in model distillation. Data Min Knowl Disc 38, 3308–3346 (2024). https://doi.org/10.1007/s10618-022-00907-3
