Abstract
Interpretable machine learning has demonstrated impressive performance while preserving explainability. In particular, neural additive models (NAM) bring interpretability to black-box deep learning and achieve state-of-the-art accuracy within the large family of generalized additive models. To empower NAM with feature selection and improve generalization, we propose sparse neural additive models (SNAM), which employ group sparsity regularization (e.g. Group LASSO): each feature is learned by a sub-network whose trainable parameters are clustered as a group. We study the theoretical properties of SNAM with novel techniques that handle the non-parametric truth, thus extending beyond classical sparse linear models such as the LASSO, which only work under a parametric truth. Specifically, we show that SNAM trained with subgradient or proximal gradient descent provably converges to zero training loss as \(t\rightarrow \infty \), and that the estimation error of SNAM vanishes asymptotically as \(n\rightarrow \infty \). We also prove that SNAM, like the LASSO, achieves exact support recovery, i.e. perfect feature selection, under appropriate regularization. Moreover, we show that SNAM generalizes well and preserves ‘identifiability’, recovering each feature’s effect. We validate our theory via extensive experiments and further demonstrate the accuracy and efficiency of SNAM. (The appendix can be found at https://arxiv.org/abs/2202.12482.)
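To make the additive structure concrete, here is a minimal NumPy sketch (our illustration, not the authors' released code) of the idea described above: each feature is fed to its own small sub-network, the model output is the sum of the sub-network outputs, and the Group LASSO penalty treats all parameters of one sub-network as a single group. The hidden width `m` and the function names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 3, 4  # number of features, hidden width per sub-network

# Parameters of the j-th sub-network: input weights W_j, biases b_j, output weights v_j.
params = [
    (rng.standard_normal((m, 1)), np.zeros(m), rng.standard_normal(m))
    for _ in range(p)
]

def snam_forward(x):
    """f(x) = sum_j f_j(x_j), each f_j a one-hidden-layer ReLU network."""
    out = 0.0
    for j, (W, b, v) in enumerate(params):
        h = np.maximum(W @ np.array([x[j]]) + b, 0.0)  # hidden activations
        out += v @ h
    return out

def group_lasso_penalty(lam):
    """lam * sum_j ||theta_j||_2, where theta_j stacks all of sub-network j's parameters."""
    return lam * sum(
        np.sqrt(sum(np.sum(t ** 2) for t in group)) for group in params
    )

x = np.array([0.5, -1.0, 2.0])
print(snam_forward(x), group_lasso_penalty(0.1))
```

Because the penalty is an l2 norm over each whole group (not a sum of absolute values), minimizing it tends to zero out entire sub-networks at once, which is what turns sparsity into feature selection.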
Notes
- 1.
The backfitting algorithm can be recovered from the SPAM algorithm (see appendix) when \(\lambda =0\).
- 2.
In the ‘Non-param truth’ column of Table 1, Yes/No indicates whether a model works without assuming that the truth is parametric.
- 3.
- 4.
When all sub-networks have the same architecture, we write \(M=mp\), where m is the width of the last hidden layer. More generally, if the j-th sub-network has last hidden layer width \(m_j\), then \(M=\sum _j m_j\).
- 5.
Code is available at https://github.com/ShiyunXu/SNAM.git.
- 6.
The data preprocessing follows https://github.com/propublica/compas-analysis.
- 7.
The original dataset has 168 features. We remove the column material and all columns with variance less than 5%.
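The exact support recovery discussed in the abstract comes from the proximal gradient step of the Group LASSO, which can drive an entire parameter group exactly to zero. A minimal sketch of that proximal operator (standard group soft-thresholding; our illustration, not the paper's code):

```python
import numpy as np

def prox_group_lasso(theta, thresh):
    """Proximal operator of thresh * ||theta||_2 (group soft-thresholding).

    Shrinks the whole group toward zero, and sets it exactly to zero
    when its l2 norm falls below `thresh` — deselecting that feature.
    """
    norm = np.linalg.norm(theta)
    if norm <= thresh:
        return np.zeros_like(theta)
    return (1.0 - thresh / norm) * theta

# A group with small norm is zeroed out (feature deselected)...
print(prox_group_lasso(np.array([0.1, -0.1]), 0.5))  # -> [0. 0.]
# ...while a large group is only shrunk, never sparsified element-wise.
print(prox_group_lasso(np.array([3.0, 4.0]), 0.5))   # -> [2.7 3.6]
```

In proximal gradient descent this operator is applied after each gradient step on the training loss, with `thresh` equal to the step size times \(\lambda\).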
Acknowledgements
SX is supported through partnership with GSK. PC was supported by grants from the National Science Foundation (IIS-2145164, CCF-2212519) and the Office of Naval Research (N00014-22-1-2255). IB is supported by the National Institute of Mental Health (R01MH116884).
Ethics declarations
Ethical Statement
High-stakes applications empowered by deep learning, such as healthcare and criminal justice, raise concerns about algorithms’ liability, fairness, and interpretability. Our method helps build fair, trustworthy, and explainable systems by seeking the reasons behind machine learning predictions. A system may base its predictions on discriminatory patterns without anyone realizing it (as with the COMPAS algorithm). Examining each feature’s contribution to the outcome makes it possible to avoid learning such biases. Our method is especially useful for high-dimensional datasets, such as medical tabular records. Hence, our paper has an important impact on ethical machine learning. Yet we emphasize that interpretable machine learning does not automatically guarantee trustworthiness: a model can still make mistakes and be biased against certain groups, even if it can explain why it does so.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, S., Bu, Z., Chaudhari, P., Barnett, I.J. (2023). Sparse Neural Additive Model: Interpretable Deep Learning with Feature Selection via Group Sparsity. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science(), vol 14171. Springer, Cham. https://doi.org/10.1007/978-3-031-43418-1_21
DOI: https://doi.org/10.1007/978-3-031-43418-1_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43417-4
Online ISBN: 978-3-031-43418-1