Abstract
Interpretable machine learning has demonstrated impressive performance while preserving explainability. In particular, neural additive models (NAM) bring interpretability to black-box deep learning and achieve state-of-the-art accuracy within the large family of generalized additive models. To empower NAM with feature selection and improve generalization, we propose sparse neural additive models (SNAM), which employ group sparsity regularization (e.g. Group LASSO): each feature is learned by a sub-network whose trainable parameters are clustered as a group. We study the theoretical properties of SNAM with novel techniques that handle the non-parametric truth, thus extending beyond classical sparse linear models such as the LASSO, which only work under a parametric truth. Specifically, we show that SNAM trained with subgradient or proximal gradient descent provably converges to zero training loss as \(t\rightarrow \infty \), and that the estimation error of SNAM vanishes asymptotically as \(n\rightarrow \infty \). We also prove that SNAM, like the LASSO, achieves exact support recovery, i.e. perfect feature selection, under appropriate regularization. Moreover, we show that SNAM generalizes well and preserves ‘identifiability’, recovering each feature’s effect. We validate our theory via extensive experiments and further demonstrate the accuracy and efficiency of SNAM. (The appendix can be found at https://arxiv.org/abs/2202.12482.)
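To make the additive structure concrete, here is a minimal NumPy sketch (our illustration, not the authors' released code) of the idea described above: each feature is fed to its own small sub-network, the model output is the sum of the sub-network outputs, and the Group LASSO penalty treats all parameters of one sub-network as a single group. The hidden width `m` and the function names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 3, 4  # number of features, hidden width per sub-network

# Parameters of the j-th sub-network: input weights W_j, biases b_j, output weights v_j.
params = [
    (rng.standard_normal((m, 1)), np.zeros(m), rng.standard_normal(m))
    for _ in range(p)
]

def snam_forward(x):
    """f(x) = sum_j f_j(x_j), each f_j a one-hidden-layer ReLU network."""
    out = 0.0
    for j, (W, b, v) in enumerate(params):
        h = np.maximum(W @ np.array([x[j]]) + b, 0.0)  # hidden activations
        out += v @ h
    return out

def group_lasso_penalty(lam):
    """lam * sum_j ||theta_j||_2, where theta_j stacks all of sub-network j's parameters."""
    return lam * sum(
        np.sqrt(sum(np.sum(t ** 2) for t in group)) for group in params
    )

x = np.array([0.5, -1.0, 2.0])
print(snam_forward(x), group_lasso_penalty(0.1))
```

Because the penalty is an l2 norm over each whole group (not a sum of absolute values), minimizing it tends to zero out entire sub-networks at once, which is what turns sparsity into feature selection.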
Notes
- 1.
The backfitting algorithm can be recovered from the SPAM algorithm (see appendix) when \(\lambda =0\).
- 2.
In the ‘Non-param truth’ column of Table 1, Yes/No indicates whether a model works without assuming that the truth is parametric.
- 3.
- 4.
When all sub-networks have the same architecture, we write \(M=mp\), where m is the width of the last hidden layer. More generally, if the j-th sub-network has last hidden layer width \(m_j\), then \(M=\sum _j m_j\).
- 5.
Code is available at https://github.com/ShiyunXu/SNAM.git.
- 6.
The data preprocessing follows https://github.com/propublica/compas-analysis.
- 7.
The original dataset has 168 features. We remove the column material and all columns with variance less than 5%.
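The exact support recovery discussed in the abstract comes from the proximal gradient step of the Group LASSO, which can drive an entire parameter group exactly to zero. A minimal sketch of that proximal operator (standard group soft-thresholding; our illustration, not the paper's code):

```python
import numpy as np

def prox_group_lasso(theta, thresh):
    """Proximal operator of thresh * ||theta||_2 (group soft-thresholding).

    Shrinks the whole group toward zero, and sets it exactly to zero
    when its l2 norm falls below `thresh` — deselecting that feature.
    """
    norm = np.linalg.norm(theta)
    if norm <= thresh:
        return np.zeros_like(theta)
    return (1.0 - thresh / norm) * theta

# A group with small norm is zeroed out (feature deselected)...
print(prox_group_lasso(np.array([0.1, -0.1]), 0.5))  # -> [0. 0.]
# ...while a large group is only shrunk, never sparsified element-wise.
print(prox_group_lasso(np.array([3.0, 4.0]), 0.5))   # -> [2.7 3.6]
```

In proximal gradient descent this operator is applied after each gradient step on the training loss, with `thresh` equal to the step size times \(\lambda\).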
Acknowledgements
SX is supported through partnership with GSK. PC was supported by grants from the National Science Foundation (IIS-2145164, CCF-2212519) and the Office of Naval Research (N00014-22-1-2255). IB is supported by the National Institute of Mental Health (R01MH116884).
Ethics declarations
Ethical Statement
High-stakes applications empowered by deep learning, such as healthcare and criminal justice, raise concerns about algorithms’ liability, fairness, and interpretability. Our method helps build fair, trustworthy, and explainable systems by seeking the reasons behind machine learning predictions. A system may base its predictions on discriminatory patterns without anyone realizing it (as with the COMPAS algorithm). Examining each feature’s contribution to the outcome makes it possible to avoid learning such biases. Our method is especially useful for high-dimensional datasets, such as medical tabular records. Hence, our paper has an important impact on ethical machine learning. Yet we emphasize that interpretable machine learning does not automatically guarantee trustworthiness: a model can still make mistakes and be biased against certain groups, even if it can explain why it does so.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, S., Bu, Z., Chaudhari, P., Barnett, I.J. (2023). Sparse Neural Additive Model: Interpretable Deep Learning with Feature Selection via Group Sparsity. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science(), vol 14171. Springer, Cham. https://doi.org/10.1007/978-3-031-43418-1_21
DOI: https://doi.org/10.1007/978-3-031-43418-1_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43417-4
Online ISBN: 978-3-031-43418-1