Sparse Neural Additive Model: Interpretable Deep Learning with Feature Selection via Group Sparsity

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14171))

Abstract

Interpretable machine learning has demonstrated impressive performance while preserving explainability. In particular, neural additive models (NAM) bring interpretability to black-box deep learning and achieve state-of-the-art accuracy within the large family of generalized additive models. In order to empower NAM with feature selection and improve generalization, we propose the sparse neural additive model (SNAM), which employs group sparsity regularization (e.g. the Group LASSO), where each feature is learned by a sub-network whose trainable parameters are clustered as a group. We study the theoretical properties of SNAM with novel techniques that tackle the non-parametric truth, thus extending beyond classical sparse linear models such as the LASSO, which only work with a parametric truth. Specifically, we show that SNAM trained with subgradient or proximal gradient descent provably converges to zero training loss as \(t\rightarrow \infty \), and that the estimation error of SNAM vanishes asymptotically as \(n\rightarrow \infty \). We also prove that SNAM, like the LASSO, achieves exact support recovery, i.e. perfect feature selection, under appropriate regularization. Moreover, we show that SNAM generalizes well and preserves 'identifiability', recovering each feature's effect. We validate our theories via extensive experiments and further demonstrate the accuracy and efficiency of SNAM (the appendix can be found at https://arxiv.org/abs/2202.12482).
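To make the model described above concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the sub-network architecture, hidden width, penalty strength lam, and learning rate lr are illustrative assumptions. It builds one small sub-network per feature, sums their outputs additively, and alternates a gradient step on the data-fitting loss with a proximal group-soft-thresholding step that applies the Group LASSO penalty to each sub-network's parameters as one group.

    # Minimal SNAM sketch: per-feature sub-networks + Group LASSO via proximal gradient descent.
    # Hyperparameters (hidden width, lam, lr) are illustrative, not the paper's settings.
    import torch
    import torch.nn as nn

    class SNAM(nn.Module):
        def __init__(self, n_features, hidden=16):
            super().__init__()
            # One sub-network per feature; its trainable parameters form one group.
            self.subnets = nn.ModuleList([
                nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
                for _ in range(n_features)
            ])

        def forward(self, x):  # x: (batch, n_features)
            # Additive prediction: sum of per-feature effects f_j(x_j).
            return sum(net(x[:, j:j + 1]) for j, net in enumerate(self.subnets)).squeeze(-1)

    def prox_group_lasso(model, lam, lr):
        # Group soft-thresholding: shrink each sub-network's parameter vector,
        # zeroing the whole group (i.e. dropping the feature) if its norm is small.
        with torch.no_grad():
            for net in model.subnets:
                params = list(net.parameters())
                norm = torch.sqrt(sum((p ** 2).sum() for p in params))
                scale = torch.clamp(1.0 - lr * lam / (norm + 1e-12), min=0.0)
                for p in params:
                    p.mul_(scale)

    # Toy usage: only the first two of ten features matter.
    torch.manual_seed(0)
    X = torch.randn(512, 10)
    y = torch.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * torch.randn(512)

    model, lr, lam = SNAM(n_features=10), 1e-2, 1e-3
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for step in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()                        # gradient step on the squared loss
        prox_group_lasso(model, lam, lr)  # proximal step on the group penalty

Any sub-network whose group norm falls below lr * lam is set exactly to zero by the proximal step, which is how this sketch performs feature selection.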

Notes

  1. The backfitting algorithm can be recovered from the SPAM algorithm (see appendix) when \(\lambda =0\).

  2. In the 'Non-param truth' column of Table 1, Yes/No indicates whether a model works without assuming that the truth is parametric.

  3. Unfortunately, \(\boldsymbol{\varTheta }(t)\) will be pushed away from its initialization \(\boldsymbol{\varTheta }(0)\) towards zero even under weak regularization, breaking the lazy training assumption [13, 15].

  4. When all sub-networks share the same architecture, we write \(M=mp\), where m is the width of the last hidden layer. More generally, if the j-th sub-network has last hidden layer width \(m_j\), then \(M=\sum _j m_j\).

  5. Code is available at https://github.com/ShiyunXu/SNAM.git.

  6. The data preprocessing follows https://github.com/propublica/compas-analysis.

  7. The original dataset has 168 features. We remove the column material and all columns with variance less than 5% (a minimal preprocessing sketch follows these notes).

References

  1. Agarwal, R., Frosst, N., Zhang, X., Caruana, R., Hinton, G.E.: Neural additive models: interpretable machine learning with neural nets. arXiv preprint arXiv:2004.13912 (2020)

  2. Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization. In: International Conference on Machine Learning, pp. 242–252. PMLR (2019)

  3. Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine bias. Propublica (2016)

  4. Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R., Wang, R.: On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955 (2019)

  5. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  6. Bogdan, M., Van Den Berg, E., Sabatti, C., Su, W., Candès, E.J.: SLOPE-adaptive variable selection via convex optimization. Ann. Appl. Stat. 9(3), 1103 (2015)

  7. Boyd, S., Xiao, L., Mutapcic, A.: Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2004, 2004–2005 (2003)

  8. Breiman, L., Friedman, J.H.: Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 80(391), 580–598 (1985)

  9. Brzyski, D., Gossmann, A., Su, W., Bogdan, M.: Group SLOPE-adaptive selection of groups of predictors. J. Am. Stat. Assoc. 114(525), 419–433 (2019)

  10. Bu, Z., Klusowski, J., Rush, C., Su, W.J.: Characterizing the SLOPE trade-off: a variational perspective and the Donoho-Tanner limit. arXiv preprint arXiv:2105.13302 (2021)

  11. Bu, Z., Xu, S., Chen, K.: A dynamical view on optimization algorithms of overparameterized neural networks. In: International Conference on Artificial Intelligence and Statistics, pp. 3187–3195. PMLR (2021)

  12. Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20192-9

  13. Chen, Z., Cao, Y., Gu, Q., Zhang, T.: A generalized neural tangent kernel analysis for two-layer neural networks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 13363–13373. Curran Associates, Inc. (2020). www.proceedings.neurips.cc/paper/2020/file/9afe487de556e59e6db6c862adfe25a4-Paper.pdf

  14. Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)

  15. Fang, C., Dong, H., Zhang, T.: Mathematical models of overparameterized neural networks. Proc. IEEE 109(5), 683–703 (2021)

  16. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. SSS, Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7

  17. Ghorbani, B., Mei, S., Misiakiewicz, T., Montanari, A.: Linearized two-layers neural networks in high dimension. Ann. Stat. 49(2), 1029–1054 (2021)

  18. Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models. Routledge, Abingdon (2017)

  19. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572 (2018)

  20. Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. Adv. Neural Inf. Process. Syst. 28, 379–387 (2015)

  21. Lou, Y., Caruana, R., Gehrke, J.: Intelligible models for classification and regression. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150–158 (2012)

  22. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777 (2017)

  23. Meier, L., Van De Geer, S., Bühlmann, P.: The group lasso for logistic regression. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 70(1), 53–71 (2008)

  24. Neal, R.M.: Priors for infinite networks. In: Bayesian Learning for Neural Networks, pp. 29–53. Springer, Heidelberg (1996). https://doi.org/10.1007/978-1-4612-0745-0_2

  25. Nelder, J.A., Wedderburn, R.W.: Generalized linear models. J. Roy. Stat. Soc. Ser. A (General) 135(3), 370–384 (1972)

  26. Nitanda, A.: Stochastic proximal gradient descent with acceleration techniques. Adv. Neural Inf. Process. Syst. 27, 1574–1582 (2014)

  27. Nori, H., Jenkins, S., Koch, P., Caruana, R.: Interpretml: a unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223 (2019)

  28. Pace, R.K., Barry, R.: Sparse spatial autoregressions. Stat. Probab. Lett. 33(3), 291–297 (1997)

  29. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)

  30. Rahimi, A., Recht, B., et al.: Random features for large-scale kernel machines. In: NIPS, vol. 3, p. 5. Citeseer (2007)

  31. Ravikumar, P., Lafferty, J., Liu, H., Wasserman, L.: Sparse additive models. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 71(5), 1009–1030 (2009)

  32. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)

  33. Shapley, L.S.: A value for n-person games. Princeton University Press, Princeton (2016)

  34. Shor, N.Z.: Minimization Methods for Non-Differentiable Functions, vol. 3. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-82118-9

  35. Strumbelj, E., Kononenko, I.: Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst. 41(3), 647–665 (2014)

  36. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. Adv. Neural Inf. Process. Syst. 27, 2510–2518 (2014)

  37. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. Ser. B (Methodol.) 58(1), 267–288 (1996)

  38. Tibshirani, R., Wasserman, L.: Sparsity, the lasso, and friends. Lecture notes from “Statistical Machine Learning,” Carnegie Mellon University, Spring (2017)

  39. Van De Geer, S.A., Bühlmann, P.: On the conditions used to prove oracle results for the lasso. Electron. J. Stat. 3, 1360–1392 (2009)

  40. Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell_1\)-constrained quadratic programming (lasso). IEEE Trans. Inf. Theory 55(5), 2183–2202 (2009)

  41. Wei, C., Lee, J., Liu, Q., Ma, T.: Regularization matters: generalization and optimization of neural nets vs their induced kernel (2019)

  42. Xiao, L., Pennington, J., Schoenholz, S.: Disentangling trainability and generalization in deep neural networks. In: International Conference on Machine Learning, pp. 10462–10472. PMLR (2020)

  43. Yehudai, G., Shamir, O.: On the power and limitations of random features for understanding neural networks. Adv. Neural Inf. Process. Syst. 32, 6598–6608 (2019)

  44. Zhang, Y., Bu, Z.: Efficient designs of slope penalty sequences in finite dimension. In: International Conference on Artificial Intelligence and Statistics, pp. 3277–3285. PMLR (2021)

  45. Zou, D., Cao, Y., Zhou, D., Gu, Q.: Gradient descent optimizes over-parameterized deep relu networks. Mach. Learn. 109(3), 467–492 (2020)

  46. Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)

  47. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)

Acknowledgements

SX is supported through partnership with GSK. PC was supported by grants from the National Science Foundation (IIS-2145164, CCF-2212519) and the Office of Naval Research (N00014-22-1-2255). IB is supported by the National Institute of Mental Health (R01MH116884).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Zhiqi Bu .

Editor information

Editors and Affiliations

Ethics declarations

Ethical Statement

High-stakes applications of deep learning, such as healthcare and criminal justice, raise concerns about algorithmic liability, fairness, and interpretability. Our method can help build fair, trustworthy, and explainable systems by seeking the reasons behind machine learning predictions. A system may base its predictions on discrimination without anyone realizing it (as with the COMPAS algorithm). Examining each feature's contribution to the outcome offers a way to avoid learning with bias. Our method is especially useful for high-dimensional datasets, such as some medical tabular records. Hence, our paper has an important impact on ethical machine learning. Yet we emphasize that interpretable machine learning does not automatically guarantee trustworthiness: a model can still make mistakes and be biased towards certain groups, even though it can explain why it does so.

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, S., Bu, Z., Chaudhari, P., Barnett, I.J. (2023). Sparse Neural Additive Model: Interpretable Deep Learning with Feature Selection via Group Sparsity. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science(), vol 14171. Springer, Cham. https://doi.org/10.1007/978-3-031-43418-1_21

  • DOI: https://doi.org/10.1007/978-3-031-43418-1_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43417-4

  • Online ISBN: 978-3-031-43418-1

  • eBook Packages: Computer Science, Computer Science (R0)
