Generalized hurdle count data models based on interpretable machine learning with an application to health care demand

  • Regular Paper
  • Journal: Computing

Abstract

Zero-inflated count data models have long been an important research topic owing to their use across a wide range of disciplines. Because early classical statistical models, which rely on linear or logarithmic transformations of the mean, are often difficult to reconcile with real data, an enhanced hurdle model based on machine learning methods is proposed. Decision tree, random forest, support vector machine, and XGBoost methods are introduced in the two stages of the hurdle model. This framework captures the decision-making behavior and predicts the count more flexibly and accurately. The generalized hurdle model builds on traditional discrete distributions and can fit under-dispersed, equi-dispersed, or over-dispersed count data. The extended hurdle models are fitted to health care data, and their performance is compared with that of traditional count models. The results show that the generalized hurdle model with random forest performs best. Variable importance, break-down plots, and partial plots provide better interpretability for the extended model, which makes the results more reliable and transparent. To the best of our knowledge, this is the first study to generalize the hurdle model with interpretable machine learning methods for count data.
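To make the two-stage structure concrete, the following is a minimal sketch of a classical hurdle model, assuming a zero-truncated Poisson for the count stage. It is not the authors' implementation: the function names and the values of `p_pos` and `lam` are hypothetical, and in the framework described above both quantities would instead be produced by fitted learners (for example a random forest) and by other discrete distributions.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability mass P(Y = k) with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def hurdle_poisson_pmf(k: int, p_pos: float, lam: float) -> float:
    """P(Y = k) under a hurdle model with a zero-truncated Poisson count part.

    p_pos: stage-one probability of crossing the hurdle, P(Y > 0).
    lam:   rate of the untruncated Poisson used in stage two.
    """
    if k == 0:
        return 1.0 - p_pos
    # Positive counts follow the Poisson renormalized to exclude zero.
    return p_pos * poisson_pmf(k, lam) / (1.0 - poisson_pmf(0, lam))


def hurdle_poisson_mean(p_pos: float, lam: float) -> float:
    """E[Y] = P(Y > 0) * E[Y | Y > 0]."""
    return p_pos * lam / (1.0 - math.exp(-lam))


# Hypothetical values for illustration only (assumed, not from the paper).
p_pos, lam = 0.3, 2.0
probs = [hurdle_poisson_pmf(k, p_pos, lam) for k in range(60)]
total = sum(probs)                                  # probabilities should sum to ~1
mean = sum(k * pk for k, pk in enumerate(probs))    # numerical mean over the truncated support
```

Separating the zero decision from the positive count is what lets each stage be swapped for a more flexible model without changing the other; the closed-form mean gives a quick check against the summed probabilities.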



Acknowledgements

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by XX, TY, JG and DC. The first draft of the manuscript was written by XX and DC, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding

Xin Xu acknowledges financial support from the National Social Science Foundation of China (22BTJ016).

Author information

Corresponding author

Correspondence to Dongxiao Chu.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xu, X., Ye, T., Gao, J. et al. Generalized hurdle count data models based on interpretable machine learning with an application to health care demand. Computing 106, 295–325 (2024). https://doi.org/10.1007/s00607-023-01224-3

  • DOI: https://doi.org/10.1007/s00607-023-01224-3
