Abstract
Zero-inflated count data models have long been an important research topic because they arise in many different disciplines. Because early classical statistical models, which rely on a linear or logarithmic transformation of the mean, often fit real data poorly, an enhanced hurdle model based on machine learning methods is proposed. Decision tree, random forest, support vector machine, and XGBoost methods are introduced in the two stages of the hurdle model. This framework captures the decision-making behavior and predicts the count more flexibly and accurately. The generalized hurdle model incorporates traditional discrete distributions, so it can fit under-dispersed, equi-dispersed, or over-dispersed count data. The extended hurdle models are fitted to health care data, and their performance is compared with that of traditional count models. The results show that the generalized hurdle model with random forest performs best. Variable importance, break-down plots, and partial plots provide better interpretability for the extended model, which makes the results more reliable and transparent. To the best of our knowledge, this is the first study to generalize the hurdle model with interpretable machine learning methods for count data.
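The two-part structure that the abstract refers to can be illustrated with a minimal sketch. It assumes the textbook hurdle form: a fixed probability of a zero count, and a zero-truncated Poisson distribution for the positive counts. This is not the model proposed in the article, where both stages would be driven by covariates through machine learning methods. The function names and the parameters `p_zero` and `lam` are hypothetical and chosen only for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Ordinary Poisson probability P(K = k) for rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def hurdle_poisson_pmf(y: int, p_zero: float, lam: float) -> float:
    """Hurdle probability P(Y = y).

    Stage 1 (the hurdle): Y = 0 with probability p_zero.
    Stage 2 (the count): given Y > 0, Y follows a zero-truncated Poisson,
    i.e. the Poisson pmf renormalised over y = 1, 2, ...
    Both p_zero and lam are illustrative constants (an assumption here).
    """
    if y == 0:
        return p_zero
    truncated = poisson_pmf(y, lam) / (1.0 - math.exp(-lam))
    return (1.0 - p_zero) * truncated


def hurdle_poisson_mean(p_zero: float, lam: float) -> float:
    """Closed-form mean: P(Y > 0) times the zero-truncated Poisson mean."""
    return (1.0 - p_zero) * lam / (1.0 - math.exp(-lam))
```

Because the zero probability is modelled separately from the positive counts, the share of zeros can be set independently of `lam`, which is what lets a hurdle model accommodate more (or fewer) zeros than a plain Poisson model would imply.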
Acknowledgements
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by XX, TY, JG and DC. The first draft of the manuscript was written by XX and DC, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding
Xin Xu acknowledges financial support from the National Social Science Foundation of China (22BTJ016).
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Xu, X., Ye, T., Gao, J. et al. Generalized hurdle count data models based on interpretable machine learning with an application to health care demand. Computing 106, 295–325 (2024). https://doi.org/10.1007/s00607-023-01224-3
DOI: https://doi.org/10.1007/s00607-023-01224-3