Generalized hurdle count data models based on interpretable machine learning with an application to health care demand

Xu, Xin; Ye, Tao; Gao, Jieying; Chu, Dongxiao

doi:10.1007/s00607-023-01224-3

Regular Paper
Published: 18 September 2023

Computing, volume 106, pages 295–325 (2024)

Xin Xu¹, Tao Ye², Jieying Gao¹ & Dongxiao Chu¹ (ORCID: orcid.org/0000-0002-1212-7653)


Abstract

Zero-inflated count data models have long been an important research topic because such data arise in a wide range of disciplines. Because early classical statistical models that rely on linear and logarithmic mean transformations are often inconsistent with reality, an enhanced hurdle model based on machine learning methods is proposed. Decision tree, random forest, support vector machine, and XGBoost methods are introduced in the two stages of the hurdle model. This framework makes it possible to capture decision-making behavior and to predict counts more flexibly and accurately. The generalized hurdle model is built on traditional discrete distributions and can fit under-dispersed, equi-dispersed, or over-dispersed count data. The extended hurdle models are fitted to health care data, and their performance is compared with that of traditional count models. The results show that the generalized hurdle model with random forest performs best. Variable importance, break-down plots, and partial plots provide better interpretability for the extended model, making the results more reliable and transparent. To the best of our knowledge, this is the first study to generalize the hurdle model with interpretable machine learning methods for count data.
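As a rough illustration of the two-stage structure described in the abstract, the sketch below writes out the textbook hurdle Poisson distribution: stage one decides whether the count is positive, and stage two draws the positive count from a zero-truncated Poisson. This is the standard parametric hurdle form, not necessarily the specification used in the article. The function names and the values of `p_pos` and `lam` are assumptions chosen for illustration; in the framework the abstract describes, such quantities would be produced by machine learning models fitted to covariates rather than fixed by hand.

```python
import math


def hurdle_poisson_pmf(k, p_pos, lam):
    """P(Y = k) under a hurdle Poisson model (illustrative, not the paper's code).

    p_pos: stage-one probability of crossing the hurdle, P(Y > 0).
    lam:   rate of the Poisson that is truncated at zero in stage two.
    """
    if k == 0:
        return 1.0 - p_pos
    poisson_k = math.exp(-lam) * lam ** k / math.factorial(k)
    # Rescale so the positive part sums to p_pos (zero-truncated Poisson).
    return p_pos * poisson_k / (1.0 - math.exp(-lam))


def hurdle_poisson_mean(p_pos, lam):
    # E[Y] = P(Y > 0) * E[Y | Y > 0]
    return p_pos * lam / (1.0 - math.exp(-lam))


def hurdle_poisson_variance(p_pos, lam):
    # E[Y^2] = P(Y > 0) * E[Y^2 | Y > 0], with
    # E[Y^2 | Y > 0] = (lam + lam^2) / (1 - exp(-lam)).
    second_moment = p_pos * (lam + lam ** 2) / (1.0 - math.exp(-lam))
    return second_moment - hurdle_poisson_mean(p_pos, lam) ** 2


def dispersion_label(p_pos, lam, tol=1e-9):
    """Compare variance with mean to classify the dispersion."""
    mean = hurdle_poisson_mean(p_pos, lam)
    var = hurdle_poisson_variance(p_pos, lam)
    if var > mean + tol:
        return "over-dispersed"
    if var < mean - tol:
        return "under-dispersed"
    return "equi-dispersed"


if __name__ == "__main__":
    # Hypothetical parameter values: many zeros versus almost no zeros.
    for p_pos in (0.3, 0.99):
        print(p_pos, dispersion_label(p_pos, lam=2.5))
```

With the same truncated Poisson rate, a low hurdle-crossing probability gives a variance above the mean and a high one gives a variance below it, which is one way to see why a hurdle structure can accommodate different dispersion regimes.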



Acknowledgements

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by XX, TY, JG and DC. The first draft of the manuscript was written by XX and DC, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding

Xin Xu acknowledges financial support from the National Social Science Foundation of China (22BTJ016).

Author information

Tao Ye, Jieying Gao and Dongxiao Chu have contributed equally to this work.

Authors and Affiliations

School of Finance, Capital University of Economics and Business, Beijing, 100070, China
Xin Xu, Jieying Gao & Dongxiao Chu
School of Banking and Finance, University of International Business and Economics, Beijing, 100029, China
Tao Ye

Corresponding author

Correspondence to Dongxiao Chu.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xu, X., Ye, T., Gao, J. et al. Generalized hurdle count data models based on interpretable machine learning with an application to health care demand. Computing 106, 295–325 (2024). https://doi.org/10.1007/s00607-023-01224-3

Received: 18 July 2022
Accepted: 04 September 2023
Published: 18 September 2023
Issue Date: February 2024
DOI: https://doi.org/10.1007/s00607-023-01224-3

Mathematics Subject Classification

62P10
