Encoding High-Dimensional Procedure Codes for Healthcare Fraud Detection

Johnson, Justin M.; Khoshgoftaar, Taghi M.

doi:10.1007/s42979-022-01252-4

Original Research
Published: 05 July 2022

Volume 3, article number 362 (2022)

SN Computer Science

Abstract

Machine learning applications for healthcare are reshaping the industry with new tools and services designed to improve the quality of patient care. A challenge common to many of these applications is encoding healthcare procedure codes, a high-cardinality categorical variable containing thousands of unique values. Traditional one-hot encoding techniques produce sparse binary vectors that drastically increase the size of data sets. Aggregation methods compress data to lower dimensions using summary statistics but risk withholding valuable information from predictive models. We compare these encoding techniques for healthcare procedure codes using two Medicare fraud classification data sets and five popular machine learning algorithms to determine how the inclusion of procedure codes affects classification performance. Next, LightGBM’s and CatBoost’s built-in methods for categorical feature handling are compared to Hcpcs2Vec embeddings, which are distributed representations of procedures that encode semantic similarities. Statistical tests show that the inclusion of the procedure code feature significantly improves performance when a one-hot representation is not used. The Hcpcs2Vec and LightGBM encoding techniques consistently perform best and second best, respectively, and outperform the one-hot and aggregate encoding methods. Feature importance measures and embedding visualizations show that the Hcpcs2Vec encodings capture key semantic qualities of procedure codes and increase the importance of the procedure code variable by an order of magnitude. These qualities make the Hcpcs2Vec procedure code embeddings appropriate for future work in healthcare.
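
The contrast the abstract draws between one-hot and aggregate encodings can be illustrated with a small sketch. This is not the authors' pipeline: the toy claims, the provider IDs, the function names, and the choice of summary statistics are all hypothetical, chosen only to show how one-hot vectors grow with the number of distinct codes while aggregation keeps a fixed number of columns per provider.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy claims: (provider_id, procedure_code, billed_amount).
claims = [
    ("P1", "99213", 80.0),
    ("P1", "99214", 120.0),
    ("P2", "99213", 75.0),
    ("P2", "G0008", 25.0),
    ("P2", "99213", 85.0),
]


def one_hot_encode(records):
    """Return the sorted code vocabulary and one binary vector per record."""
    vocab = sorted({code for _, code, _ in records})
    index = {code: i for i, code in enumerate(vocab)}
    vectors = []
    for _, code, _ in records:
        row = [0] * len(vocab)  # vector width equals the number of distinct codes
        row[index[code]] = 1
        vectors.append(row)
    return vocab, vectors


def aggregate_by_provider(records):
    """Collapse each provider's claims into a fixed set of summary statistics."""
    amounts = defaultdict(list)
    codes = defaultdict(set)
    for provider, code, amount in records:
        amounts[provider].append(amount)
        codes[provider].add(code)
    return {
        provider: {
            "n_claims": len(vals),
            "n_distinct_codes": len(codes[provider]),
            "mean_amount": mean(vals),
            "max_amount": max(vals),
        }
        for provider, vals in amounts.items()
    }


vocab, vectors = one_hot_encode(claims)
summary = aggregate_by_provider(claims)
```

With thousands of distinct procedure codes, each one-hot row in this sketch would have thousands of columns, almost all zero. The aggregate view stays at four columns per provider, but it no longer records which procedures were billed, which is the information loss the abstract warns about.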

Funding

Not applicable

Author information

Authors and Affiliations

Florida Atlantic University, Boca Raton, FL, USA
Justin M. Johnson & Taghi M. Khoshgoftaar

Contributions

JMJ performed the literature review, executed the experiment design, and drafted the manuscript. TMK worked with JMJ to develop the article’s framework and focus. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Justin M. Johnson.

Ethics declarations

Conflict of interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Innovative AI in Medical Applications” guest edited by Lydia Bouzar-Benlabiod, Stuart H. Rubin and Edwige Pissaloux.

About this article

Cite this article

Johnson, J.M., Khoshgoftaar, T.M. Encoding High-Dimensional Procedure Codes for Healthcare Fraud Detection. SN Comput. Sci. 3, 362 (2022). https://doi.org/10.1007/s42979-022-01252-4

Received: 17 December 2021
Accepted: 15 June 2022
Published: 05 July 2022
DOI: https://doi.org/10.1007/s42979-022-01252-4
