Retrieval-Based Gradient Boosting Decision Trees for Disease Risk Assessment

Authors:

Shaodian Zhang,

Yong Yu

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 3468 - 3476

https://doi.org/10.1145/3534678.3539052

Published: 14 August 2022

Abstract

In recent years, machine learning methods have been widely adopted in electronic health record (EHR) systems and have predicted disease risk more accurately than traditional methods. However, most existing machine learning methods assess risk solely from the features of the target case and ignore cross-sample feature interactions between the target case and other similar cases. This is inconsistent with the general practice of evidence-based medicine, in which diagnoses draw on existing clinical experience. Moreover, current methods that do mine cross-sample information rely on deep neural networks to extract cross-sample feature interactions, and these networks suffer from data insufficiency, data heterogeneity, and a lack of interpretability in disease risk assessment tasks. In this work, we propose a novel retrieval-based gradient boosting decision trees (RB-GBDT) model with a cross-sample extractor that mines cross-sample information while retaining the robustness, generalization, and interpretability of GBDT. Experiments on real-world clinical datasets show the superiority and efficacy of RB-GBDT on disease risk assessment tasks. The developed software has been deployed in a hospital as an auxiliary diagnosis tool for risk assessment of venous thromboembolism.
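The abstract does not describe how the cross-sample extractor works, so the following is only a minimal sketch of the general retrieve-then-boost idea, not the authors' RB-GBDT. It assumes Euclidean nearest-neighbor retrieval, two hypothetical cross-sample features (the positive-label rate and mean distance of the retrieved neighbors), and a toy squared-error gradient boosting loop over depth-1 trees. All function names and the synthetic data are illustrative assumptions.

```python
import math
import random

def retrieve(query, pool, pool_labels, k, skip=None):
    """Return the k nearest (distance, label) pairs from the pool (Euclidean)."""
    scored = [(math.dist(query, x), y)
              for i, (x, y) in enumerate(zip(pool, pool_labels)) if i != skip]
    return sorted(scored)[:k]

def augment(query, pool, pool_labels, k, skip=None):
    """Append hypothetical cross-sample features to the target case's own features."""
    nbrs = retrieve(query, pool, pool_labels, k, skip)
    rate = sum(y for _, y in nbrs) / len(nbrs)    # share of positive neighbors
    mean_d = sum(d for d, _ in nbrs) / len(nbrs)  # how close the neighbors are
    return list(query) + [rate, mean_d]

def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None

def fit_gbdt(X, y, rounds=20, lr=0.3):
    """Gradient boosting with depth-1 trees under squared-error loss."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # negative gradient
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, t, lm, rm = stump
        stumps.append(stump)
        preds = [p + lr * (lm if row[j] <= t else rm) for row, p in zip(X, preds)]
    return base, lr, stumps

def predict(model, row):
    """Sum the base score and the shrunken contribution of every stump."""
    base, lr, stumps = model
    return base + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)

# Synthetic toy data (an assumption, not clinical data): two Gaussian clusters.
random.seed(0)
train_X = [[random.gauss(c, 1.0), random.gauss(c, 1.0)] for c in (0, 3) for _ in range(30)]
train_y = [0] * 30 + [1] * 30

# Leave-one-out retrieval on the training set so a case never retrieves itself.
aug_X = [augment(x, train_X, train_y, k=5, skip=i) for i, x in enumerate(train_X)]
model = fit_gbdt(aug_X, train_y)

risk_high = predict(model, augment([3.0, 3.0], train_X, train_y, k=5))
risk_low = predict(model, augment([0.0, 0.0], train_X, train_y, k=5))
```

Skipping the sample's own index during training-time retrieval is a design choice in this sketch: without it, each training case would retrieve itself at distance zero and leak its own label into the neighbor-rate feature.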

Cited By

Cai Q, Zheng K, Jagadish H, Ooi B, Yip J. (2024). CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics. Proceedings of the VLDB Endowment 17:10 (2487-2500). DOI: 10.14778/3675034.3675041. Online publication date: 6-Aug-2024.
https://dl.acm.org/doi/10.14778/3675034.3675041
Chen M, Zhao H, Zhao Y, Fan H, Gao H, Yu Y, Tian Z. (2023). ROMO: Retrieval-enhanced Offline Model-based Optimization. Proceedings of the Fifth International Conference on Distributed Artificial Intelligence (1-9). DOI: 10.1145/3627676.3627685. Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1145/3627676.3627685
Cao H, Jiang H, Yang K, Chen S, Wu W, Liu J, Dustdar S. (2023). Data-Augmentation-Enabled Continuous User Authentication via Passive Vibration Response. IEEE Internet of Things Journal 10:16 (14137-14151). DOI: 10.1109/JIOT.2023.3264274. Online publication date: 15-Aug-2023.
https://doi.org/10.1109/JIOT.2023.3264274

Index Terms

Retrieval-Based Gradient Boosting Decision Trees for Disease Risk Assessment
1. Applied computing
  1. Life and medical sciences
    1. Health care information systems
2. Theory of computation
  1. Computational complexity and cryptography
    1. Oracles and decision trees

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

5033 pages

ISBN:9781450393850

DOI:10.1145/3534678

General Chairs: Aidong Zhang (University of Virginia), Huzefa Rangwala (Amazon/George Mason University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Shanghai Municipal Science and Technology Major Project

Conference

KDD '22

KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 14 - 18, 2022

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics

Total Citations: 3
Total Downloads: 460
Downloads (Last 12 months): 81
Downloads (Last 6 weeks): 9

Reflects downloads up to 01 Mar 2025
