DOI: 10.1145/3534678.3539052
research-article

Retrieval-Based Gradient Boosting Decision Trees for Disease Risk Assessment

Published: 14 August 2022 Publication History

Abstract

In recent years, machine learning methods have been widely used in modern electronic health record (EHR) systems and have shown more accurate prediction performance on disease risk assessment tasks than traditional methods. However, most existing machine learning methods base the assessment solely on the features of the target case and ignore the cross-sample feature interactions between the target case and other similar cases. This is inconsistent with the general practice in evidence-based medicine of making diagnoses based on existing clinical experience. Moreover, current methods that do mine cross-sample information rely on deep neural networks to extract cross-sample feature interactions, and these networks suffer from data insufficiency, data heterogeneity, and a lack of interpretability in disease risk assessment tasks. In this work, we propose a novel retrieval-based gradient boosting decision trees (RB-GBDT) model with a cross-sample extractor that mines cross-sample information while exploiting the robustness, generalization ability, and interpretability of GBDT. Experiments on real-world clinical datasets show the superiority and efficacy of RB-GBDT on disease risk assessment tasks. The developed software has been deployed in a hospital as an auxiliary diagnosis tool for risk assessment of venous thromboembolism.
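
As a rough illustration of the general idea of combining retrieval with tree-based models, the sketch below retrieves the most similar stored cases for a target case and appends simple cross-sample statistics to its feature vector. It is not the RB-GBDT cross-sample extractor, whose design the abstract does not specify. The function names `retrieve_neighbors` and `augment`, the Euclidean distance, and the two aggregate features (mean neighbor label and mean neighbor distance) are all assumptions made for this example.

```python
import math

def retrieve_neighbors(query, pool, k):
    """Return (distance, label) for the k pool cases closest to `query`.

    Euclidean distance is an assumption; any similarity measure could be used.
    """
    scored = [(math.dist(query, features), label) for features, label in pool]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]

def augment(query, pool, k=2):
    """Append hypothetical cross-sample features to the target case's own features."""
    neighbors = retrieve_neighbors(query, pool, k)
    mean_label = sum(label for _, label in neighbors) / len(neighbors)
    mean_dist = sum(dist for dist, _ in neighbors) / len(neighbors)
    return list(query) + [mean_label, mean_dist]

# Toy pool of previously seen cases: (feature vector, binary outcome).
pool = [([0.0, 0.0], 0), ([1.0, 0.0], 0), ([5.0, 5.0], 1), ([6.0, 5.0], 1)]
augmented = augment([5.0, 4.0], pool, k=2)
```

The augmented vector could then be passed to any gradient boosting implementation in place of the raw features. When building such features for training cases, each case should be excluded from its own retrieval pool, otherwise its label leaks into its features.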

Cited By

  • (2024) CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics. Proceedings of the VLDB Endowment 17:10 (2487-2500). DOI: 10.14778/3675034.3675041. Online publication date: 6-Aug-2024
  • (2023) ROMO: Retrieval-enhanced Offline Model-based Optimization. Proceedings of the Fifth International Conference on Distributed Artificial Intelligence (1-9). DOI: 10.1145/3627676.3627685. Online publication date: 30-Nov-2023
  • (2023) Data-Augmentation-Enabled Continuous User Authentication via Passive Vibration Response. IEEE Internet of Things Journal 10:16 (14137-14151). DOI: 10.1109/JIOT.2023.3264274. Online publication date: 15-Aug-2023

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN:9781450393850
DOI:10.1145/3534678
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. disease risk assessment
  2. gradient boosting trees
  3. health informatics
  4. information retrieval

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
