research-article

Automatic Phenotyping by a Seed-guided Topic Model

Authors:

David L. Buckeridge,

Yue Li

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 4713 - 4723

https://doi.org/10.1145/3534678.3542675

Published: 14 August 2022

Abstract

Electronic health records (EHRs) provide rich clinical information and an opportunity to extract epidemiological patterns that help understand and predict patient disease risk, given suitable machine learning methods such as topic models. However, existing topic models do not generate identifiable topics, each of which predicts a unique phenotype. One promising direction is to use known phenotype concepts to guide topic inference. We present a seed-guided Bayesian topic model called MixEHR-Seed with three contributions: (1) for each phenotype, we infer a dual-form topic distribution, consisting of a seed-topic distribution over a small set of key EHR codes and a regular topic distribution over the entire EHR vocabulary; (2) we model age-dependent disease progression as Markovian dynamic topic priors; and (3) we infer seed-guided multi-modal topics over distinct EHR data types. For inference, we developed a variational inference algorithm. Using MixEHR-Seed, we inferred 1569 PheCode-guided phenotype topics from an EHR database in Quebec, Canada, covering 1.3 million patients with up to 20 years of follow-up and 122 million records spanning 8539 unique diagnostic codes and 1126 unique drug codes. We observed (1) accurate phenotype prediction by the guided topics, (2) clinically relevant PheCode-guided disease topics, and (3) meaningful age-dependent disease prevalence. Source code is available on GitHub: https://github.com/li-lab-mcgill/MixEHR-Seed.
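
The dual-form topic distribution in contribution (1) can be illustrated with a toy sketch. This is not the MixEHR-Seed implementation, which uses variational inference and is available in the linked repository. The sketch only shows one way a seed distribution (supported on a few key codes) and a regular distribution (over the whole vocabulary) might be combined for a single phenotype. The function names, the mixing weight `pi`, and the code labels such as `dx_diabetes` are all hypothetical and chosen purely for illustration.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {code: w / total for code, w in weights.items()}


def dual_form_topic(seed_weights, regular_weights, pi):
    """Mix a seed distribution with a regular distribution for one phenotype.

    seed_weights: weights over the phenotype's seed codes only (assumed input).
    regular_weights: weights over the entire vocabulary (assumed input).
    pi: hypothetical probability of drawing a code from the seed distribution.
    """
    if not set(seed_weights) <= set(regular_weights):
        raise ValueError("seed codes must belong to the vocabulary")
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    seed = normalize(seed_weights)
    regular = normalize(regular_weights)
    # Codes outside the seed set receive mass from the regular part only.
    return {
        code: pi * seed.get(code, 0.0) + (1.0 - pi) * p
        for code, p in regular.items()
    }


# Toy vocabulary of made-up diagnostic (dx_) and drug (rx_) code labels.
regular_weights = {
    "dx_diabetes": 1.0,
    "dx_hypertension": 1.0,
    "rx_metformin": 1.0,
    "rx_ramipril": 1.0,
}
seed_weights = {"dx_diabetes": 1.0}  # single seed code for this phenotype

topic = dual_form_topic(seed_weights, regular_weights, pi=0.5)
for code, prob in topic.items():
    print(f"{code}: {prob:.3f}")
```

With these toy inputs the seed code receives probability 0.625 and each remaining code 0.125, so the topic stays anchored to its phenotype while still placing mass on the rest of the vocabulary.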

Index Terms

Automatic Phenotyping by a Seed-guided Topic Model
1. Applied computing
  1. Life and medical sciences
    1. Health informatics
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Topic modeling
    2. Machine learning approaches
      1. Factorization methods
        Latent Dirichlet allocation

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

5033 pages

ISBN:9781450393850

DOI:10.1145/3534678

General Chairs: Aidong Zhang (University of Virginia), Huzefa Rangwala (Amazon/George Mason University)

Copyright © 2022 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '22

KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 14 - 18, 2022

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
