DOI: 10.1145/3534678.3542675

Automatic Phenotyping by a Seed-guided Topic Model

Published: 14 August 2022

Abstract

Electronic health records (EHRs) provide rich clinical information and an opportunity to extract epidemiological patterns for understanding and predicting patient disease risk with suitable machine learning methods such as topic models. However, existing topic models do not generate identifiable topics that each predict a unique phenotype. One promising direction is to use known phenotype concepts to guide topic inference. We present MixEHR-Seed, a seed-guided Bayesian topic model with three contributions: (1) for each phenotype, we infer a dual-form topic distribution: a seed-topic distribution over a small set of key EHR codes and a regular topic distribution over the entire EHR vocabulary; (2) we model age-dependent disease progression as Markovian dynamic topic priors; and (3) we infer seed-guided multi-modal topics over distinct EHR data types. For inference, we developed a variational inference algorithm. Using MixEHR-Seed, we inferred 1569 PheCode-guided phenotype topics from an EHR database in Quebec, Canada, covering 1.3 million patients with up to 20 years of follow-up and 122 million records spanning 8539 unique diagnostic codes and 1126 unique drug codes. We observed (1) accurate phenotype prediction by the guided topics, (2) clinically relevant PheCode-guided disease topics, and (3) meaningful age-dependent disease prevalence. Source code is available on GitHub: https://github.com/li-lab-mcgill/MixEHR-Seed.
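
The dual-form topic distribution in contribution (1) can be illustrated with a small generative sketch. This is not the authors' implementation: the function names, the mixing probability `pi`, and the Dirichlet hyperparameters `mu` and `beta` are all hypothetical choices made for illustration, and the abstract does not specify how the two distributions are combined. The sketch assumes a simple seeded-mixture scheme in which each code for a phenotype is drawn either from a distribution restricted to that phenotype's seed codes or from a distribution over the whole vocabulary.

```python
import numpy as np


def make_topic(rng, vocab_size, seed_idx, mu=1.0, beta=0.5):
    """Build one phenotype's two distributions (hypothetical helper).

    seed_dist    : nonzero only on the phenotype's seed codes
    regular_dist : covers the entire EHR code vocabulary
    """
    seed_dist = np.zeros(vocab_size)
    seed_dist[seed_idx] = rng.dirichlet(np.full(len(seed_idx), mu))
    regular_dist = rng.dirichlet(np.full(vocab_size, beta))
    return seed_dist, regular_dist


def sample_code(rng, seed_dist, regular_dist, pi):
    """Draw one code index for the phenotype.

    With probability pi (an assumed mixing weight) the code is drawn from
    the seed-topic distribution; otherwise from the regular distribution.
    Returns the code index and whether the seed distribution was used.
    """
    use_seed = bool(rng.random() < pi)
    dist = seed_dist if use_seed else regular_dist
    code = int(rng.choice(len(dist), p=dist))
    return code, use_seed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy vocabulary of 10 codes; codes 2 and 5 act as the seed codes.
    seed_dist, regular_dist = make_topic(rng, vocab_size=10, seed_idx=[2, 5])
    draws = [sample_code(rng, seed_dist, regular_dist, pi=0.7) for _ in range(5)]
    print(draws)
```

Under these assumptions, any code drawn through the seed branch is guaranteed to be one of the phenotype's seed codes, which is what ties a topic to a single known phenotype, while the regular branch lets the topic also explain non-seed codes.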


Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN: 9781450393850
DOI: 10.1145/3534678
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. electronic health records
  2. predictive healthcare
  3. topic modeling
  4. variational autoencoder

Qualifiers

  • Research-article

Conference

KDD '22
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (Last 12 months): 66
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 05 Mar 2025

Cited By

  • (2024) A Data Analytics and Machine Learning Approach to Develop a Technology Roadmap for Next-Generation Logistics Utilizing Underground Systems. Sustainability 16:15 (6696). DOI: 10.3390/su16156696. Online publication date: 5-Aug-2024
  • (2024) MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling. Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (1-8). DOI: 10.1145/3698587.3701368. Online publication date: 22-Nov-2024
  • (2024) TimelyGPT: Extrapolatable Transformer Pre-training for Long-term Time-Series Forecasting in Healthcare. Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (1-10). DOI: 10.1145/3698587.3701364. Online publication date: 22-Nov-2024
  • (2024) MixEHR-SurG. Journal of Biomedical Informatics 153:C. DOI: 10.1016/j.jbi.2024.104638. Online publication date: 17-Jul-2024
  • (2023) PheME: A deep ensemble framework for improving phenotype prediction from multi-modal data. 2023 IEEE 11th International Conference on Healthcare Informatics (ICHI) (268-275). DOI: 10.1109/ICHI57859.2023.00044. Online publication date: 26-Jun-2023
  • (2023) Encouraging Sparsity in Neural Topic Modeling with Non-Mean-Field Inference. Machine Learning and Knowledge Discovery in Databases: Research Track (142-158). DOI: 10.1007/978-3-031-43421-1_9. Online publication date: 18-Sep-2023
  • (2023) Gaussian process regression and classification using International Classification of Disease codes as covariates. Stat 12:1. DOI: 10.1002/sta4.618. Online publication date: 7-Oct-2023
  • (2022) Modeling electronic health record data using an end-to-end knowledge-graph-informed topic model. Scientific Reports 12:1. DOI: 10.1038/s41598-022-22956-w. Online publication date: 25-Oct-2022
