Abstract
We propose two ICD-based topic modeling methods, ICD-1 and ICD-2, which generate topics from the International Classification of Diseases (ICD) codes assigned to the documents. We applied both methods to the Pittsburgh EHR dataset and, for comparison, also ran LDA on the same dataset. We then tested the three topic models on both document retrieval and sentence retrieval, using TF-IDF keyword matching as the baseline for both tasks. We evaluated the results in three ways: precision at ten (P@10), document ranking correlation, and sentence relevance determination (in terms of precision, recall, and F-score), all based on two medical experts' review and annotation of the retrieved documents. In the P@10 evaluation, the ICD-2 method achieved the highest average P@10 value of 0.61. In document ranking correlation, the ICD-1 method achieved the highest Pearson’s correlation coefficient of 0.709. In sentence relevance determination, the ICD-1 method achieved the highest F-score of 0.655. Overall, the ICD-based methods outperformed LDA and TF-IDF in the experiment.
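The three evaluation measures named above have standard textbook definitions, which can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the abstract does not describe the exact protocol, so the function names, the binary relevance encoding, and the toy inputs are all assumptions made for the example.

```python
from math import sqrt


def precision_at_k(relevance, k=10):
    """Fraction of the top-k ranked items judged relevant.

    `relevance` is a ranked list of 0/1 judgments (hypothetical encoding).
    """
    return sum(relevance[:k]) / k


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-score for binary relevance labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy inputs, invented for illustration only (not data from the paper).
toy_ranking = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
print(precision_at_k(toy_ranking))
print(pearson([1, 2, 3], [2, 4, 6]))
print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

Here P@10 divides by k even when fewer than k items are returned, which is the usual convention; a study averaging P@10 over queries would apply `precision_at_k` per query and take the mean.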
Acknowledgments
This work was funded by VA grants CHIR HIR 08-374 and VINCI HIR-08-204.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shao, Y., Morris, R.S., Bray, B.E., Zeng-Treitler, Q. (2022). Topic Modeling Based on ICD Codes for Clinical Documents. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2021. Lecture Notes in Networks and Systems, vol 295. Springer, Cham. https://doi.org/10.1007/978-3-030-82196-8_14
DOI: https://doi.org/10.1007/978-3-030-82196-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82195-1
Online ISBN: 978-3-030-82196-8
eBook Packages: Intelligent Technologies and Robotics (R0)