Topic Modeling Based on ICD Codes for Clinical Documents

Shao, Yijun; Morris, Rebecca S.; Bray, Bruce E.; Zeng-Treitler, Qing

doi:10.1007/978-3-030-82196-8_14

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 295)

Included in the following conference series:

Proceedings of SAI Intelligent Systems Conference

Abstract

We proposed two ICD-based topic modeling methods, ICD-1 and ICD-2, which generate topics from the International Classification of Diseases (ICD) codes assigned to documents. We applied both methods to the Pittsburgh EHR dataset and, for comparison, also ran latent Dirichlet allocation (LDA) on the same dataset. We then used the three topic models for both document retrieval and sentence retrieval. As a baseline, we performed both retrieval tasks with TF-IDF, a keyword-matching method. We evaluated the results in three ways: precision at ten (P@10), document ranking correlation, and sentence relevance determination (measured by precision, recall, and F-score). All three evaluations were based on review and annotation of the retrieved documents by two medical experts. In the P@10 evaluation, the ICD-2 method achieved the highest average P@10, 0.61. In document ranking correlation, the ICD-1 method achieved the highest Pearson’s correlation coefficient, 0.709. In sentence relevance determination, the ICD-1 method achieved the highest F-score, 0.655. Overall, the ICD-based methods outperformed LDA and TF-IDF in the experiment.
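
For readers unfamiliar with the three evaluation measures named in the abstract, the following is a minimal sketch of their textbook definitions in plain Python. It is not the authors' evaluation code: the function names are hypothetical, the toy inputs are invented for illustration and are not data from the study, and the chapter itself may define details (such as how expert judgments are aggregated) differently.

```python
from math import sqrt

def precision_at_k(relevance, k=10):
    """Fraction of the top-k retrieved items judged relevant.

    `relevance` is a ranked list of 0/1 judgments (1 = relevant),
    assumed here to come from a human reviewer.
    """
    return sum(relevance[:k]) / k

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def precision_recall_f(predicted, gold):
    """Precision, recall, and F-score for a set of items predicted relevant
    against a set of items annotated as relevant."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Toy inputs (illustrative only, not from the paper).
judgments = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]   # relevance of a top-10 ranking
p_at_10 = precision_at_k(judgments)

model_scores = [0.9, 0.7, 0.4, 0.2]           # hypothetical model scores
expert_scores = [3, 3, 1, 0]                  # hypothetical expert ratings
r = pearson(model_scores, expert_scores)

p, rec, f = precision_recall_f({1, 2, 3}, {2, 3, 4, 5})
```

In this sketch, document ranking correlation is taken to mean Pearson's coefficient between model scores and expert ratings for the same documents, and sentence relevance is scored by comparing the set of sentences a method marks relevant with the set the annotators marked relevant; both readings are assumptions based on the abstract alone.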

Acknowledgments

This work was funded by VA grants CHIR HIR 08-374 and VINCI HIR-08-204.

Author information

Authors and Affiliations

Biomedical Informatics Center, George Washington University, Washington, DC, USA
Yijun Shao & Qing Zeng-Treitler
Washington DC VA Medical Center, Washington, DC, USA
Yijun Shao & Qing Zeng-Treitler
School of Medicine, University of Utah, Salt Lake City, UT, USA
Rebecca S. Morris
VA Salt Lake City Health Care System, Salt Lake City, UT, USA
Bruce E. Bray
Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
Bruce E. Bray

Corresponding author

Correspondence to Yijun Shao.

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Saga University, Saga, Japan
Kohei Arai

About this paper

Cite this paper

Shao, Y., Morris, R.S., Bray, B.E., Zeng-Treitler, Q. (2022). Topic Modeling Based on ICD Codes for Clinical Documents. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2021. Lecture Notes in Networks and Systems, vol 295. Springer, Cham. https://doi.org/10.1007/978-3-030-82196-8_14

DOI: https://doi.org/10.1007/978-3-030-82196-8_14
Published: 03 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82195-1
Online ISBN: 978-3-030-82196-8
eBook Packages: Intelligent Technologies and Robotics (R0)
