
Determination of Disease from Discharge Summaries

A Text Mining Approach

  • Article
The Review of Socionetwork Strategies

Abstract

Determining whether discharge summaries carry the correct disease codes is important for hospital management, because submitting medical receipts with incorrect disease codes can result in loss of insurance reimbursement. Medical information managers in large hospitals must evaluate more than 1000 summaries per month, so automated checking of discharge summaries would reduce their workload and allow them to focus on complicated cases. This paper proposes a method for constructing classifiers of discharge summaries. First, morphological analysis is applied to text data extracted from the hospital information system to generate a term matrix. Next, important keywords are selected by correspondence analysis, training examples are generated, and machine learning methods are applied to them. Several machine learning methods were compared using discharge summaries stored in the information system of Shimane University Hospital. Random forest was the best classifier when compared with deep learning, SVM, and decision tree methods, achieving a classification accuracy greater than 90%.
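The first step of the pipeline, turning free text into a term matrix, can be illustrated with a minimal sketch. This is not the authors' implementation: the `tokenize` function is a hypothetical stand-in that splits on whitespace, whereas the paper applies morphological analysis to Japanese text, and the three example summaries are invented for illustration only.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for morphological analysis:
    # lowercase the text and split it on whitespace.
    return text.lower().split()


def build_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct term, sorted for a stable column order.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Invented example summaries (assumption, not data from the paper).
summaries = [
    "chest pain myocardial infarction stent",
    "cough fever pneumonia antibiotics",
    "chest pain pneumonia fever",
]
vocabulary, matrix = build_term_matrix(summaries)
```

Each row of `matrix` describes one summary and each column one term. In the approach described in the abstract, a matrix of this kind is the input to keyword selection and, after that, to the classifiers.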


Notes

  1. Outpatient clinics utilize action-based payment systems, even in large hospitals.

  2. The method can also generate \(p\)-dimensional coordinates (\(p \ge 3\)). However, higher-dimensional coordinates did not perform better than the coordinates used in the experiments shown below.

  3. The darch package has been removed from the R package repository (CRAN); see its GitHub repository: https://github.com/maddin79/darch.

  4. Two-fold cross-validation was selected because it yielded the lowest estimates of performance measures, such as accuracy, while also minimizing the estimated bias.

  5. DPC codes form a three-level hierarchical system, with each DPC code defined as a tree. The first level denotes the type of disease, the second level denotes the primary treatment selected for that patient, and the third level denotes any additional therapy. Thus, in the tables, shared characteristics of codes reflect similarities between them.
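The three-level hierarchy described in note 5 can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the `DpcCode` type, the `shared_levels` helper, and the level labels such as "D01" are all hypothetical and do not reproduce the real DPC code format.

```python
from typing import NamedTuple


class DpcCode(NamedTuple):
    disease: str     # first level: type of disease
    treatment: str   # second level: primary treatment
    additional: str  # third level: additional therapy


def shared_levels(a: DpcCode, b: DpcCode) -> int:
    """Count how many leading levels of the hierarchy two codes share."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break  # a mismatch at a higher level ends the shared path
        depth += 1
    return depth


# Hypothetical codes, invented for illustration only.
a = DpcCode("D01", "T1", "A0")
b = DpcCode("D01", "T1", "A2")
c = DpcCode("D01", "T3", "A0")
d = DpcCode("D07", "T1", "A0")
```

Under this sketch, `shared_levels(a, b)` is 2 because the two codes differ only in additional therapy, while `shared_levels(a, d)` is 0 because the disease type already differs. Treating a longer shared path as greater similarity is one simple reading of the hierarchy, not a measure taken from the paper.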


Acknowledgements

This research was supported by a Grant-in-Aid for Scientific Research (B) 18H03289 from the Japan Society for the Promotion of Science (JSPS).

Author information

Corresponding author

Correspondence to Shusaku Tsumoto.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there are no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tsumoto, S., Kimura, T. & Hirano, S. Determination of Disease from Discharge Summaries. Rev Socionetwork Strat 15, 49–66 (2021). https://doi.org/10.1007/s12626-021-00076-7

  • DOI: https://doi.org/10.1007/s12626-021-00076-7
