Abstract
Evidence-based medicine often requires identifying patients with similar conditions, which are typically captured in ICD (International Classification of Diseases (World Health Organization 2013)) code sequences. In the absence of satisfactory prior solutions for matching ICD-10 code sequences, this paper presents a method that effectively captures the clinical similarity among routine patients who have multiple comorbidities and complex care needs. Our method leverages recent progress in representation learning of individual ICD-10 codes, and it explicitly uses the sequential order of codes for matching. Empirical evaluation on a state-wide cancer data collection shows that our proposed method achieves significantly higher matching performance than state-of-the-art methods that ignore the sequential order. Our method better identifies similar patients with respect to a number of clinical outcomes, including readmission and mortality outlook. Although this paper focuses on ICD-10 diagnosis code sequences, our method can be adapted to work with other codified sequence data.
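To illustrate why the sequential order of codes can matter, the following minimal sketch contrasts an order-free similarity with an order-aware one. It is not the WVM method itself: the two-dimensional embedding values, the position-by-position alignment, and the function names `order_free_similarity` and `sequential_similarity` are all assumptions made purely for illustration.

```python
import math

# Hypothetical 2-D embeddings for a few ICD-10 codes. The values are made up
# and merely stand in for vectors learned by a representation learning model.
EMBEDDINGS = {
    "C50": (1.0, 0.0),  # malignant neoplasm of breast
    "I10": (0.0, 1.0),  # essential hypertension
    "E11": (0.6, 0.8),  # type 2 diabetes mellitus
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def order_free_similarity(seq_a, seq_b):
    """Cosine similarity of the averaged code vectors; ignores code order.

    Assumes both sequences are non-empty.
    """
    def mean_vec(seq):
        vecs = [EMBEDDINGS[code] for code in seq]
        return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))

    return cosine(mean_vec(seq_a), mean_vec(seq_b))


def sequential_similarity(seq_a, seq_b):
    """Mean cosine similarity of codes at the same position; order-aware.

    Assumes both sequences are non-empty. Positions present in only one
    sequence contribute zero, so a length mismatch lowers the score.
    """
    longest = max(len(seq_a), len(seq_b))
    total = sum(
        cosine(EMBEDDINGS[a], EMBEDDINGS[b]) for a, b in zip(seq_a, seq_b)
    )
    return total / longest


if __name__ == "__main__":
    patient_a = ["C50", "I10"]
    patient_b = ["I10", "C50"]  # same codes, reversed order
    print(order_free_similarity(patient_a, patient_b))
    print(sequential_similarity(patient_a, patient_b))
```

Under these toy assumptions, two patients with the same codes in reversed order are indistinguishable to the order-free measure, whereas the order-aware measure separates them.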







Notes
The R code for WVM and the other methods is available at https://github.com/nphdang/WVM
Ethics approval was obtained from the NSW Population and Health Services Research Ethics Committee (AU RED Reference: HREC/15/CIPHS/1)
We exclude the time needed to learn ICD code vectors from the runtimes of CSM and our WVM because it is negligible, taking only 106 seconds in our experiment.
References
World Health Organization: International Classification of Diseases (ICD). http://www.who.int/classifications/icd/en/, 2013
World Health Organization: International statistical classification of diseases and related health problems 10th revision. [Online]. Available: http://apps.who.int/classifications/icd10/browse/2010/en, 2010
Australian Consortium for Classification Development: ICD-10-AM. [Online]. Available: https://www.accd.net.au/Icd10.aspx, 2017
O’Malley, K., Cook, K., Price, M., Wildes, K. R., Hurdle, J., and Ashton, C., Measuring diagnoses: ICD code accuracy. Health Serv. Res. 40:1620–1639, 2005.
Wang, F., Hu, J., and Sun, J.: Medical prognosis based on patient similarity and expert feedback. In: The 21st International Conference on Pattern Recognition, pp. 1799–1802, IEEE, 2012.
Choi, E., Schuetz, A., Stewart, W. F., and Sun, J.: Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv:1602.03686, 2016
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119, 2013.
Lee, J., Maslove, D. M., and Dubin, J., Personalized mortality prediction driven by electronic medical data and a patient similarity metric. PLoS One 10(5):e0127428, 2015.
Carnaby-Mann, G., and Crary, M., McNeill dysphagia therapy program: a case-control study. Arch. Phys. Med. Rehabil. 91(5):743–749, 2010.
Hielscher, T., Spiliopoulou, M., Völzke, H., and Kühn, J.-P.: Using participant similarity for the classification of epidemiological data on hepatic steatosis. In: The 27th International Symposium on Computer-Based Medical Systems, pp. 1–7, IEEE, 2014.
Le, Q., and Mikolov, T.: Distributed representations of sentences and documents. In: ICML, pp. 1188–1196, 2014.
Levy, O., Goldberg, Y., and Dagan, I., Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3:211–225, 2015.
Grover, A., and Leskovec, J.: node2vec: scalable feature learning for networks. In: KDD, pp. 855–864, ACM, 2016.
Nguyen, D., Luo, W., Nguyen, T. D., Venkatesh, S., and Phung, D.: Learning graph representation via frequent subgraphs. In: SDM. Accepted, SIAM, 2018.
Moen, H., Ginter, F., Marsi, E., Peltonen, L.-M., Salakoski, T., and Salanterä, S., Care episode retrieval: distributional semantic models for information retrieval in the clinical domain. BMC Med. Inform. Decis. Mak. 15(2):1, 2015.
Nguyen, P., Tran, T., Wickramasinghe, N., and Venkatesh, S., Deepr: a convolutional net for medical records. IEEE J. Biomed. Health Inform. 21(1):22–30, 2017.
Choi, E., Bahadori, M. T., Searles, E., Coffey, C., Thompson, M., Bost, J., Tejedor-Sojo, J., and Sun, J.: Multi-layer representation learning for medical concepts. In: KDD, pp. 1495–1504, ACM, 2016.
Choi, Y., Chiu, C. Y.-I., and Sontag, D.: Learning low-dimensional representations of medical concepts. In: AMIA Summits on Translational Science Proceedings, pp. 41–51, 2016.
Mikolov, T., Chen, K., Corrado, G., and Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013
Pearce, N., Analysis of matched case-control studies. BMJ 352:i969, 2016.
Nguyen, D., Luo, W., Phung, D., and Venkatesh, S.: Exceptional contrast set mining: moving beyond the deluge of the obvious. In: Australasian Joint Conference on Artificial Intelligence, pp. 455–468. Springer, Berlin, 2016.
Bigus, J., Campbell, M., Carmeli, B., Cefkin, M., Chang, H., Chen-Ritzo, C.-H., Cody, W., Ebadollahi, S., Evfimievski, A., Farkash, A., et al., Information technology for healthcare transformation. IBM Journal of Research and Development 55(5):6–20, 2011.
Thomas, K., Rahman, M., Mor, V., and Intrator, O., Influence of hospital and nursing home quality on hospital readmissions. The American Journal of Managed Care 20(11):e523, 2014.
Håkonsen, S., Pedersen, P., Bjerrum, M., Bygholm, A., and Peters, M., Nursing minimum data sets for documenting nutritional care for adults in primary healthcare: a scoping review. JBI Database of Systematic Reviews and Implementation Reports 16(1):117–139, 2018.
Maaten, L. V. D., and Hinton, G., Visualizing data using t-SNE. Journal of Machine Learning Research 9:2579–2605, 2008.
Futoma, J., Morris, J., and Lucas, J., A comparison of models for predicting early hospital readmissions. Journal of Biomedical Informatics 56:229–238, 2015.
Pham, T., Tran, T., Phung, D., and Venkatesh, S.: Deepcare: a deep dynamic memory model for predictive medicine. In: PAKDD, pp. 30–41. Springer, Berlin, 2016.
Turgeman, L., May, J., and Sciulli, R., Insights from a machine learning model for predicting the hospital length of stay (LOS) at the time of admission. Expert Systems with Applications 78:376–385, 2017.
Chaou, C.-H., Chen, H.-H., Chang, S.-H., Tang, P., Pan, S.-L., Yen, A. M.-F., and Chiu, T.-F., Predicting length of stay among patients discharged from the emergency department using an accelerated failure time model. PLoS One 12(1):e0165756, 2017.
Nguyen, D., Nguyen, T. D., Luo, W., and Venkatesh, S.: Trans2vec: learning transaction embedding via items and frequent itemsets. In: PAKDD. Accepted. Springer, Berlin, 2018.
Pobiedina, N., and Ichise, R., Citation count prediction as a link prediction problem. Applied Intelligence 44(2):252–268, 2016.
Acknowledgments
This work is partially supported by the Telstra-Deakin Centre of Excellence (CoE) in Big Data and Machine Learning. Dinh Phung gratefully acknowledges the partial support from the Australian Research Council (ARC).
Ethics declarations
Conflict of Interest
The authors have no conflict of interest to declare.
Ethical Approval
Ethics approval was obtained from the New South Wales Population and Health Services Research Ethics Committee (AU RED Reference: HREC/15/CIPHS/1).
Informed Consent
This study is a secondary analysis of routinely collected data, and the consent had been obtained by the original data guarantor.
Additional information
This article is part of the Topical Collection on Patient Facing Systems
About this article
Cite this article
Nguyen, D., Luo, W., Venkatesh, S. et al. Effective Identification of Similar Patients Through Sequential Matching over ICD Code Embedding. J Med Syst 42, 94 (2018). https://doi.org/10.1007/s10916-018-0951-4