CORKI: A Correlation-Driven Imputation Method for Partial Annotation Scenarios in Multi-label Clinical Problems

  • Conference paper
Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2023)

Abstract

Multi-label classification tasks are relevant in healthcare, as data samples are commonly associated with multiple interdependent, non-mutually exclusive outcomes. Incomplete label information often arises due to unrecorded outcomes at planned checkpoints, varying disease testing across patients, collection constraints, or human error. Dropping partially annotated samples can reduce data size, introduce bias, and compromise accuracy. To address these issues, this study introduces CORKI (Correlation-Optimised and Robust K Nearest Neighbours Imputation for Multi-label Classification), a data-centric method for partial annotation imputation in multi-label data. This method employs proximity measures and an optional weighting term for outcome prevalence to tackle imbalanced labels. Additionally, it leverages different modalities of correlation that consider not only variable values but also missingness patterns. CORKI's performance was compared with a domain-knowledge-based rule system and the standard sample-dropping approach on three public datasets and one private cardiothoracic surgery dataset with diverse missing label rates. CORKI yielded performance comparable to that of the domain-knowledge approach, establishing itself as a reliable method, while being highly generalizable. Moreover, it maintained imputation accuracy in demanding partial annotation scenarios, with drops of only 5% at missing rates of 50%.
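To make the general idea of proximity-based label imputation concrete, the following is a minimal sketch of a plain k-nearest-neighbours label imputer with an optional prevalence weight. It is not the CORKI algorithm: the correlation and missingness-pattern components described in the abstract are omitted, and the function name `knn_impute_labels`, the Euclidean distance, the inverse-distance vote, the 0.5 threshold, and the particular prevalence weighting are all assumptions chosen for illustration.

```python
import numpy as np


def knn_impute_labels(X, Y, k=3, prevalence_weighting=False):
    """Fill missing binary labels (np.nan) in Y from the k nearest samples in X.

    Illustrative sketch only; every design choice here is an assumption,
    not a description of the published method.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Y_out = Y.copy()

    for j in range(Y.shape[1]):
        observed = ~np.isnan(Y[:, j])
        if not observed.any():
            continue  # no annotated sample for this label, leave it missing

        X_obs, y_obs = X[observed], Y[observed, j]
        # Share of positives among annotated samples, kept away from 0 and 1
        # so the weights below never all collapse to zero.
        prevalence = np.clip(y_obs.mean(), 1e-6, 1.0 - 1e-6)

        for i in np.flatnonzero(~observed):
            dist = np.linalg.norm(X_obs - X[i], axis=1)
            nearest = np.argsort(dist)[:k]
            weights = 1.0 / (dist[nearest] + 1e-9)  # closer neighbours count more

            if prevalence_weighting:
                # Assumed scheme: up-weight votes for the rarer class.
                weights = weights * np.where(
                    y_obs[nearest] == 1, 1.0 - prevalence, prevalence
                )

            score = np.sum(weights * y_obs[nearest]) / np.sum(weights)
            Y_out[i, j] = float(score >= 0.5)

    return Y_out


# Toy data: two clusters, each with one partially annotated sample.
X_demo = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [0.1, 0.2], [5.1, 5.2]])
Y_demo = np.array(
    [[1, 0], [1, 0], [0, 1], [0, 1], [np.nan, 0], [0, np.nan]], dtype=float
)
Y_filled = knn_impute_labels(X_demo, Y_demo, k=2)
Y_filled_weighted = knn_impute_labels(X_demo, Y_demo, k=2, prevalence_weighting=True)
```

In this toy setup each missing label is filled from annotated samples in the same cluster, so the partially annotated rows are kept rather than dropped.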


Acknowledgements

This work was supported by European funds through Plano de Recuperação e Resiliência, project “Center for Responsible AI” with project number 62_C645008882-00000055.

Author information

Corresponding author

Correspondence to Ricardo Santos.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Santos, R. et al. (2025). CORKI: A Correlation-Driven Imputation Method for Partial Annotation Scenarios in Multi-label Clinical Problems. In: Meo, R., Silvestri, F. (eds) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2023. Communications in Computer and Information Science, vol 2136. Springer, Cham. https://doi.org/10.1007/978-3-031-74640-6_1

  • DOI: https://doi.org/10.1007/978-3-031-74640-6_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-74639-0

  • Online ISBN: 978-3-031-74640-6

  • eBook Packages: Artificial Intelligence (R0)
