Abstract
Multi-label classification tasks are relevant in healthcare, where data samples are commonly associated with multiple interdependent, non-mutually exclusive outcomes. Label information is often incomplete because of unrecorded outcomes at planned checkpoints, varying disease testing across patients, collection constraints, or human error. Dropping partially annotated samples can reduce data size, introduce bias, and compromise accuracy. To address these issues, this study introduces CORKI (Correlation-Optimised and Robust K Nearest Neighbours Imputation for Multi-label Classification), a data-centric method for imputing partial annotations in multi-label data. The method employs proximity measures and an optional weighting term for outcome prevalence to tackle imbalanced labels. It also leverages different modalities of correlation that consider not only variable values but also missingness patterns. CORKI was compared with a domain-knowledge-based rule system and the standard sample-dropping approach on three public and one private cardiothoracic surgery datasets with diverse missing-label rates. It performed comparably to the domain-knowledge approach, establishing itself as a reliable method while remaining highly generalizable. Moreover, it maintained imputation accuracy in demanding partial annotation scenarios, with drops of only 5% at missing-label rates of 50%.
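To make the general idea concrete, the following is a minimal sketch of nearest-neighbour label imputation with an optional prevalence weight. It is not the CORKI algorithm itself: the function name `knn_impute_labels`, the Euclidean distance, the inverse-distance vote, the 0.5 threshold, and the particular form of the prevalence weight are all assumptions made for illustration, and the correlation and missingness-pattern components described in the abstract are omitted entirely.

```python
import numpy as np

def knn_impute_labels(X, Y, k=3, prevalence_weight=False):
    """Fill NaN entries of a binary label matrix Y using the k nearest rows of X.

    Hypothetical illustration only; not the method published as CORKI.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Y_filled = Y.copy()
    # Pairwise Euclidean distances between samples in feature space.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for j in range(Y.shape[1]):
        observed = ~np.isnan(Y[:, j])
        if not observed.any():
            continue  # no annotated sample to borrow from for this label
        prevalence = Y[observed, j].mean()
        donors = np.flatnonzero(observed)
        for i in np.flatnonzero(~observed):
            # k closest samples among those where label j is annotated.
            nearest = donors[np.argsort(dists[i, donors])[:k]]
            w = 1.0 / (dists[i, nearest] + 1e-9)  # closer neighbours count more
            votes = Y[nearest, j]
            if prevalence_weight and 0.0 < prevalence < 1.0:
                # Assumed weighting: scale each vote by the prevalence of the
                # opposite class, so the rarer class is up-weighted.
                w = w * np.where(votes == 1, 1.0 - prevalence, prevalence)
            score = np.sum(w * votes) / np.sum(w)
            Y_filled[i, j] = float(score >= 0.5)
    return Y_filled

# Toy data: the last sample lies next to the first cluster and lacks label 0.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [0.1, 0.1]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [np.nan, 0]], dtype=float)
filled = knn_impute_labels(X, Y, k=2, prevalence_weight=True)
```

In this toy case the two nearest annotated neighbours of the last sample both carry label 0, so the missing entry is filled with 1, while every observed entry is left untouched.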
Acknowledgements
This work was supported by European funds through Plano de Recuperação e Resiliência, project “Center for Responsible AI” with project number 62_C645008882-00000055.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Santos, R. et al. (2025). CORKI: A Correlation-Driven Imputation Method for Partial Annotation Scenarios in Multi-label Clinical Problems. In: Meo, R., Silvestri, F. (eds) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2023. Communications in Computer and Information Science, vol 2136. Springer, Cham. https://doi.org/10.1007/978-3-031-74640-6_1
DOI: https://doi.org/10.1007/978-3-031-74640-6_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-74639-0
Online ISBN: 978-3-031-74640-6
eBook Packages: Artificial Intelligence (R0)