CORKI: A Correlation-Driven Imputation Method for Partial Annotation Scenarios in Multi-label Clinical Problems

Santos, Ricardo; Ribeiro, Bruno; Curioso, Isabel; Barandas, Marília; V. Carreiro, André; Gamboa, Hugo; Coelho, Pedro; Fragata, José; Sousa, Inês

doi:10.1007/978-3-031-74640-6_1

Ricardo Santos^4,5 (ORCID: orcid.org/0000-0002-4478-2476),
Bruno Ribeiro^4,
Isabel Curioso^4,
Marília Barandas^4,5,
André V. Carreiro^4,
Hugo Gamboa^4,5,
Pedro Coelho^6,7,
José Fragata^6,7 &
Inês Sousa^4

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2136)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Multi-label classification tasks are relevant in healthcare, as data samples are commonly associated with multiple interdependent, non-mutually exclusive outcomes. Incomplete label information often arises due to unrecorded outcomes at planned checkpoints, varying disease testing across patients, collection constraints, or human error. Dropping partially annotated samples can reduce data size, introduce bias, and compromise accuracy. To address these issues, this study introduces CORKI (Correlation-Optimised and Robust K Nearest Neighbours Imputation for Multi-label Classification), a data-centric method for imputing partial annotations in multi-label data. The method employs proximity measures and an optional weighting term for outcome prevalence to tackle imbalanced labels. Additionally, it leverages different modalities of correlation that consider not only variable values but also missingness patterns. CORKI’s performance was compared with a domain-knowledge-based rule system and the standard sample-dropping approach on four datasets with diverse missing label rates: three public datasets and one private cardiothoracic surgery dataset. CORKI performed comparably to the domain-knowledge approach, establishing itself as a reliable method while being highly generalizable. Moreover, it maintained imputation accuracy in demanding partial annotation scenarios, with drops of only 5% at missing rates of 50%.
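The abstract describes the approach only at a high level, so the following is a generic illustration of nearest-neighbour label imputation with an optional prevalence weight, not the authors' implementation. The function name `knn_impute_labels`, the inverse-distance vote, and the specific weighting scheme (positive votes scaled by one minus the label prevalence, negative votes by the prevalence) are all assumptions made for this sketch; the correlation and missingness-pattern components mentioned above are not modelled.

```python
import math


def knn_impute_labels(X, Y, k=3, prevalence_weighting=False):
    """Fill missing entries (None) of a binary label matrix Y using a
    distance-weighted vote among the k nearest samples in feature space X.

    Hypothetical helper for illustration only; it is not CORKI itself.
    """
    n_labels = len(Y[0])
    imputed = [row[:] for row in Y]  # copy so the input stays untouched
    for j in range(n_labels):
        # Donors: samples whose label j was actually observed.
        donors = [i for i, row in enumerate(Y) if row[j] is not None]
        if not donors:
            continue  # no observed values for this label; leave gaps as-is
        prevalence = sum(Y[i][j] for i in donors) / len(donors)
        for i, row in enumerate(Y):
            if row[j] is not None:
                continue
            nearest = sorted(donors, key=lambda d: math.dist(X[i], X[d]))[:k]
            pos = neg = 0.0
            for d in nearest:
                w = 1.0 / (math.dist(X[i], X[d]) + 1e-9)  # closer = stronger
                if Y[d][j] == 1:
                    # Assumed scheme: up-weight rare positive labels.
                    pos += w * ((1.0 - prevalence) if prevalence_weighting else 1.0)
                else:
                    neg += w * (prevalence if prevalence_weighting else 1.0)
            imputed[i][j] = 1 if pos > neg else 0
    return imputed


# Toy data: two clusters; the last sample is missing its first label.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.05, 0.05]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1], [None, 0]]
result = knn_impute_labels(X, Y, k=2)
print(result[4])  # [1, 0]
```

In this toy case the two nearest donors both carry a positive first label, so the gap is filled with 1. With `prevalence_weighting=True`, a single positive neighbour of a rare label can outvote several negative neighbours, which is one simple way to counter label imbalance.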

Acknowledgements

This work was supported by European funds through Plano de Recuperação e Resiliência, project “Center for Responsible AI” with project number 62_C645008882-00000055.

Author information

Authors and Affiliations

Associação Fraunhofer Portugal Research, Porto, Portugal
Ricardo Santos, Bruno Ribeiro, Isabel Curioso, Marília Barandas, André V. Carreiro, Hugo Gamboa & Inês Sousa
LIBPhys-UNL, NOVA School of Science and Technology, Caparica, Portugal
Ricardo Santos, Marília Barandas & Hugo Gamboa
Comprehensive Health Research Center, NOVA Medical School, Lisboa, Portugal
Pedro Coelho & José Fragata
Hospital de Santa Marta, Centro Hospitalar Universitário Lisboa Central, Lisboa, Portugal
Pedro Coelho & José Fragata

Corresponding author

Correspondence to Ricardo Santos.

Editor information

Editors and Affiliations

University of Turin, Turin, Italy
Rosa Meo
Sapienza University of Rome, Rome, Italy
Fabrizio Silvestri

About this paper

Cite this paper

Santos, R. et al. (2025). CORKI: A Correlation-Driven Imputation Method for Partial Annotation Scenarios in Multi-label Clinical Problems. In: Meo, R., Silvestri, F. (eds) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2023. Communications in Computer and Information Science, vol 2136. Springer, Cham. https://doi.org/10.1007/978-3-031-74640-6_1

DOI: https://doi.org/10.1007/978-3-031-74640-6_1
Published: 01 January 2025
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-74639-0
Online ISBN: 978-3-031-74640-6
eBook Packages: Artificial Intelligence (R0)
