Assessing the Impact of Distance Functions on K-Nearest Neighbours Imputation of Biomedical Datasets

Santos, Miriam S.; Abreu, Pedro H.; Wilk, Szymon; Santos, João

doi:10.1007/978-3-030-59137-3_43

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12299)

Included in the following conference series:

International Conference on Artificial Intelligence in Medicine

Abstract

In healthcare domains, dealing with missing data is crucial since absent observations compromise the reliability of decision support models. K-nearest neighbours imputation has proven beneficial since it takes advantage of the similarity between patients to replace missing values. Nevertheless, its performance largely depends on the distance function used to evaluate such similarity. In the literature, k-nearest neighbours imputation frequently neglects the nature of data or performs feature transformation, whereas in this work, we study the impact of different heterogeneous distance functions on k-nearest neighbour imputation for biomedical datasets. Our results show that distance functions considerably impact the performance of classifiers learned from the imputed data, especially when data is complex.
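
The idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the Heterogeneous Euclidean-Overlap Metric (HEOM) of Wilson and Martinez (1997) as one example of a heterogeneous distance function that tolerates missing values, and the function names `heom` and `knn_impute`, the parameter `k`, and the toy patient records are all hypothetical.

```python
import math
from collections import Counter


def heom(x, y, is_nominal, ranges):
    """HEOM distance between two records; None marks a missing value."""
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            d = 1.0  # maximal per-feature distance when either value is missing
        elif is_nominal[a]:
            d = 0.0 if xa == ya else 1.0  # overlap distance for nominal features
        else:
            # range-normalised absolute difference for continuous features
            d = abs(xa - ya) / ranges[a] if ranges[a] > 0 else 0.0
        total += d * d
    return math.sqrt(total)


def knn_impute(data, is_nominal, k=3):
    """Fill each missing value from the k nearest records that observe it."""
    n_features = len(is_nominal)
    # Ranges of continuous features, computed from observed values only.
    ranges = []
    for a in range(n_features):
        if is_nominal[a]:
            ranges.append(None)
        else:
            observed = [row[a] for row in data if row[a] is not None]
            ranges.append(max(observed) - min(observed) if observed else 0.0)

    imputed = [list(row) for row in data]  # leave the input untouched
    for i, row in enumerate(data):
        for a in range(n_features):
            if row[a] is not None:
                continue
            # Candidate donors: every other record with feature a observed.
            donors = [(heom(row, other, is_nominal, ranges), other[a])
                      for j, other in enumerate(data)
                      if j != i and other[a] is not None]
            donors.sort(key=lambda pair: pair[0])
            values = [value for _, value in donors[:k]]
            if not values:
                continue  # no donor available; keep the value missing
            if is_nominal[a]:
                imputed[i][a] = Counter(values).most_common(1)[0][0]  # mode
            else:
                imputed[i][a] = sum(values) / len(values)  # mean
    return imputed


# Toy records (hypothetical): age, sex, a continuous lab value.
patients = [
    [50.0, "M", 1.0],
    [52.0, "M", None],
    [80.0, "F", 3.0],
    [51.0, None, 1.2],
    [79.0, "F", 2.8],
]
result = knn_impute(patients, is_nominal=[False, True, False], k=2)
```

Swapping `heom` for another distance function changes which records count as neighbours, and therefore which values are imputed; this is the dependence the chapter investigates.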

Acknowledgements

This work was supported in part by the project NORTE-01-0145-FEDER-000027 (Norte Portugal Regional Operational Programme – Norte 2020) and in part by the FCT Research Grant SFRH/BD/138749/2018.

Author information

Authors and Affiliations

CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
Miriam S. Santos & Pedro H. Abreu
Institute of Computing Science, Poznan University of Technology, Poznan, Poland
Szymon Wilk
IPO-Porto Research Centre, Porto, Portugal
Miriam S. Santos & João Santos

Corresponding author

Correspondence to Miriam S. Santos.

Editor information

Editors and Affiliations

School of Nursing, University of Minnesota, Minneapolis, MN, USA
Martin Michalowski
Ben-Gurion University of the Negev, Tonawanda, NY, USA
Robert Moskovitch

Appendix

Table 4 presents the mathematical formulation for all distance functions described in Sect. 2.

Table 4. Mathematical formulation of heterogeneous distance functions that handle missing data.
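
As a hedged illustration of the kind of formulation such a table contains, one widely used heterogeneous distance that handles missing values is the Heterogeneous Euclidean-Overlap Metric (HEOM) of Wilson and Martinez (1997). The notation below (records x and y with m features, and the per-feature distance d_a) is an assumption and may differ from the notation used in Table 4.

```latex
\[
\mathrm{HEOM}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{a=1}^{m} d_a(x_a, y_a)^2},
\qquad
d_a(x_a, y_a) =
\begin{cases}
1, & \text{if } x_a \text{ or } y_a \text{ is missing},\\[4pt]
\mathbb{1}[x_a \neq y_a], & \text{if feature } a \text{ is nominal},\\[4pt]
\dfrac{|x_a - y_a|}{\max_a - \min_a}, & \text{if feature } a \text{ is continuous}.
\end{cases}
\]
```

Here max_a and min_a denote the largest and smallest observed values of feature a, so every per-feature distance lies in [0, 1] and a missing value is treated as maximally distant.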

About this paper

Cite this paper

Santos, M.S., Abreu, P.H., Wilk, S., Santos, J. (2020). Assessing the Impact of Distance Functions on K-Nearest Neighbours Imputation of Biomedical Datasets. In: Michalowski, M., Moskovitch, R. (eds) Artificial Intelligence in Medicine. AIME 2020. Lecture Notes in Computer Science, vol 12299. Springer, Cham. https://doi.org/10.1007/978-3-030-59137-3_43

DOI: https://doi.org/10.1007/978-3-030-59137-3_43
Published: 26 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59136-6
Online ISBN: 978-3-030-59137-3
eBook Packages: Computer Science; Computer Science (R0)
