Exploring the Effects of Data Distribution in Missing Data Imputation

Pompeu Soares, Jastin; Seoane Santos, Miriam; Henriques Abreu, Pedro; Araújo, Hélder; Santos, João

doi:10.1007/978-3-030-01768-2_21

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11191)

Included in the following conference series:

International Symposium on Intelligent Data Analysis

Abstract

In data imputation problems, researchers typically use several techniques, individually or in combination, to find the one that performs best over all the features in the dataset. This strategy, however, neglects the nature of the data (its distribution) and makes generalising the findings impractical, since new datasets require a large number of new, time-consuming experiments. To overcome this issue, this work aims to understand the relationship between data distribution and the performance of standard imputation techniques, providing a heuristic for choosing proper imputation methods and avoiding the need to test a large set of methods. To this end, several datasets were selected, covering different sample sizes, numbers of features, distributions and contexts, and missing values were inserted at different percentages and under different scenarios. Then, different imputation methods were evaluated in terms of predictive and distributional accuracy. Our findings show that there is a relationship between features’ distribution and algorithms’ performance, and that this performance seems to be affected by the combination of missing rate and scenario at stake, as well as by less obvious factors such as sample size, goodness-of-fit of features, and the ratio between the number of features and the different distributions present in the dataset.
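
The evaluation loop described in the abstract (insert missing values at a given rate, impute, then score predictive and distributional accuracy) can be illustrated with a minimal sketch. The abstract does not name the missingness scenarios, the imputation methods or the metrics, so everything concrete here is an assumption chosen for illustration: values are removed completely at random, mean imputation stands in for "a standard technique", RMSE on the masked entries stands in for predictive accuracy, and a two-sample Kolmogorov-Smirnov statistic stands in for distributional accuracy. The function names are hypothetical and are not taken from the paper.

```python
import math
import random
import statistics
from bisect import bisect_right


def insert_mcar(values, rate, seed=0):
    """Return a copy of `values` with a fraction `rate` of entries set to None.

    Assumption: entries are removed completely at random (MCAR).
    """
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Replace every None with the mean of the observed entries."""
    fill = statistics.fmean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def rmse(original, imputed, masked):
    """Predictive accuracy proxy: RMSE over the entries that were masked."""
    idx = [i for i, v in enumerate(masked) if v is None]
    return math.sqrt(sum((original[i] - imputed[i]) ** 2 for i in idx) / len(idx))


def ks_statistic(a, b):
    """Distributional accuracy proxy: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in points
    )


if __name__ == "__main__":
    # Synthetic, normally distributed feature (illustrative data only).
    rng = random.Random(42)
    feature = [rng.gauss(50.0, 10.0) for _ in range(500)]

    for rate in (0.05, 0.20, 0.40):
        masked = insert_mcar(feature, rate, seed=1)
        imputed = mean_impute(masked)
        print(f"rate={rate:.2f}  "
              f"RMSE={rmse(feature, imputed, masked):.3f}  "
              f"KS={ks_statistic(feature, imputed):.3f}")
```

Reporting the two scores side by side reflects the distinction the abstract draws: an imputer can achieve a reasonable error on the masked values while still distorting the shape of the feature's distribution, which a point-wise error alone would not reveal.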

Acknowledgments

This article is a result of the project NORTE-01-0145-FEDER-000027, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).

Author information

Authors and Affiliations

CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
Jastin Pompeu Soares, Miriam Seoane Santos & Pedro Henriques Abreu
ISR, Department of Electrical and Computer Engineering, University of Coimbra, Coimbra, Portugal
Hélder Araújo
IPO-Porto Research Centre (CI-IPOP), Porto, Portugal
João Santos

Corresponding author

Correspondence to Pedro Henriques Abreu.

Editor information

Editors and Affiliations

Eindhoven University of Technology, Eindhoven, The Netherlands
Wouter Duivesteijn
Department of Information and Computing Sciences, University Utrecht, Utrecht, The Netherlands
Arno Siebes
University of Helsinki, Helsinki, Finland
Antti Ukkonen

About this paper

Cite this paper

Pompeu Soares, J., Seoane Santos, M., Henriques Abreu, P., Araújo, H., Santos, J. (2018). Exploring the Effects of Data Distribution in Missing Data Imputation. In: Duivesteijn, W., Siebes, A., Ukkonen, A. (eds) Advances in Intelligent Data Analysis XVII. IDA 2018. Lecture Notes in Computer Science, vol 11191. Springer, Cham. https://doi.org/10.1007/978-3-030-01768-2_21

DOI: https://doi.org/10.1007/978-3-030-01768-2_21
Published: 05 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01767-5
Online ISBN: 978-3-030-01768-2
eBook Packages: Computer Science, Computer Science (R0)
