Abstract
We investigate a new multivariate data imputation approach for dealing with variety of types of missingness. The proposed approach relies on the aggregation of the most suitable methods from a multitude of imputation techniques, adjusted to each feature of the dataset. We report results from comparison with two single imputation techniques (Random Guessing and Median Imputation) and four state-of-the-art multivariate methods (K-Nearest Neighbour Imputation, Bagged Tree Imputation, Missing Imputation Chained Equations, and Bayesian Principal Component Analysis Imputation) on several datasets from the public domain, demonstrating favorable performance for our model. The proposed method, namely Feature Guided Data Imputation is compared with the other tested methods in three different experimental settings: Missing Completely at Random, Missing at Random and Missing Not at Random with 25% missing data in the test set over five-fold cross validation. Furthermore, the proposed model has straightforward implementation and can easily incorporate other imputation techniques.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Enders, C.K.: Applied Missing Data Analysis. Guildford Press, Guidford (2010)
Schmitt, P., Mandel, J., Guedj, M.: A comparison of six methods for missing data imputation. J. of Biometrics Biostat. 6(1), 1–6 (2015)
Jordanov, I., Petrov, N., Petrozziello, A.: Classifiers accuracy improvement based on missing data imputation. J. Artif. Intell. Soft Comput. Res. 8(1), 33–48 (2018)
Cohen, J., Cohen, P., West, S.G., Aiken, L.S.: Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Routledge, Abingdon (2013)
Sarro, F., Petrozziello, A., Harman, M.: Multi-objective software effort estimation. In: 2016 IEEE/ACM 38th IEEE International Conference on Software Engineering (ICSE), Austin (2016)
Osborne, J., Overbay, A.: Best practices in data cleaning. Best Pract. Quant. Methods 1(1), 205–213 (2008)
Rahman, G., Islam, Z.: A decision tree-based missing value imputation technique for data pre-processing. In: Proceedings of the 9th Australasian Data Mining Conference (2011)
Frènay, B., Verleysen, M.: Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst. 5(5), 845–869 (2014)
Valdiviezo, C., Van Aelst, S.: Tree-based prediction on incomplete data using imputation or surrogate decisions. Inf. Sci. 311, 163–181 (2015)
Troyanskaya, O., et al.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001)
Cartwright, M., Shepperd, M.J., Song, Q.: Dealing with missing software project data. In: Proceedings of the 9th International Software Metrics Symposium (2003)
Batista, G., Monard, M.: A study of K-nearest neighbour as a model-based method to treat missing data. In: Argentine Symposium on Artificial Intelligence (2001)
Lee, M.C., Mitra, R.: Multiply imputing missing values in data sets with mixed measurement scales using a sequence of generalised linear models. Comput. Stat. Data Anal. 95(1), 24–38 (2016)
Graham, J.W.: Missing data analysis: making it work in the real world. Annu. Rev. Psychol. 60, 549–576 (2009)
Bartlett, J., Seaman, S., White, I., Carpenter, J.: Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat. Methods Med. Res. 24(4), 462–487 (2015)
Oba, S., Sato, M.-A., Takemasa, I., Monden, M., Matsubara, K.-I., Ishii, S.: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19(16), 2088–2096 (2003)
Petrozziello, A., Jordanov, I.: Column-wise guided data imputation. Proc. Comput. Sci. 108(1), 2282–2286 (2017)
Alcalá-Fdez, J., et al.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Log. Soft Comput. 17(2–3), 255–287 (2011)
Pan, X.-Y., Tian, Y., Huang, Y., Shen, H.-B.: Towards better accuracy for missing value estimation of epistatic miniarray profiling data by a novel ensemble approach. Genomics 97(5), 257–264 (2011)
Willmott, C.J., Matsuura, K.: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 30(1), 79–82 (2005)
Chai, T., Draxler, R.: Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 7(3), 1247–1250 (2014)
Whigham, P.A., Owen, C.A., Macdonell, S.G.: A baseline model for software effort estimation. ACM Trans. Softw. Eng. Methodol. (TOSEM) 24(3), 20 (2015)
Gòmez-Carracedo, M., Andrade, J., Lòpez-Mahìa, P., Muniategui, S., Prada, D.: A practical comparison of single and multiple imputation methods to handle complex missing data in air quality datasets. Chemometr. Intell. Lab. Syst. 134(1), 23–33 (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Petrozziello, A., Jordanov, I. (2019). Feature Based Multivariate Data Imputation. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R., Sciacca, V. (eds) Machine Learning, Optimization, and Data Science. LOD 2018. Lecture Notes in Computer Science(), vol 11331. Springer, Cham. https://doi.org/10.1007/978-3-030-13709-0_3
Download citation
DOI: https://doi.org/10.1007/978-3-030-13709-0_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-13708-3
Online ISBN: 978-3-030-13709-0
eBook Packages: Computer ScienceComputer Science (R0)