Abstract
Missing data in a database frequently complicate its processing and analysis. This research presents a new methodology for missing data imputation based on multivariate adaptive regression splines (MARS). The performance of the proposed method is assessed on a database built from the hourly records of environmental stations located in the city of Madrid (Spain). The data analyzed correspond to hourly measurements from 10 February 2004 to 31 May 2010. The proposed methodology has three variants. The first uses all the available information to fit different MARS models that predict the missing values from the available data. In the second, the MARS models are trained after removing the 1% most extreme cases according to their Mahalanobis distances, as these are considered outliers. Finally, the third variant uses only the information from the previous month to fit the MARS models for missing data prediction. The results obtained outperformed those given by multivariate imputation by chained equations (MICE) when applied to the same data sets. For a data set with 20% of its information missing, the proposed algorithm outperforms MICE in at least 65.5% of cases in terms of RMSE, 75.2% in terms of MAE and 76% in terms of MAPE.
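As an illustration of two steps named in the abstract, the following minimal sketch shows how the 1% most extreme cases could be dropped by Mahalanobis distance and how RMSE, MAE and MAPE are computed. It is not the authors' implementation: the function names, the use of the sample mean and sample covariance, and the pseudo-inverse are all assumptions made for this example, and the MARS fitting itself is not shown.

```python
import numpy as np


def trim_by_mahalanobis(X, fraction=0.01):
    """Return X without the `fraction` of rows farthest from the centroid.

    Assumption: distances use the sample mean and sample covariance of X.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    # Squared Mahalanobis distance of every row.
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    n_drop = int(np.ceil(fraction * len(X)))
    keep = np.sort(np.argsort(d2)[: len(X) - n_drop])  # preserve row order
    return X[keep]


def rmse(y_true, y_pred):
    """Root mean square error."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(err)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes no zeros in y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    err = np.asarray(y_pred, dtype=float) - y_true
    return float(np.mean(np.abs(err / y_true)) * 100.0)
```

In this sketch, an imputation model would be fitted on the output of `trim_by_mahalanobis`, and the three metrics would compare imputed values against values that were deliberately hidden.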
Acknowledgements
Laura Bonavera, Luigi Toffolatti and Joaquín González-Nuevo acknowledge financial support from the PGC 2018 project PGC2018-101948-B-I00 (MICINN, FEDER) and PAPI-19-EMERG-11 (Universidad de Oviedo). Joaquín González-Nuevo acknowledges financial support from the Spanish MINECO for the ‘Ramón y Cajal’ fellowship (RYC-2013-13256). Susana del Carmen Fernández Menéndez acknowledges financial support from PERMASNOW CTM2014-52021-R from the Spanish MINECO.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Sánchez Lasheras, F. et al. (2020). Missing Data Imputation for Continuous Variables Based on Multivariate Adaptive Regression Splines. In: de la Cal, E.A., Villar Flecha, J.R., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2020. Lecture Notes in Computer Science, vol 12344. Springer, Cham. https://doi.org/10.1007/978-3-030-61705-9_7
DOI: https://doi.org/10.1007/978-3-030-61705-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61704-2
Online ISBN: 978-3-030-61705-9
eBook Packages: Computer Science, Computer Science (R0)