Missing Data Imputation for Continuous Variables Based on Multivariate Adaptive Regression Splines

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12344)

Abstract

Missing data in a database frequently complicates its processing and analysis. This research presents a new methodology for missing data imputation based on multivariate adaptive regression splines (MARS). The performance of the proposed method is evaluated on a database built from the hourly records of environmental stations located in the city of Madrid (Spain). The data analyzed correspond to hourly measurements from 10 February 2004 to 31 May 2010. The proposed methodology has three variants. The first uses all the available information to fit different MARS models that predict the missing values from the available data. In the second, the MARS models are trained after removing the 1% most extreme cases according to their Mahalanobis distances, as these are considered outliers. Finally, the third variant uses only the information from the previous month to fit the MARS models for missing data prediction. The results obtained outperformed those given by multivariate imputation by chained equations (MICE) when applied to the same data sets. For a data set with 20% of its information missing, the proposed algorithm outperforms MICE in root mean square error (RMSE) in at least 65.5% of cases, in mean absolute error (MAE) in 75.2% and in mean absolute percentage error (MAPE) in 76%.
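The outlier removal of the second variant and the three error measures named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`trim_by_mahalanobis`, `rmse`, `mae`, `mape`) are hypothetical, the distances are assumed to be measured from the sample mean with the ordinary sample covariance, and MAPE is assumed to be expressed as a percentage with no zero true values. Fitting the MARS models themselves is left out.

```python
import numpy as np

def trim_by_mahalanobis(X, fraction=0.01):
    """Drop the `fraction` of rows farthest from the sample mean
    in squared Mahalanobis distance (assumed outlier rule)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Sample covariance of the columns; pinv tolerates a singular matrix.
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    inv_cov = np.linalg.pinv(cov)
    # Squared distance of every row: (x - mean)^T  S^-1  (x - mean)
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    n_drop = int(np.ceil(fraction * len(X)))
    keep = np.sort(np.argsort(d2)[: len(X) - n_drop])  # preserve row order
    return X[keep]

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    # Assumes no zero entries in y_true; result is a percentage.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Synthetic illustration only: 1000 rows, one obvious outlier in row 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[0] = [50.0, 50.0, 50.0]
trimmed = trim_by_mahalanobis(X, fraction=0.01)
```

In this sketch the trimmed array would then serve as training data for the imputation models, while the three metrics compare imputed values against values that were deliberately hidden.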

Acknowledgements

Laura Bonavera, Luigi Toffolatti and Joaquín González-Nuevo acknowledge financial support from the PGC 2018 project PGC2018-101948-B-I00 (MICINN, FEDER) and PAPI-19-EMERG-11 (Universidad de Oviedo). Joaquín González-Nuevo acknowledges financial support from the Spanish MINECO for the ‘Ramon y Cajal’ fellowship (RYC-2013-13256). Susana del Carmen Fernández Menéndez acknowledges financial support from PERMASNOW CTM2014-52021-R from the Spanish MINECO.

Author information

Corresponding author

Correspondence to Fernando Sánchez Lasheras.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this paper

Sánchez Lasheras, F. et al. (2020). Missing Data Imputation for Continuous Variables Based on Multivariate Adaptive Regression Splines. In: de la Cal, E.A., Villar Flecha, J.R., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2020. Lecture Notes in Computer Science, vol 12344. Springer, Cham. https://doi.org/10.1007/978-3-030-61705-9_7

  • DOI: https://doi.org/10.1007/978-3-030-61705-9_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61704-2

  • Online ISBN: 978-3-030-61705-9

  • eBook Packages: Computer Science, Computer Science (R0)
