
Predicting missing values: a comparative study on non-parametric approaches for imputation

  • Original paper
  • Published in Computational Statistics

Abstract

Missing data is an expected issue when large amounts of data are collected, and several imputation techniques have been proposed to tackle this problem. Besides classical approaches such as MICE, the application of machine learning techniques is tempting. Here, the recently proposed missForest imputation method has shown high imputation accuracy under the Missing (Completely) at Random scheme for various missing rates. At its core, it is based on random forests for classification and regression, respectively. In this paper we study whether this approach can be further enhanced by other methods such as stochastic gradient tree boosting, the C5.0 algorithm, BART or modified random forest procedures. In particular, alternative resampling strategies within the random forest protocol are suggested. In an extensive simulation study, we analyze their performance for continuous, categorical as well as mixed-type data. An empirical analysis focusing on credit information and Facebook data complements our investigations.
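The imputation methods under comparison are all accessible from R. As a hedged illustration only (the package calls are standard, but the data set, seed and settings below are arbitrary and not those of the study), missForest-based and MICE-based imputation can be run as follows:

```r
## Illustrative sketch only: impute an artificially masked data set with the
## missForest and mice R packages (settings are arbitrary, not those of the study).
library(missForest)
library(mice)

set.seed(42)
iris_mis <- prodNA(iris, noNA = 0.2)            # prodNA(): insert roughly 20% MCAR missing values

mf <- missForest(iris_mis)                      # random-forest-based single imputation
head(mf$ximp)                                   # completed data set
mf$OOBerror                                     # out-of-bag imputation error estimate

mi <- mice(iris_mis, m = 5, printFlag = FALSE)  # MICE: multiple imputation by chained equations
head(complete(mi, 1))                           # first completed data set
```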

Author information

Corresponding author

Correspondence to Burim Ramosaj.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

We are thankful to David Stillwell from Cambridge University and Michal Kosinski from the Stanford Graduate School for providing us with the Facebook data. We acknowledge the support of Daimler AG. Moreover, we would like to thank two anonymous expert referees for their valuable and insightful comments.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 284 KB)

Appendix

Missing values were inserted artificially according to the respective missingness mechanisms. For the MCAR and MAR conditions, the mechanisms were implemented as described in the following:

  1. Missing completely at random: We replace values randomly with missing values. For every variable \(\mathbf{X}_{j}\), \(j = 1,\ldots,p\), we assume that \(R_{ij} \overset{iid}{\sim} \mathrm{Bernoulli}(1-r)\), \(i = 1,\ldots,n\), where \(r \in \{0.1, 0.2, 0.3\}\) is the overall missing rate based on all \(n \cdot p\) entries.

  2. Missing at random: We implement this mechanism by building dependency structures across missing values of subsequent variables using logistic regression. First, randomly select \(j^{*} \in \{1,\ldots,p\}\) as the initial index and assume that \(R_{ij^{*}} \overset{iid}{\sim} \mathrm{Bernoulli}(1 - r)\), where \(r \in \{0.1, 0.2, 0.3\}\) is the overall missing rate. The missing values for the subsequent variable \(\mathbf{X}_{j_{s}^{*}}\) are inserted using the observed components of \(\mathbf{X}_{j^{*}}\) as covariate values within a logistic regression model. The response variable is generated randomly in an upstream step in order to estimate the model parameters. To this end, let \(\mathbf{X}_{j^{*}}^{obs}\) be the sub-vector of observed components of \(\mathbf{X}_{j^{*}}\). We construct a training response by generating \(\tilde{R}_{ij_{s}^{*}} \overset{iid}{\sim} \mathrm{Bernoulli}(1 - r)\) for all \(i \in \mathbf{i}_{j^{*}}^{obs}\) and take \(\{ \tilde{\mathbf{R}}_{j_{s}^{*}}, \mathbf{X}_{j^{*}}^{obs} \}\) as the training sample on which the logistic regression is fitted. If \(\hat{p}_{i j_{s}^{*}}\) denotes the predicted probability \(P(\tilde{R}_{ij_{s}^{*}} = 1 \mid X_{ij^{*}})\), \(i \in \mathbf{i}_{j^{*}}^{obs}\), then for the observations \(k \in \mathbf{i}_{j^{*}}^{obs}\) with the \(\lfloor r \cdot n \rfloor\) smallest values of \(\hat{p}_{k j_{s}^{*}}\) we set \(R_{kj_{s}^{*}} = 0\). The process is continued in a pairwise fashion until all variables have been treated. An illustrative R sketch of both mechanisms is given after this list.
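For concreteness, the following R sketch shows how such MCAR and MAR missingness could be inserted into a numeric data matrix. It is a minimal illustration under simplifying assumptions (numeric columns only; the helper names ampute_mcar and ampute_mar are hypothetical) and not the code used for the simulation study.

```r
## Illustrative sketch only (not the authors' code): insert MCAR and MAR
## missingness into a numeric data matrix X, following the two mechanisms above.

# MCAR: every entry is observed with probability 1 - r, i.e. R_ij ~ Bernoulli(1 - r)
ampute_mcar <- function(X, r = 0.1) {
  R <- matrix(rbinom(length(X), size = 1, prob = 1 - r), nrow = nrow(X))
  X[R == 0] <- NA
  X
}

# MAR: missingness of each variable depends on the observed part of the
# previously treated variable via a logistic regression, as described above.
ampute_mar <- function(X, r = 0.1) {
  n <- nrow(X); p <- ncol(X)
  js <- sample(p)                                  # random ordering; js[1] plays the role of j*
  X[rbinom(n, 1, 1 - r) == 0, js[1]] <- NA         # initial variable: MCAR with rate r
  for (s in seq_len(p)[-1]) {
    prev <- js[s - 1]; cur <- js[s]
    obs  <- which(!is.na(X[, prev]))               # observed components of the previous variable
    R_tilde <- rbinom(length(obs), 1, 1 - r)       # training response generated upstream
    fit  <- glm(R_tilde ~ X[obs, prev], family = binomial())
    phat <- fitted(fit)                            # predicted P(R_tilde = 1 | X_prev)
    idx  <- obs[order(phat)[seq_len(min(floor(r * n), length(obs)))]]
    X[idx, cur] <- NA                              # floor(r * n) smallest probabilities become missing
  }
  X
}

set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)
mean(is.na(ampute_mcar(X, r = 0.2)))               # overall missing rate close to 0.2
mean(is.na(ampute_mar(X, r = 0.2)))
```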

About this article

Cite this article

Ramosaj, B., Pauly, M. Predicting missing values: a comparative study on non-parametric approaches for imputation. Comput Stat 34, 1741–1764 (2019). https://doi.org/10.1007/s00180-019-00900-3


  • DOI: https://doi.org/10.1007/s00180-019-00900-3
