
Predicting missing values: a comparative study on non-parametric approaches for imputation

  • Original paper
  • Published in Computational Statistics

Abstract

Missing data is an expected issue when large amounts of data are collected, and several imputation techniques have been proposed to tackle this problem. Besides classical approaches such as MICE, the application of machine learning techniques is tempting. Here, the recently proposed missForest imputation method has shown high imputation accuracy under the Missing (Completely) at Random scheme for various missing rates. At its core, it is based on random forests for classification and regression, respectively. In this paper we study whether this approach can be further enhanced by other methods such as stochastic gradient tree boosting, the C5.0 algorithm, BART or modified random forest procedures. In particular, alternative resampling strategies within the random forest protocol are suggested. In an extensive simulation study, we analyze their performance for continuous, categorical as well as mixed-type data. An empirical analysis focusing on credit information and Facebook data complements our investigations.
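The imputation methods under comparison are all accessible from R. As a hedged illustration only (the package calls are standard, but the data set, seed and settings below are arbitrary and not those of the study), missForest-based and MICE-based imputation can be run as follows:

```r
## Illustrative sketch only: impute an artificially masked data set with the
## missForest and mice R packages (settings are arbitrary, not those of the study).
library(missForest)
library(mice)

set.seed(42)
iris_mis <- prodNA(iris, noNA = 0.2)            # prodNA(): insert roughly 20% MCAR missing values

mf <- missForest(iris_mis)                      # random-forest-based single imputation
head(mf$ximp)                                   # completed data set
mf$OOBerror                                     # out-of-bag imputation error estimate

mi <- mice(iris_mis, m = 5, printFlag = FALSE)  # MICE: multiple imputation by chained equations
head(complete(mi, 1))                           # first completed data set
```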

Author information

Corresponding author

Correspondence to Burim Ramosaj.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

We are thankful to David Stillwell from Cambridge University and Michal Kosinski from the Stanford Graduate School for providing us with the Facebook data. We acknowledge the support of Daimler AG. Moreover, we would like to thank two anonymous expert referees for their valuable and insightful comments.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 284 KB)

Appendix

Missing values were inserted artificially according to the respective missingness mechanisms. For the MCAR and MAR conditions, the mechanisms were implemented as described in the following:

  1. Missing completely at random: We replace values randomly with missing values. For every variable \(\mathbf{X}_{j}\), \(j = 1,\ldots,p\), we assume that \(R_{ij} \overset{iid}{\sim} \mathrm{Bernoulli}(1-r)\), \(i = 1,\ldots,n\), where \(r \in \{0.1, 0.2, 0.3\}\) is the overall missing rate based on all \(n \cdot p\) entries.

  2. Missing at random: We implement this mechanism by building dependency structures across missing values of subsequent variables using logistic regression. First, randomly select \(j^{*} \in \{1,\ldots,p\}\) as the initial index and assume that \(R_{ij^{*}} \overset{iid}{\sim} \mathrm{Bernoulli}(1 - r)\), where \(r \in \{0.1, 0.2, 0.3\}\) is the overall missing rate. The missing values for the subsequent variable \(\mathbf{X}_{j_{s}^{*}}\) are inserted using the observed components of \(\mathbf{X}_{j^{*}}\) as covariate values within a logistic regression model. The response variable is generated randomly in an upstream step in order to estimate the model parameters. To this end, let \(\mathbf{X}_{j^{*}}^{obs}\) be the sub-vector of observed components of \(\mathbf{X}_{j^{*}}\). We construct a training response by generating \(\tilde{R}_{ij_{s}^{*}} \overset{iid}{\sim} \mathrm{Bernoulli}(1 - r)\) for all \(i \in \mathbf{i}_{j^{*}}^{obs}\) and take \(\{ \tilde{\mathbf{R}}_{j_{s}^{*}}, \mathbf{X}_{j^{*}}^{obs} \}\) as the training sample on which the logistic regression is fitted. If \(\hat{p}_{i j_{s}^{*}}\) denotes the predicted probability \(P(\tilde{R}_{ij_{s}^{*}} = 1 \mid X_{ij^{*}})\), \(i \in \mathbf{i}_{j^{*}}^{obs}\), then for the observations \(k \in \mathbf{i}_{j^{*}}^{obs}\) with the \(\lfloor r \cdot n \rfloor\) smallest values of \(\hat{p}_{k j_{s}^{*}}\) we set \(R_{kj_{s}^{*}} = 0\). The process is continued in a pairwise fashion until all variables have been treated. An illustrative R sketch of both mechanisms is given after this list.
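For concreteness, the following R sketch shows how such MCAR and MAR missingness could be inserted into a numeric data matrix. It is a minimal illustration under simplifying assumptions (numeric columns only; the helper names ampute_mcar and ampute_mar are hypothetical) and not the code used for the simulation study.

```r
## Illustrative sketch only (not the authors' code): insert MCAR and MAR
## missingness into a numeric data matrix X, following the two mechanisms above.

# MCAR: every entry is observed with probability 1 - r, i.e. R_ij ~ Bernoulli(1 - r)
ampute_mcar <- function(X, r = 0.1) {
  R <- matrix(rbinom(length(X), size = 1, prob = 1 - r), nrow = nrow(X))
  X[R == 0] <- NA
  X
}

# MAR: missingness of each variable depends on the observed part of the
# previously treated variable via a logistic regression, as described above.
ampute_mar <- function(X, r = 0.1) {
  n <- nrow(X); p <- ncol(X)
  js <- sample(p)                                  # random ordering; js[1] plays the role of j*
  X[rbinom(n, 1, 1 - r) == 0, js[1]] <- NA         # initial variable: MCAR with rate r
  for (s in seq_len(p)[-1]) {
    prev <- js[s - 1]; cur <- js[s]
    obs  <- which(!is.na(X[, prev]))               # observed components of the previous variable
    R_tilde <- rbinom(length(obs), 1, 1 - r)       # training response generated upstream
    fit  <- glm(R_tilde ~ X[obs, prev], family = binomial())
    phat <- fitted(fit)                            # predicted P(R_tilde = 1 | X_prev)
    idx  <- obs[order(phat)[seq_len(min(floor(r * n), length(obs)))]]
    X[idx, cur] <- NA                              # floor(r * n) smallest probabilities become missing
  }
  X
}

set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)
mean(is.na(ampute_mcar(X, r = 0.2)))               # overall missing rate close to 0.2
mean(is.na(ampute_mar(X, r = 0.2)))
```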

About this article

Cite this article

Ramosaj, B., Pauly, M. Predicting missing values: a comparative study on non-parametric approaches for imputation. Comput Stat 34, 1741–1764 (2019). https://doi.org/10.1007/s00180-019-00900-3


  • DOI: https://doi.org/10.1007/s00180-019-00900-3
