Abstract
Imputing missing data remains a challenge for mixed datasets that contain variables of different natures, such as continuous, count, ordinal, categorical, and binary variables. The recently introduced Mixed Deep Gaussian Mixture Models (MDGMM) explicitly handle these different variable types. MDGMMs learn continuous, low-dimensional representations of mixed datasets that capture the dependence structure between variables. We propose a model inversion that uses the learned latent representation and maps it to the observed parts of the data. For each missing value, latent areas of interest are identified with an optimization method, and synthetic imputation values are drawn from them. This new procedure is called MI2AMI (Missing data Imputation using MIxed deep GAussian MIxture models). The approach is tested against state-of-the-art mixed data imputation algorithms based on chained equations, Random Forests, k-Nearest Neighbours, and Generative Adversarial Networks. Two missing-value designs were tested, namely Missing Completely at Random (MCAR) and Missing at Random (MAR), with missing-value rates ranging from 10% to 30%.
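The two missing-value designs can be illustrated with a minimal sketch. It is not the masking procedure used in the paper, which the abstract does not specify. The function names `mcar_mask` and `mar_mask`, the logistic dependence on a fully observed "driver" column, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def mcar_mask(shape, rate, rng):
    """MCAR: each cell is masked with the same probability, independently of the data."""
    return rng.random(shape) < rate


def mar_mask(driver, n_cols, rate, rng):
    """MAR: the masking probability depends only on a fully observed driver column.

    Assumption: a logistic function of the standardized driver, rescaled so
    that the average masking probability equals `rate`.
    """
    z = (driver - driver.mean()) / driver.std()
    weights = 1.0 / (1.0 + np.exp(-z))
    probs = np.clip(weights * rate / weights.mean(), 0.0, 1.0)
    # One independent draw per cell; rows with a large driver value are masked more often.
    return rng.random((driver.shape[0], n_cols)) < probs[:, None]


rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))  # toy continuous data, hypothetical

# MCAR at a 10% rate over all columns.
m_mcar = mcar_mask(X.shape, 0.10, rng)
X_mcar = np.where(m_mcar, np.nan, X)

# MAR at a 30% rate: column 0 stays observed and drives missingness in columns 1 to 3.
m_mar = mar_mask(X[:, 0], 3, 0.30, rng)
X_mar = X.copy()
X_mar[:, 1:][m_mar] = np.nan
```

Under MCAR the masked and observed rows have the same distribution, while under this MAR sketch rows with a larger value in column 0 lose more entries, which is the kind of dependence an imputation method has to cope with.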
Supported by the Research Chair NINA under the aegis of the Risk Foundation, an initiative by BNP Cardif.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fuchs, R., Pommeret, D., Stocksieker, S. (2023). MI2AMI: Missing Data Imputation Using Mixed Deep Gaussian Mixture Models. In: Nicosia, G., et al. Machine Learning, Optimization, and Data Science. LOD 2022. Lecture Notes in Computer Science, vol 13810. Springer, Cham. https://doi.org/10.1007/978-3-031-25599-1_16
DOI: https://doi.org/10.1007/978-3-031-25599-1_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25598-4
Online ISBN: 978-3-031-25599-1
eBook Packages: Computer Science, Computer Science (R0)