Abstract
Correlation matrix visualization is essential for understanding the relationships between variables in a dataset, but missing data can seriously distort it. In this paper, we compare the effects of various missing data methods on the correlation plot, focusing on two missing-data patterns: randomly missing data and monotone missing data. We aim to provide practical strategies and recommendations for researchers and practitioners who create and analyze correlation plots under missing data. Our experimental results suggest that, although imputation is commonly used for missing data, plotting the correlation matrix from imputed data may lead to seriously misleading inferences about the relationships between features. In addition, the most accurate technique for computing a correlation matrix (in terms of RMSE) does not always give the correlation plot that most resembles the one based on complete data (the ground truth). Based on their performance in the experiments, we recommend ImputePCA [1] for small datasets and DPER [2] for moderate and large datasets when plotting the correlation matrix.
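The comparison described in the abstract can be illustrated with a minimal sketch. It assumes synthetic Gaussian data, a completely-at-random missing pattern, and column-mean imputation as a simple stand-in; it does not implement ImputePCA or DPER. The helper names (`make_mcar`, `mean_impute`, `corr_rmse`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_mcar(X, rate, rng):
    """Return a copy of X with entries set to NaN completely at random."""
    X_miss = X.copy()
    mask = rng.random(X.shape) < rate
    X_miss[mask] = np.nan
    return X_miss

def mean_impute(X_miss):
    """Fill each missing entry with the mean of its column's observed values."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)

def corr_rmse(C_est, C_true):
    """Root mean squared error between two correlation matrices."""
    return float(np.sqrt(np.mean((C_est - C_true) ** 2)))

rng = np.random.default_rng(0)
p = 4

# Assumed population correlation: 0.8 between every pair of features.
true_corr = np.full((p, p), 0.8)
np.fill_diagonal(true_corr, 1.0)
X = rng.multivariate_normal(np.zeros(p), true_corr, size=2000)

# Ground truth: the correlation matrix of the complete data.
C_true = np.corrcoef(X, rowvar=False)

# Remove 40% of the entries, impute, and recompute the correlation matrix.
X_miss = make_mcar(X, 0.4, rng)
C_imputed = np.corrcoef(mean_impute(X_miss), rowvar=False)

rmse = corr_rmse(C_imputed, C_true)
print("RMSE vs. ground truth:", rmse)
print("Off-diagonal mean (complete):", C_true[~np.eye(p, dtype=bool)].mean())
print("Off-diagonal mean (imputed): ", C_imputed[~np.eye(p, dtype=bool)].mean())
```

In this setup, mean imputation shrinks the off-diagonal correlations toward zero, so a heatmap drawn from the imputed data would understate how strongly the features are related. This is one simple instance of the kind of misleading plot the abstract warns about.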
N.-H. Pham, K.-L. Vo, and M. A. Vu contributed equally to this work.
References
Josse, J., Husson, F.: missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Softw. 70, 1–31 (2016)
Nguyen, T., Nguyen-Duy, K.M., Nguyen, D.H.M., Nguyen, B.T., Wade, B.A.: DPER: direct parameter estimation for randomly missing data. Knowl.-Based Syst. 240, 108082 (2022)
Nguyen, P., et al.: Faster imputation using singular value decomposition for sparse data. In: Asian Conference on Intelligent Information and Database Systems, pp. 135–146. Springer, Singapore (2023). https://doi.org/10.1007/978-981-99-5834-4_11
Lien, P.L., Do, T.T., Nguyen, T.: Data imputation for multivariate time-series data. In: 2023 15th International Conference on Knowledge and Systems Engineering (KSE), pp. 1–6. IEEE (2023)
Nguyen, T., Storås, A.M., Thambawita, V., Hicks, S.A., Halvorsen, P., Riegler, M.A.: Multimedia datasets: challenges and future possibilities. In: International Conference on Multimedia Modeling, pp. 711–717. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-27818-1_58
van Buuren, S., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw., pp. 1–68 (2010)
Vu, M.A., et al.: Conditional expectation for missing data imputation. arXiv preprint arXiv:2302.00911 (2023)
Stekhoven, D.J., Bühlmann, P.: MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1), 112–118 (2012)
Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)
Yoon, J., Jordon, J., van der Schaar, M.: GAIN: missing data imputation using generative adversarial nets. CoRR, abs/1806.02920 (2018)
Spinelli, I., Scardapane, S., Uncini, A.: Missing data imputation with adversarially-trained graph convolutional networks. Neural Netw. 129, 249–260 (2020)
Nguyen, T., Nguyen, D.H.M., Nguyen, H., Nguyen, B.T., Wade, B.A.: EPEM: efficient parameter estimation for multiple class monotone missing data. Inf. Sci. 567, 1–22 (2021)
Nguyen, T., Phan, T.N., Hoang, V.H., Halvorsen, P., Riegler, M., Nguyen, B.: Efficient parameter estimation for missing data when many features are fully observed (2023)
Kraus, M., et al.: Assessing 2D and 3D heatmaps for comparative analysis: an empirical study. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–14 (2020)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc.: Ser. B (Methodol.) 39(1), 1–38 (1977)
LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
Nguyen, T., Ly, H.T., Riegler, M.A., Halvorsen, P., Hammer, H.L.: Principal components analysis based frameworks for efficient missing data imputation algorithms. In: Asian Conference on Intelligent Information and Database Systems, pp. 254–266. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-42430-4_21
Do, T.T., et al.: Blockwise principal component analysis for monotone missing data imputation and dimensionality reduction. arXiv preprint arXiv:2305.06042 (2023)
Acknowledgments
We thank AISIA Research Lab, SimulaMet (Oslo, Norway), the University of Science, and Vietnam National University Ho Chi Minh City (VNU-HCM) for supporting this project under grant number DS2023-18-01.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pham, NH. et al. (2024). Correlation Visualization Under Missing Values: A Comparison Between Imputation and Direct Parameter Estimation Methods. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_8
DOI: https://doi.org/10.1007/978-3-031-53302-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer Science (R0)