Correlation Visualization Under Missing Values: A Comparison Between Imputation and Direct Parameter Estimation Methods

Pham, Nhat-Hao; Vo, Khanh-Linh; Vu, Mai Anh; Nguyen, Thu; Riegler, Michael A.; Halvorsen, Pål; Nguyen, Binh T.

doi:10.1007/978-3-031-53302-0_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14557)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Correlation matrix visualization is essential for understanding the relationships between variables in a dataset, but missing data can seriously distort this important visualization tool. In this paper, we compare the effects of various missing data methods on the correlation plot, focusing on two missing patterns: randomly missing data and monotone missing data. We aim to provide practical strategies and recommendations for researchers and practitioners who create and analyze correlation plots under missing data. Our experimental results suggest that, although imputation is commonly used for missing data, plotting the correlation matrix from imputed data may lead to significantly misleading inferences about the relations between features. In addition, the technique that computes the correlation matrix most accurately (in terms of RMSE) does not always give the correlation plot that most resembles the one based on complete data (the ground truth). Based on their performance in the experiments, we recommend using ImputePCA [1] for small datasets and DPER [2] for moderate and large datasets when plotting the correlation matrix.
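The kind of distortion the abstract describes can be illustrated with a small sketch. The code below is not the paper's experimental setup and does not implement ImputePCA or DPER; it assumes synthetic Gaussian data, a 40% completely-at-random missing rate, and simple mean imputation, all chosen only for illustration. The helper names `corr_rmse` and `mean_impute` are hypothetical. It measures the RMSE between the correlation matrix computed from imputed data and the one computed from the complete data (the ground truth).

```python
import numpy as np


def corr_rmse(estimated, truth):
    """Root-mean-square error between two correlation matrices."""
    return float(np.sqrt(np.mean((estimated - truth) ** 2)))


def mean_impute(x):
    """Replace each NaN with the observed mean of its column."""
    filled = x.copy()
    col_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


# Synthetic complete data (an assumption for illustration, not the paper's datasets)
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.8, 0.5],
                     [0.8, 1.0, 0.3],
                     [0.5, 0.3, 1.0]])
complete = rng.multivariate_normal(np.zeros(3), true_cov, size=2000)
ground_truth = np.corrcoef(complete, rowvar=False)

# Remove roughly 40% of the entries completely at random
missing = complete.copy()
mask = rng.random(missing.shape) < 0.4
missing[mask] = np.nan

# Correlation matrix computed from mean-imputed data
imputed_corr = np.corrcoef(mean_impute(missing), rowvar=False)
rmse = corr_rmse(imputed_corr, ground_truth)

print("ground truth:\n", np.round(ground_truth, 2))
print("after mean imputation:\n", np.round(imputed_corr, 2))
print("RMSE:", round(rmse, 3))
```

In this toy setting, mean imputation pulls the off-diagonal correlations toward zero, so a heatmap drawn from the imputed data would understate how strongly the features are related. Passing either matrix to a heatmap routine would give the corresponding correlation plot.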

N.-H. Pham, K.-L. Vo, and M. A. Vu contributed equally.

References

1. Josse, J., Husson, F.: missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Softw. 70, 1–31 (2016)
2. Nguyen, T., Nguyen-Duy, K.M., Nguyen, D.H.M., Nguyen, B.T., Wade, B.A.: DPER: direct parameter estimation for randomly missing data. Knowl.-Based Syst. 240, 108082 (2022)
3. Nguyen, P., et al.: Faster imputation using singular value decomposition for sparse data. In: Asian Conference on Intelligent Information and Database Systems, pp. 135–146. Springer, Singapore (2023). https://doi.org/10.1007/978-981-99-5834-4_11
4. Lien, P.L., Do, T.T., Nguyen, T.: Data imputation for multivariate time-series data. In: 2023 15th International Conference on Knowledge and Systems Engineering (KSE), pp. 1–6. IEEE (2023)
5. Nguyen, T., Storås, A.M., Thambawita, V., Hicks, S.A., Halvorsen, P., Riegler, M.A.: Multimedia datasets: challenges and future possibilities. In: International Conference on Multimedia Modeling, pp. 711–717. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-27818-1_58
6. van Buuren, S., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw., pp. 1–68 (2010)
7. Vu, M.A., et al.: Conditional expectation for missing data imputation. arXiv preprint arXiv:2302.00911 (2023)
8. Stekhoven, D.J., Bühlmann, P.: Missforest-non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1), 112–118 (2012)
9. Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)
10. Yoon, J., Jordon, J., van der Schaar, M.: GAIN: missing data imputation using generative adversarial nets. CoRR, abs/1806.02920 (2018)
11. Spinelli, I., Scardapane, S., Uncini, A.: Missing data imputation with adversarially-trained graph convolutional networks. Neural Netw. 129, 249–260 (2020)
12. Nguyen, T., Nguyen, D.H.M., Nguyen, H., Nguyen, B.T., Wade, B.A.: EPEM: efficient parameter estimation for multiple class monotone missing data. Inf. Sci. 567, 1–22 (2021)
13. Nguyen, T., Phan, T.N., Hoang, V.H., Halvorsen, P., Riegler, M., Nguyen, B.: Efficient parameter estimation for missing data when many features are fully observed (2023)
14. Kraus, M., et al.: Assessing 2D and 3D heatmaps for comparative analysis: an empirical study. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–14 (2020)
15. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
16. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc.: Ser. B (Methodol.) 39(1), 1–38 (1977)
17. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
18. Nguyen, T., Ly, H.T., Riegler, M.A., Halvorsen, P., Hammer, H.L.: Principal components analysis based frameworks for efficient missing data imputation algorithms. In: Asian Conference on Intelligent Information and Database Systems, pp. 254–266. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-42430-4_21
19. Do, T.T., et al.: Blockwise principal component analysis for monotone missing data imputation and dimensionality reduction. arXiv preprint arXiv:2305.06042 (2023)

Acknowledgments

We would like to thank AISIA Research Lab, SimulaMet (Oslo, Norway), the University of Science, and Vietnam National University Ho Chi Minh City (VNU-HCM) for supporting us under grant number DS2023-18-01 during this project.

Author information

Authors and Affiliations

Faculty of Mathematics and Computer Science, University of Science, Ho Chi Minh City, Vietnam
Nhat-Hao Pham, Khanh-Linh Vo, Mai Anh Vu & Binh T. Nguyen
Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
Nhat-Hao Pham, Khanh-Linh Vo, Mai Anh Vu & Binh T. Nguyen
SimulaMet, Oslo, Norway
Thu Nguyen, Michael A. Riegler & Pål Halvorsen

Corresponding author

Correspondence to Binh T. Nguyen.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
Delft University of Technology, Delft, The Netherlands
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
Cynthia Liem
University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Reykjavik University, Reykjavik, Iceland
Björn Þór Jónsson
Microsoft Research Lab – Asia, Beijing, China
Bei Liu
The University of Tokyo, Tokyo, Japan
Yoko Yamakata

About this paper

Cite this paper

Pham, NH. et al. (2024). Correlation Visualization Under Missing Values: A Comparison Between Imputation and Direct Parameter Estimation Methods. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_8

DOI: https://doi.org/10.1007/978-3-031-53302-0_8
Published: 29 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer Science, Computer Science (R0)
