Correlation Visualization Under Missing Values: A Comparison Between Imputation and Direct Parameter Estimation Methods

  • Conference paper
MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14557)

Abstract

Correlation matrix visualization is essential for understanding the relationships between variables in a dataset, but missing data can seriously affect this important data visualization tool. In this paper, we compare the effects of various missing data methods on the correlation plot, focusing on two missing-data patterns: randomly missing data and monotone missing data. We aim to provide practical strategies and recommendations for researchers and practitioners in creating and analyzing the correlation plot under missing data. Our experimental results suggest that while imputation is commonly used for missing data, using imputed data for plotting the correlation matrix may lead to a significantly misleading inference of the relation between the features. In addition, the most accurate technique for computing a correlation matrix (in terms of RMSE) does not always give the correlation plot that most resembles the one based on complete data (the ground truth). Based on their performance in the experiments, we recommend using ImputePCA [1] for small datasets and DPER [2] for moderate and large datasets when plotting the correlation matrix.
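As a rough illustration of the comparison described above, the following sketch estimates a correlation matrix from randomly missing data in two simple ways and scores each against the complete-data matrix by RMSE. It is a toy setup under stated assumptions, not the paper's experiment: the synthetic data, the 30% missing rate, and the two estimators (mean imputation and pandas' pairwise-complete correlation) are stand-ins chosen for brevity, and neither ImputePCA nor DPER is implemented here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy complete data (an assumption for illustration): four features driven by
# one shared latent factor, so every pair of features is strongly correlated.
n_samples, n_features = 500, 4
latent = rng.standard_normal((n_samples, 1))
noise = 0.5 * rng.standard_normal((n_samples, n_features))
complete = pd.DataFrame(latent + noise, columns=["x0", "x1", "x2", "x3"])

# Randomly missing pattern: each cell is dropped independently with
# probability 0.3 (the rate is an arbitrary choice).
missing = rng.random(complete.shape) < 0.3
observed = complete.mask(missing)


def rmse(estimate: pd.DataFrame, truth: pd.DataFrame) -> float:
    """Root mean squared error between two correlation matrices."""
    diff = estimate.to_numpy() - truth.to_numpy()
    return float(np.sqrt(np.mean(diff ** 2)))


# Ground truth: correlation matrix of the complete data.
corr_truth = complete.corr()

# Stand-in imputation method: fill each column with its observed mean,
# then compute the correlation matrix of the filled data.
corr_mean = observed.fillna(observed.mean()).corr()

# Stand-in estimator without imputation: DataFrame.corr() skips missing
# values pairwise, so each entry uses the rows where both columns are observed.
corr_pairwise = observed.corr()

rmse_mean = rmse(corr_mean, corr_truth)
rmse_pairwise = rmse(corr_pairwise, corr_truth)
print(f"RMSE, mean imputation:   {rmse_mean:.3f}")
print(f"RMSE, pairwise-complete: {rmse_pairwise:.3f}")
```

In this toy setup, mean imputation pulls the off-diagonal correlations toward zero, which is one simple way a correlation plot built from imputed data can mislead. Any of the three matrices can be passed to a heatmap routine to compare the plots visually as well as by RMSE.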

N.-H. Pham, K.-L. Vo and M. A. Vu: the three authors contributed equally.

References

  1. Josse, J., Husson, F.: missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Softw. 70, 1–31 (2016)

  2. Nguyen, T., Nguyen-Duy, K.M., Nguyen, D.H.M., Nguyen, B.T., Wade, B.A.: DPER: direct parameter estimation for randomly missing data. Knowl.-Based Syst. 240, 108082 (2022)

  3. Nguyen, P., et al.: Faster imputation using singular value decomposition for sparse data. In: Asian Conference on Intelligent Information and Database Systems, pp. 135–146. Springer, Singapore (2023). https://doi.org/10.1007/978-981-99-5834-4_11

  4. Lien, P.L., Do, T.T., Nguyen, T.: Data imputation for multivariate time-series data. In: 2023 15th International Conference on Knowledge and Systems Engineering (KSE), pp. 1–6. IEEE (2023)

  5. Nguyen, T., Storås, A.M., Thambawita, V., Hicks, S.A., Halvorsen, P., Riegler, M.A.: Multimedia datasets: challenges and future possibilities. In: International Conference on Multimedia Modeling, pp. 711–717. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-27818-1_58

  6. van Buuren, S., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw., pp. 1–68 (2010)

  7. Vu, M.A., et al.: Conditional expectation for missing data imputation. arXiv preprint arXiv:2302.00911 (2023)

  8. Stekhoven, D.J., Bühlmann, P.: Missforest-non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1), 112–118 (2012)

  9. Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)

  10. Yoon, J., Jordon, J., van der Schaar, M.: GAIN: missing data imputation using generative adversarial nets. CoRR, abs/1806.02920 (2018)

  11. Spinelli, I., Scardapane, S., Uncini, A.: Missing data imputation with adversarially-trained graph convolutional networks. Neural Netw. 129, 249–260 (2020)

  12. Nguyen, T., Nguyen, D.H.M., Nguyen, H., Nguyen, B.T., Wade, B.A.: EPEM: efficient parameter estimation for multiple class monotone missing data. Inf. Sci. 567, 1–22 (2021)

  13. Nguyen, T., Phan, T.N., Hoang, V.H., Halvorsen, P., Riegler, M., Nguyen, B.: Efficient parameter estimation for missing data when many features are fully observed (2023)

  14. Kraus, M., et al.: Assessing 2D and 3D heatmaps for comparative analysis: an empirical study. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–14 (2020)

  15. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  16. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc.: Ser. B (Methodol.) 39(1), 1–38 (1977)

  17. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)

  18. Nguyen, T., Ly, H.T., Riegler, M.A., Halvorsen, P., Hammer, H.L.: Principal components analysis based frameworks for efficient missing data imputation algorithms. In: Asian Conference on Intelligent Information and Database Systems, pp. 254–266. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-42430-4_21

  19. Do, T.T., et al.: Blockwise principal component analysis for monotone missing data imputation and dimensionality reduction. arXiv preprint arXiv:2305.06042 (2023)

Acknowledgments

We thank AISIA Research Lab, SimulaMet (Oslo, Norway), the University of Science, and Vietnam National University Ho Chi Minh City (VNU-HCM) for supporting this project under grant number DS2023-18-01.

Author information

Corresponding author

Correspondence to Binh T. Nguyen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pham, NH. et al. (2024). Correlation Visualization Under Missing Values: A Comparison Between Imputation and Direct Parameter Estimation Methods. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_8

  • DOI: https://doi.org/10.1007/978-3-031-53302-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53301-3

  • Online ISBN: 978-3-031-53302-0

  • eBook Packages: Computer Science, Computer Science (R0)
