A visual analysis approach for data imputation via multi-party tabular data correlation strategies

Zhu, Haiyang; Han, Dongming; Pan, Jiacheng; Wei, Yating; Feng, Yingchaojie; Weng, Luoxuan; Mao, Ketian; Xing, Yuankai; Lv, Jianshu; Wan, Qiucheng; Chen, Wei

doi:10.1631/FITEE.2300480

Research Article
Published: 29 December 2023

Volume 25, pages 398–414, (2024)

Frontiers of Information Technology & Electronic Engineering

Haiyang Zhu¹,² (ORCID: orcid.org/0000-0002-4782-5654),
Dongming Han¹,
Jiacheng Pan¹,
Yating Wei³,
Yingchaojie Feng¹,
Luoxuan Weng¹,
Ketian Mao¹,
Yuankai Xing²,
Jianshu Lv²,
Qiucheng Wan² &
Wei Chen¹ (ORCID: orcid.org/0000-0002-8365-4741)

An Erratum to this article was published on 31 January 2024

This article has been updated

Abstract

Data imputation is an essential pre-processing task for data governance, aimed at filling in incomplete data. However, conventional data imputation methods can only partly alleviate data incompleteness using isolated tabular data, and they fail to achieve the best balance between accuracy and efficiency. In this paper, we present a novel visual analysis approach for data imputation. We develop a multi-party tabular data association strategy that uses intelligent algorithms to identify similar columns and establish column correlations across multiple tables. Then, we perform the initial imputation of incomplete data using correlated data entries from other tables. Additionally, we develop a visual analysis system to refine data imputation candidates. Our interactive system combines the multi-party data imputation approach with expert knowledge, allowing for a better understanding of the relational structure of the data. This significantly enhances the accuracy and efficiency of data imputation, thereby improving the quality of data governance and the intrinsic value of data assets. Experimental validation and user surveys demonstrate that this method supports users in verifying and judging the associated columns and similar rows using their domain knowledge.
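
The abstract does not say which similarity measure or matching algorithm is used, so the following is only a minimal sketch of the general idea (find similar columns across two tables, then fill missing cells from correlated rows), not the authors' method. It assumes a simple Jaccard overlap of column values as the similarity measure and an exact match on a shared key column to align rows. The table contents and the names `jaccard`, `match_columns`, `impute`, `orders`, and `crm` are all hypothetical.

```python
from typing import Dict, List, Optional

# A table is modelled as: column name -> list of cell values (None = missing).
Table = Dict[str, List[Optional[str]]]


def jaccard(a: List[Optional[str]], b: List[Optional[str]]) -> float:
    """Overlap of the distinct non-missing values of two columns (assumed measure)."""
    sa = {v for v in a if v is not None}
    sb = {v for v in b if v is not None}
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def match_columns(left: Table, right: Table, threshold: float = 0.3) -> Dict[str, str]:
    """Map each column of `left` to its most similar column of `right`, if any."""
    mapping: Dict[str, str] = {}
    for lname, lvals in left.items():
        best_name, best_score = None, threshold
        for rname, rvals in right.items():
            score = jaccard(lvals, rvals)
            if score >= best_score:
                best_name, best_score = rname, score
        if best_name is not None:
            mapping[lname] = best_name
    return mapping


def impute(left: Table, right: Table, key: str, mapping: Dict[str, str]) -> Table:
    """Fill missing cells of `left` from rows of `right` that share the key value."""
    rkey = mapping[key]
    # Index the rows of `right` by their key value for direct lookup.
    row_of = {k: i for i, k in enumerate(right[rkey]) if k is not None}
    filled = {name: list(vals) for name, vals in left.items()}
    for name, vals in filled.items():
        rname = mapping.get(name)
        if rname is None:
            continue  # no correlated column found in the other table
        for i, v in enumerate(vals):
            j = row_of.get(left[key][i])
            if v is None and j is not None:
                vals[i] = right[rname][j]
    return filled


# Hypothetical data held by two parties, with different column names.
orders: Table = {
    "customer_id": ["c1", "c2", "c3"],
    "city": ["Hangzhou", None, "Ningbo"],
}
crm: Table = {
    "cust": ["c1", "c2", "c3", "c4"],
    "location": ["Hangzhou", "Shanghai", "Ningbo", "Hangzhou"],
}

mapping = match_columns(orders, crm)
filled = impute(orders, crm, key="customer_id", mapping=mapping)
```

In this sketch the filled values are only initial candidates; in the approach described above, such candidates would then be reviewed and refined by an expert through the visual analysis system.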

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Change history

31 January 2024
An Erratum to this paper has been published: https://doi.org/10.1631/FITEE.23e0480

Author information

Authors and Affiliations

The State Key Lab of CAD & CG, Zhejiang University, Hangzhou, 310058, China
Haiyang Zhu, Dongming Han, Jiacheng Pan, Yingchaojie Feng, Luoxuan Weng, Ketian Mao & Wei Chen
Wuchan Zhongda Digital Technology Co., Ltd., Hangzhou, 310020, China
Haiyang Zhu, Yuankai Xing, Jianshu Lv & Qiucheng Wan
Zhejiang Metals and Materials Co., Ltd., Hangzhou, 310005, China
Yating Wei

Contributions

Haiyang ZHU conceptualized the main idea and led the research. Haiyang ZHU and Wei CHEN surveyed the relevant materials. All the authors had in-depth discussions; they drafted, revised, and finalized the paper.

Corresponding author

Correspondence to Wei Chen.

Ethics declarations

Haiyang ZHU, Dongming HAN, Jiacheng PAN, Yating WEI, Yingchaojie FENG, Luoxuan WENG, Ketian MAO, Yuankai XING, Jianshu LV, Qiucheng WAN, and Wei CHEN declare that they have no conflict of interest.

Additional information

Project supported by the Key R&D “Pioneer” Tackling Plan Program of Zhejiang Province, China (No. 2023C01119), the “Ten Thousand Talents Plan” Science and Technology Innovation Leading Talent Program of Zhejiang Province, China (No. 2022R52044), the Major Standardization Pilot Projects for the Digital Economy (Digital Trade Sector) of Zhejiang Province, China (No. SJ-BZ/2023053), and the National Natural Science Foundation of China (No. 62132017).

About this article

Cite this article

Zhu, H., Han, D., Pan, J. et al. A visual analysis approach for data imputation via multi-party tabular data correlation strategies. Front Inform Technol Electron Eng 25, 398–414 (2024). https://doi.org/10.1631/FITEE.2300480

Received: 17 July 2023
Accepted: 29 October 2023
Published: 29 December 2023
Issue Date: March 2024
DOI: https://doi.org/10.1631/FITEE.2300480

Key words

CLC number

TP391.4
