REMIAN: Real-Time and Error-Tolerant Missing Value Imputation

Authors:

Wang-Chien Lee,

Xindong Wu

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 14, Issue 6

Article No.: 77, Pages 1 - 38

https://doi.org/10.1145/3412364

Published: 28 September 2020

Abstract

Missing value (MV) imputation is a critical preprocessing step in data mining. However, existing MV imputation methods are mostly designed for batch processing and are therefore not applicable to streaming data, especially streams of poor quality. In this article, we propose a framework, called Real-time and Error-tolerant Missing vAlue ImputatioN (REMAIN), to impute MVs in poor-quality streaming data. Instead of imputing MVs based on all the observed data, REMAIN first initializes the MV imputation model using a-RANSAC, which detects and rejects anomalies efficiently, and then incrementally updates the model parameters as new data arrive to support real-time MV imputation. Because the correlations among attributes of the data may change over time in unforeseeable ways, we devise a deterioration detection mechanism that captures the deterioration of the imputation model and thereby further improves imputation accuracy. Finally, we conduct an extensive evaluation of the proposed algorithms on real-world and synthetic datasets. Experimental results demonstrate that REMAIN achieves significantly higher imputation accuracy than existing solutions, while reducing time cost by up to one order of magnitude.
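
The three stages named in the abstract (anomaly-rejecting initialization, incremental updates, and deterioration detection) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it substitutes plain RANSAC for a-RANSAC, assumes a single predictor attribute with a linear relation y = a*x + b, and uses a simple windowed mean residual as the deterioration signal. The names `ransac_inliers`, `StreamingImputer`, `tol`, `window`, and `drift_factor` are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


def ransac_inliers(points, trials=50, tol=1.0, seed=0):
    """Plain RANSAC: return the largest set of points within tol of a 2-point line."""
    rng = random.Random(seed)
    best = []
    for _ in range(trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical candidate line, skip it
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tol]
        if len(inliers) > len(best):
            best = inliers
    return best


class StreamingImputer:
    """Hypothetical sketch: imputes y from x with the line y = a*x + b."""

    def __init__(self, history, tol=1.0, window=5, drift_factor=3.0):
        self.tol = tol
        self.drift_factor = drift_factor
        self.errors = deque(maxlen=window)  # recent absolute residuals
        # Running sums of the accepted points: n, sum x, sum y, sum x^2, sum x*y.
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0
        for x, y in ransac_inliers(history, tol=tol):
            self._accumulate(x, y)

    def _accumulate(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        """Least-squares slope and intercept computed from the running sums."""
        denom = self.n * self.sxx - self.sx * self.sx
        if denom == 0:
            return 0.0, (self.sy / self.n if self.n else 0.0)
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        return a, (self.sy - a * self.sx) / self.n

    def impute(self, x):
        a, b = self.coefficients()
        return a * x + b

    def observe(self, x, y=None):
        """Return y if present, otherwise an imputed value; update sums in O(1)."""
        if y is None:
            return self.impute(x)
        residual = abs(y - self.impute(x))
        self.errors.append(residual)
        if residual <= self.tol:  # suspected anomalies do not update the model
            self._accumulate(x, y)
        return y

    def deteriorated(self):
        """True when the recent mean residual suggests the model no longer fits."""
        full = len(self.errors) == self.errors.maxlen
        mean_error = sum(self.errors) / len(self.errors) if self.errors else 0.0
        return full and mean_error > self.drift_factor * self.tol


# Clean points on y = 2x + 1 plus two injected anomalies.
history = [(x, 2 * x + 1) for x in range(20)] + [(5, 100), (12, -50)]
model = StreamingImputer(history)
print(model.observe(30))  # imputes the missing y at x = 30
```

In this sketch a caller would re-run the initialization on recent data once `deteriorated()` returns true; how the actual framework reacts to deterioration is described in the full article, not here.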

Published In

ACM Transactions on Knowledge Discovery from Data Volume 14, Issue 6

December 2020

376 pages

ISSN: 1556-4681

EISSN: 1556-472X

DOI: 10.1145/3427188

Editors:
Charu Aggarwal (IBM T. J. Watson Research, USA),
Xindong Wu (Mininglamp Academy of Sciences, China)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 September 2020

Accepted: 01 July 2020

Revised: 01 May 2020

Received: 01 December 2019

Published in TKDD Volume 14, Issue 6

Qualifiers

Research-article
Research
Refereed

Funding Sources

China Postdoctoral Science Foundation
National Science Foundation
Liaoning Revitalization Talents Program
National Natural Science Foundation of China
Liaoning Collaborative Fund
