REMAIN: Real-Time and Error-Tolerant Missing Value Imputation

Published: 28 September 2020

Abstract

Missing value (MV) imputation is a critical preprocessing step for data mining. However, existing MV imputation methods are mostly designed for batch processing and are therefore not applicable to streaming data, especially poor-quality streaming data. In this article, we propose a framework, called Real-time and Error-tolerant Missing vAlue ImputatioN (REMAIN), to impute MVs in poor-quality streaming data. Instead of imputing MVs based on all the observed data, REMAIN first initializes the MV imputation model based on a-RANSAC, which is capable of detecting and rejecting anomalies efficiently, and then incrementally updates the model parameters upon the arrival of new data to support real-time MV imputation. As the correlations among attributes of the data may change over time in unforeseeable ways, we devise a deterioration detection mechanism that captures the deterioration of the imputation model to further improve the imputation accuracy. Finally, we conduct an extensive evaluation of the proposed algorithms using real-world and synthetic datasets. Experimental results demonstrate that REMAIN achieves significantly higher imputation accuracy than existing solutions. Meanwhile, REMAIN reduces time cost by up to one order of magnitude compared with existing approaches.
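The three stages named in the abstract (anomaly-rejecting initialization, incremental updating, and deterioration detection) can be illustrated with a minimal sketch. This is not the REMAIN algorithm: the abstract does not specify a-RANSAC or the update rule, so the sketch substitutes plain RANSAC and running least-squares sums for a single-predictor line y = a*x + b. The names `ransac_init` and `StreamingImputer`, and the parameters `tol`, `window`, and `drift_tol`, are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


def ransac_init(points, trials=200, tol=1.0, seed=0):
    """Plain RANSAC (a stand-in, not a-RANSAC): return the largest inlier set."""
    rng = random.Random(seed)
    best = []
    for _ in range(trials):
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # a vertical pair cannot define y = a*x + b
        a = (q[1] - p[1]) / (q[0] - p[0])
        b = p[1] - a * p[0]
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tol]
        if len(inliers) > len(best):
            best = inliers
    if not best:
        raise ValueError("no usable sample pair found")
    return best


class StreamingImputer:
    """Hypothetical class: imputes y from x with a line refreshed in O(1)."""

    def __init__(self, history, tol=1.0, window=5, drift_tol=3.0):
        # Stage 1: initialize from inliers only, so anomalies are ignored.
        inliers = ransac_init(history, tol=tol)
        self.n = len(inliers)
        self.sx = sum(x for x, _ in inliers)
        self.sy = sum(y for _, y in inliers)
        self.sxx = sum(x * x for x, _ in inliers)
        self.sxy = sum(x * y for x, y in inliers)
        self.tol = tol
        self.drift_tol = drift_tol
        self.recent = deque(maxlen=window)  # recent absolute errors
        self._refresh()

    def _refresh(self):
        # Closed-form least squares from the running sums.
        denom = self.n * self.sxx - self.sx ** 2
        self.a = (self.n * self.sxy - self.sx * self.sy) / denom
        self.b = (self.sy - self.a * self.sx) / self.n

    def impute(self, x):
        """Estimate a missing y from the observed x."""
        return self.a * x + self.b

    def observe(self, x, y):
        """Process a complete tuple; return True if the model looks stale."""
        err = abs(y - self.impute(x))
        self.recent.append(err)
        if err <= self.tol:
            # Stage 2: incremental update, only with tuples that fit the model.
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y
            self._refresh()
        # Stage 3: flag deterioration when the recent mean error is large.
        full = len(self.recent) == self.recent.maxlen
        return full and sum(self.recent) / len(self.recent) > self.drift_tol
```

In this sketch, a caller would build the imputer from a buffer of historical tuples, call `impute(x)` whenever y is missing, call `observe(x, y)` for each complete tuple, and re-run the initialization on recent data once `observe` returns True. Rejecting poorly fitting tuples from the update while still recording their errors is one possible design choice: isolated anomalies leave the parameters untouched, whereas a sustained run of large errors triggers the deterioration flag.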

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 14, Issue 6
December 2020
376 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3427188

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 September 2020
Accepted: 01 July 2020
Revised: 01 May 2020
Received: 01 December 2019
Published in TKDD Volume 14, Issue 6

Author Tags

  1. Missing value
  2. poor-quality streaming data
  3. real-time imputation

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • China Postdoctoral Science Foundation
  • National Science Foundation
  • Liaoning Revitalization Talents Program
  • National Natural Science Foundation of China
  • Liaoning Collaborative Fund
