Abstract
As new technologies evolve, the volume of digital data produced keeps growing, and data management strategies are needed to handle the resulting large-scale datasets. Data gathered from different sources, such as sensor networks, social media, and business transactions, is inherently uncertain because of noise, missing values, inconsistencies, and other problems that degrade the quality of big data analytics. One of the key challenges in this context is data cleansing, i.e. detecting and repairing dirty data, and various techniques have been proposed to address it. However, to the best of our knowledge, there has been no comprehensive review of data cleansing techniques for big data analytics. This survey therefore presents a comprehensive and systematic study of state-of-the-art big data cleansing mechanisms. The mechanisms are reviewed in five categories: machine learning-based, sample-based, expert-based, rule-based, and framework-based. A number of articles are reviewed in each category. Furthermore, the paper outlines the advantages and disadvantages of the selected data cleansing techniques and discusses the related parameters, comparing the techniques in terms of scalability, efficiency, accuracy, and usability. Finally, some suggestions for future work on improving big data cleansing mechanisms are provided.
Data availability
All data and results are reported in the paper.
Funding
None.
Ethics declarations
Conflict of interest
There is no conflict of interest among authors.
Ethical approval
The submitted work is original and has not been published elsewhere in any form or language.
Informed consent
None.
About this article
Cite this article
Hosseinzadeh, M., Azhir, E., Ahmed, O.H. et al. Data cleansing mechanisms and approaches for big data analytics: a systematic study. J Ambient Intell Human Comput 14, 99–111 (2023). https://doi.org/10.1007/s12652-021-03590-2
DOI: https://doi.org/10.1007/s12652-021-03590-2