Data cleansing mechanisms and approaches for big data analytics: a systematic study

Hosseinzadeh, Mehdi; Azhir, Elham; Ahmed, Omed Hassan; Ghafour, Marwan Yassin; Ahmed, Sarkar Hasan; Rahmani, Amir Masoud; Vo, Bay

doi:10.1007/s12652-021-03590-2

Original Research
Published: 17 November 2021

Volume 14, pages 99–111, (2023)

Journal of Ambient Intelligence and Humanized Computing

Mehdi Hosseinzadeh¹,
Elham Azhir²,
Omed Hassan Ahmed³,
Marwan Yassin Ghafour⁴,
Sarkar Hasan Ahmed⁵,
Amir Masoud Rahmani⁶ &
Bay Vo⁷

Abstract

As new technologies evolve, the production of digital data grows constantly, and data management strategies are needed to handle large-scale datasets. Data gathered from different sources, such as sensor networks, social media, and business transactions, is inherently uncertain due to noise, missing values, inconsistencies, and other problems that affect the quality of big data analytics. One of the key challenges in this context is detecting and repairing dirty data, i.e., data cleansing, and various techniques have been proposed to address it. However, to the best of our knowledge, there has not been any comprehensive review of data cleansing techniques for big data analytics. This survey therefore presents a comprehensive and systematic study of the state-of-the-art mechanisms for big data cleansing. The mechanisms are reviewed in five categories: machine learning-based, sample-based, expert-based, rule-based, and framework-based. A number of articles are reviewed in each category. Furthermore, the paper describes the advantages and disadvantages of the selected data cleansing techniques and discusses the related parameters, comparing the techniques in terms of scalability, efficiency, accuracy, and usability. Finally, some suggestions for further work are provided to improve big data cleansing mechanisms in the future.

Data availability

All data and results are reported in the paper.

Funding

None.

Author information

Authors and Affiliations

Pattern Recognition and Machine Learning Lab, Gachon University, 1342 Seongnamdaero, Sujeonggu, Seongnam, 13120, Republic of Korea
Mehdi Hosseinzadeh
Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Elham Azhir
Department of Information Technology, College of Science and Technology, University of Human Development, Sulaymaniyah, Iraq
Omed Hassan Ahmed
Department of Computer Science, College of Science, University of Halabja, Halabja, Iraq
Marwan Yassin Ghafour
Network Department, Sulaimani Polytechnic University, Sulaymaniyah, Iraq
Sarkar Hasan Ahmed
Future Technology Research Center, National Yunlin University of Science and Technology, Douliou, Yunlin, 64002, Taiwan
Amir Masoud Rahmani
Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh City, Vietnam
Bay Vo

Corresponding author

Correspondence to Mehdi Hosseinzadeh.

Ethics declarations

Conflict of interest

There is no conflict of interest among the authors.

Ethical approval

The submitted work is original and has not been published elsewhere in any form or language.

Informed consent

None.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hosseinzadeh, M., Azhir, E., Ahmed, O.H. et al. Data cleansing mechanisms and approaches for big data analytics: a systematic study. J Ambient Intell Human Comput 14, 99–111 (2023). https://doi.org/10.1007/s12652-021-03590-2

Received: 01 October 2020
Accepted: 28 October 2021
Published: 17 November 2021
Issue Date: January 2023
DOI: https://doi.org/10.1007/s12652-021-03590-2
