Data cleansing mechanisms and approaches for big data analytics: a systematic study

  • Original Research
  • Journal of Ambient Intelligence and Humanized Computing

Abstract

With the evolution of new technologies, the production of digital data is constantly growing, and data management strategies are needed to handle the resulting large-scale datasets. Data gathered from sources such as sensor networks, social media, and business transactions are inherently uncertain because of noise, missing values, inconsistencies, and other problems that degrade the quality of big data analytics. One of the key challenges in this context is data cleansing, i.e., detecting and repairing dirty data, and various techniques have been proposed to address it. However, to the best of our knowledge, there has been no comprehensive review of data cleansing techniques for big data analytics. This survey therefore presents a comprehensive and systematic study of the state-of-the-art mechanisms for big data cleansing. The mechanisms are reviewed in five categories: machine learning-based, sample-based, expert-based, rule-based, and framework-based. A number of articles are reviewed in each category. Furthermore, the paper outlines the advantages and disadvantages of the selected data cleansing techniques and discusses the related parameters, comparing the techniques in terms of scalability, efficiency, accuracy, and usability. Finally, some suggestions for future work on improving big data cleansing mechanisms are provided.
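
To make the notion of detecting and repairing dirty data concrete, the following is a minimal rule-based sketch. It is not taken from any of the surveyed mechanisms: the sensor records, the `temp_c` field, the valid range of -50 to 60 degrees Celsius, and the median-based repair are all assumptions chosen purely for illustration.

```python
from statistics import median

# Hypothetical sensor readings; None marks a missing value.
records = [
    {"sensor": "s1", "temp_c": 21.5},
    {"sensor": "s2", "temp_c": None},    # missing value
    {"sensor": "s3", "temp_c": 22.0},
    {"sensor": "s4", "temp_c": 950.0},   # violates the range rule (noise)
    {"sensor": "s5", "temp_c": 20.5},
]

# Assumed integrity rule: plausible temperatures lie within [LOW, HIGH].
LOW, HIGH = -50.0, 60.0


def is_dirty(value):
    """Detection step: flag missing or out-of-range values."""
    return value is None or not (LOW <= value <= HIGH)


def cleanse(rows):
    """Repair step: replace each dirty value with the median of the clean ones.

    Raises statistics.StatisticsError if no clean value exists.
    """
    clean_values = [r["temp_c"] for r in rows if not is_dirty(r["temp_c"])]
    fill = median(clean_values)
    repaired = []
    for r in rows:
        dirty = is_dirty(r["temp_c"])
        repaired.append({
            "sensor": r["sensor"],
            "temp_c": fill if dirty else r["temp_c"],
            "repaired": dirty,  # keep an audit flag for every changed value
        })
    return repaired


cleaned = cleanse(records)
for row in cleaned:
    print(row)
```

The sketch separates detection from repair and keeps a `repaired` flag so that changes remain auditable. At big data scale, the same two steps would typically run on a distributed engine rather than an in-memory list.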

Data availability

All data and results are reported in the paper.

Funding

None.

Author information

Corresponding author

Correspondence to Mehdi Hosseinzadeh.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

The submitted work is original and has not been published elsewhere in any form or language.

Informed consent

None.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hosseinzadeh, M., Azhir, E., Ahmed, O.H. et al. Data cleansing mechanisms and approaches for big data analytics: a systematic study. J Ambient Intell Human Comput 14, 99–111 (2023). https://doi.org/10.1007/s12652-021-03590-2

  • DOI: https://doi.org/10.1007/s12652-021-03590-2
