BIGQA: Declarative Big Data Quality Assessment

Published: 22 August 2023

Abstract

In the big data domain, data quality assessment operations are often complex and must run in a distributed and timely manner. This article aims to generalize quality assessment operations by proposing BIGQA, a new ISO-based declarative data quality assessment framework. BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It facilitates the planning and execution of big data quality assessment operations for data domain experts and data management specialists at any phase in the data life cycle. This work implements BIGQA to demonstrate its ability to produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans using straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. It also supports incremental data quality assessment, which avoids rereading the whole dataset each time an assessment is required. BIGQA was validated using radiation wireless sensor data and Stack Overflow users’ data to show that it can be applied in different contexts. The experiments show a 71% performance improvement on a 1 GB flat file on a single processing machine compared with a non-parallel application, and a 75% performance improvement on a 25 GB flat file in a distributed environment compared with a non-distributed application.
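To make the idea of incremental assessment concrete, the following is a minimal sketch and not BIGQA's actual implementation, which the abstract does not describe. It assumes a simple completeness measure (the share of non-missing values) and keeps a small mergeable summary, so a new batch can be assessed on its own and combined with the earlier result. The names `CompletenessState` and `assess_batch` are hypothetical, as is the choice to treat empty strings as missing.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CompletenessState:
    """Mergeable summary: values seen so far and how many were present."""
    total: int = 0
    non_null: int = 0

    def merge(self, other: "CompletenessState") -> "CompletenessState":
        # Counts are additive, so summaries built independently
        # (per partition or per batch) can be combined in any order.
        return CompletenessState(self.total + other.total,
                                 self.non_null + other.non_null)

    def score(self) -> float:
        # Assumption: an empty dataset is reported as fully complete.
        return self.non_null / self.total if self.total else 1.0


def assess_batch(values: Iterable[Optional[str]]) -> CompletenessState:
    """Summarize one batch; None and empty strings count as missing."""
    total = 0
    non_null = 0
    for value in values:
        total += 1
        if value is not None and value != "":
            non_null += 1
    return CompletenessState(total, non_null)


# Initial load, then a later batch assessed without rereading the first one.
state = assess_batch(["12.5", None, "13.1", ""])
state = state.merge(assess_batch(["12.9", "13.4"]))
print(state.score())
```

Because `merge` only adds counts, the same summary could in principle be computed per partition in parallel and reduced afterwards; measures that are not additive in this way would need a different kind of state.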

Published in

Journal of Data and Information Quality, Volume 15, Issue 3
September 2023
326 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/3611329


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 22 August 2023
• Online AM: 13 June 2023
• Accepted: 12 May 2023
• Revised: 16 April 2023
• Received: 16 April 2022
