Abstract
In the big data domain, data quality assessment operations are often complex and must be executed in a distributed and timely manner. This article generalizes quality assessment operations by proposing BIGQA, a new ISO-based declarative data quality assessment framework. BIGQA is a flexible solution that supports data quality assessment across different domains and contexts. It helps data domain experts and data management specialists plan and execute big data quality assessment operations at any phase of the data life cycle. This work implements BIGQA to demonstrate that it can produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans from straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. Moreover, it supports incremental data quality assessment, which avoids reading the whole dataset each time an assessment is required. The framework was validated using radiation wireless sensor data and Stack Overflow users’ data to show that it can be applied in different contexts. The experiments show a 71% performance improvement on a 1 GB flat file on a single processing machine compared with a non-parallel application, and a 75% performance improvement on a 25 GB flat file in a distributed environment compared with a non-distributed application.
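The idea of incremental assessment can be illustrated with a minimal Python sketch. This is not BIGQA's implementation or operator set, which the abstract does not describe. It assumes a simple completeness measure (the share of non-missing values) and uses hypothetical names (`CompletenessState`, `assess_batch`, `assess_partitions`). The point is only that when a measure is kept as a mergeable partial state, partitions can be assessed independently and newly arrived data can be folded in without rereading earlier data.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class CompletenessState:
    """Hypothetical mergeable partial result for a completeness measure."""
    non_missing: int = 0
    total: int = 0

    def merge(self, other: "CompletenessState") -> "CompletenessState":
        # Merging only adds counters, so it is associative and commutative;
        # partial states can therefore be combined in any order.
        return CompletenessState(
            self.non_missing + other.non_missing,
            self.total + other.total,
        )

    def ratio(self) -> float:
        # Treat an empty dataset as fully complete (an assumption of this sketch).
        return self.non_missing / self.total if self.total else 1.0


def assess_batch(values: Iterable[Optional[float]]) -> CompletenessState:
    """Scan one batch (or partition) once and return its partial state."""
    non_missing = 0
    total = 0
    for value in values:
        total += 1
        if value is not None:
            non_missing += 1
    return CompletenessState(non_missing, total)


def assess_partitions(
    partitions: Sequence[Iterable[Optional[float]]],
) -> CompletenessState:
    """Assess each partition independently, then merge the partial states."""
    state = CompletenessState()
    for partition in partitions:
        state = state.merge(assess_batch(partition))
    return state


# Initial assessment over two partitions of made-up sensor readings.
initial = assess_partitions([[1.0, None, 3.0], [None, 5.0]])

# Later, only the newly arrived batch is scanned and merged into the old state.
updated = initial.merge(assess_batch([7.0, 8.0, None]))
```

Because `merge` only adds counters, each call to `assess_batch` could in principle run on a separate worker, and the stored state is all that has to be kept between assessment runs.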
Index Terms
- BIGQA: Declarative Big Data Quality Assessment