
Context-aware Big Data Quality Assessment: A Scoping Review

Authors:
Hadi Fadlallah, Saint-Joseph University, Lebanon (ORCID: 0000-0003-1160-5980)
Rima Kilany, Saint-Joseph University, Lebanon (ORCID: 0000-0002-5710-6901)
Houssein Dhayne, Saint-Joseph University, Lebanon (ORCID: 0000-0002-7476-4754)
Rami El Haddad, Saint-Joseph University, Lebanon (ORCID: 0000-0001-6285-0279)
Rafiqul Haque, Intelligencia R & D, France (ORCID: 0000-0001-5705-3427)
Yehia Taher, University of Versailles Saint-Quentin-en-Yvelines (UVSQ), France (ORCID: 0000-0002-8706-8889)
Ali Jaber, Lebanese University, Lebanon (ORCID: 0000-0002-6976-133X)

Journal of Data and Information Quality, Volume 15, Issue 3, Article No. 25, pp. 1–33. https://doi.org/10.1145/3603707

Published: 22 August 2023


Abstract

Data quality refers to the fitness of data for its intended use. Poor data quality leads to inadequate, inconsistent, and erroneous decisions that can increase computational cost, reduce profits, and cause customer churn. Data quality is therefore crucial for both researchers and industry practitioners.

Different factors drive the assessment of data quality. Data context is deemed one of the key factors because real-world use cases, and the entities involved in them (such as people and organizations), are contextually diverse. Data that is fit for use in one context (e.g., an organization's policy) may not be adequate in another. Hence, implementing a data quality assessment solution that works across different contexts is challenging.

Traditional data quality assessment technologies have reached maturity, and existing solutions can resolve most quality issues. In these solutions, the data context is defined as validation rules applied within the ETL (extract, transform, load) process, i.e., the data warehousing process. In contrast to traditional data quality management, it is impossible to specify all the data semantics beforehand for big data. Context-aware data quality rules are needed to detect semantic errors in massive amounts of heterogeneous data generated at high speed. While many researchers tackle the quality issues of big data, each defines the data context from a specific standpoint. Although data quality is a longstanding research issue in academia and industry, it remains open, especially with the advent of big data, which has made data quality assessment more challenging than ever.

This article provides a scoping review of existing context-aware data quality assessment solutions, starting with big data quality solutions in general and then covering context-aware solutions. The strengths and weaknesses of these solutions are outlined and discussed. The review showed that none of the existing data quality assessment solutions could guarantee context awareness together with the ability to handle big data. Notably, each solution dealt only with a partial view of the context. We compared the existing quality models and solutions to reach a comprehensive view covering the aspects of context awareness when assessing data quality. This led us to a set of recommendations, framed in a methodological framework, for the design and implementation of any context-aware data quality service for big data. Open challenges are then identified and discussed.

REFERENCES

[1] Abedjan Ziawasch, Golab Lukasz, and Naumann Felix. 2017. Data profiling: A tutorial. In Proceedings of the 2017 ACM International Conference on Management of Data (2017), 1747–1751.Google Scholar
[2] Abedjan Ziawasch, Golab Lukasz, Naumann Felix, and Papenbrock Thorsten. 2018. Data profiling. Synthes. Lect. Data Manag. 10, 4 (2018), 1–154.Google ScholarCross Ref
[3] Acosta Maribel, Zaveri Amrapali, Simperl Elena, Kontokostas Dimitris, Auer Sören, and Lehmann Jens. 2013. Crowdsourcing linked data quality assessment. In The Semantic Web–ISWC 2013: 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21–25, 2013, Proceedings, Part II 12. Springer, 260–276.Google ScholarDigital Library
[4] Agrawal Divyakant, Bernstein Philip, Bertino Elisa, Davidson Susan, Dayal Umeshwas, Franklin Michael, Gehrke Johannes, Haas Laura, Halevy Alon, Han Jiawei et al. 2011. Challenges and Opportunities with Big Data [White Paper]. Technical Report. Computing Research Association. Retrieved from http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf.Google Scholar
[5] Al-Jaroodi Jameela and Mohamed Nader. 2018. Service-oriented architecture for big data analytics in smart cities. In 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID’18). 633–640.Google Scholar
[6] AlShaer Mohammed, Taher Yehia, Haque Rafiqul, Hacid Mohand-Saïd, and Dbouk Mohamed. 2019. IBRIDIA: A hybrid solution for processing big logistics data. Fut. Gen. Comput. Syst. 97 (2019), 792–804.Google ScholarDigital Library
[7] Ardagna Danilo, Cappiello Cinzia, Samá Walter, and Vitali Monica. 2018. Context-aware data quality assessment for big data. Fut. Gen. Comput. Syst. 89 (2018), 548–562.Google ScholarDigital Library
[8] Azeroual Otmane and Abuosba Mohammad. 2019. Improving the data quality in the research information systems. arXiv preprint arXiv:1901.07388 (2019).Google Scholar
[9] Bārzdiņš Jānis, Zariņš Andris, Čerāns Kārlis, Kalniņš Audris, Rencis Edgars, Lāce Lelde, Liepiņš Renārs, and Sprog̀is Artūrs. 2007. GrTP: Transformation based graphical tool building platform. In 10th International Conference on Model-driven Engineering Languages and Systems, Models.Google Scholar
[10] Batini Carlo, Cabitza Federico, Cappiello Cinzia, and Francalanci Chiara. 2008. A comprehensive data quality methodology for web and structured data. Int. J. Innov. Comput. Applic. 1, 3 (2008), 205–218.Google ScholarDigital Library
[11] Batini Carlo, Rula Anisa, Scannapieco Monica, and Viscusi Gianluigi. 2015. From data quality to big data quality. J. Datab. Manag. 26, 1 (2015), 60–82.Google ScholarDigital Library
[12] Bello Sururah A., Oyedele Lukumon O., Akinade Olugbenga O., Bilal Muhammad, Delgado Juan Manuel Davila, Akanbi Lukman A., Ajayi Anuoluwapo O., and Owolabi Hakeem A.. 2021. Cloud computing in construction industry: Use cases, benefits and challenges. Automat. Construct. 122 (2021), 103441.Google ScholarCross Ref
[13] Bernstein Philip A., Madhavan Jayant, and Rahm Erhard. 2011. Generic schema matching, ten years later. Proc. VLDB Endow. 4, 11 (2011), 695–701.Google ScholarDigital Library
[14] Bhimani Janki, Mi Ningfang, Leeser Miriam, and Yang Zhengyu. 2017. FiM: Performance prediction for parallel computation in iterative data processing applications. In IEEE 10th International Conference on Cloud Computing (CLOUD’17). 359–366.Google Scholar
[15] Bhimani Janki, Mi Ningfang, Leeser Miriam, and Yang Zhengyu. 2019. New performance modeling methods for parallel data processing applications. ACM Trans. Model. Comput. Simul. 29, 3 (2019), 1–24.Google ScholarDigital Library
[16] Bicevska Zane, Bicevskis Janis, and Oditis Ivo. 2017. Domain-specific characteristics of data quality. Federated Conference on Computer Science and Information Systems (FedCSIS’17). 999–1003.Google Scholar
[17] Bicevska Zane, Bicevskis Janis, and Oditis Ivo. 2018. Models of data quality. In Information Technology for Management. Ongoing Research and Development: 15th Conference, AITM 2017, and 12th Conference, ISM 2017, Held as Part of FedCSIS, Prague, Czech Republic, September 3–6, 2017, Extended Selected Papers 15. Springer, 194–211.Google ScholarCross Ref
[18] Bicevskis Janis, Bicevska Zane, and Karnitis Girts. 2017. Executable data quality models. Procedia Comput. Sci. 104 (2017), 138–145.Google ScholarDigital Library
[19] Bicevskis Janis, Bicevska Zane, Nikiforova Anastasija, and Oditis Ivo. 2018. An approach to data quality evaluation. In Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS’18). 196–201.Google Scholar
[20] Biscobing Jacqueline. 2018. What Is Data Sampling? Retrieved from https://www.techtarget.com/searchbusinessanalytics/definition/data-sampling.Google Scholar
[21] Bronselaer Antoon, Nielandt Joachim, Boeckling Toon, and Tré Guy De. 2018. Operational measurement of data quality. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Applications: 17th International Conference, IPMU 2018, Cádiz, Spain, June 11–15, 2018, Proceedings, Part III 17. Springer, 517–528.Google ScholarCross Ref
[22] Brüggemann Stefan and Grüning Fabian. 2009. Using ontologies providing domain knowledge for data quality management. Networked Knowledge-Networked Media: Integrating Knowledge Management, New Media Technologies and Semantic Systems. Springer, 187–203.Google ScholarCross Ref
[23] Buneman Peter and Davidson Susan B.. 2010. Data provenance–The foundation of data quality. In Workshop: Issues and Opportunities for Improving the Quality and Use of Data within the DoD, Arlington, 26–28.Google Scholar
[24] Cai Li and Zhu Yangyong. 2015. The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 14 (2015).Google ScholarCross Ref
[25] Carlo Batini, Daniele Barone, Federico Cabitza, and Simone Grega. 2011. A data quality methodology for heterogeneous data. Int. J. Datab. Manag. Syst. 3, 1 (2011), 60–79.Google Scholar
[26] Choi O.-Hoon, Lim Jun-Eun, Na Hong-Seok, and Baik Doo-Kwon. 2008. An efficient method of data quality using quality evaluation ontology. 2008 Third International Conference on Convergence and Hybrid Information Technology 2 (2008), 1058–1061.Google Scholar
[27] Cichy Corinna and Rass Stefan. 2019. An overview of data quality frameworks. IEEE Access 7 (2019), 24634–24648.Google ScholarCross Ref
[28] Clarke Roger. 2014. Quality Factors in Big Data and Big Data Analytics. Xamax Consultancy Pty Ltd.Google Scholar
[29] Cormode Graham and Duffield Nick. 2014. Sampling for big data: A tutorial. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1975–1975.Google Scholar
[30] Corporation Microsoft. 2013. Data Quality Services. Retrieved from https://docs.microsoft.com/en-us/sql/data-quality-services/data-quality-services?view=sql-server-ver15.Google Scholar
[31] Corporation Microsoft. 2018. SQL Server Integration Services. Retrieved from https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-ver15.Google Scholar
[32] Corporation Oracle. 2013. Comprehensive Data Quality with Oracle Data Integrator and Oracle Enterprise Data Quality [White Paper]. Technical Report. Oracle Corporation. Retrieved from https://www.oracle.com/technetwork/middleware/data-integrator/overview/oracledi-comprehensive-quality-131748.pdf.Google Scholar
[33] Dai Wei, Wardlaw Isaac, Cui Yu, Mehdi Kashif, Li Yanyan, and Long Jun. 2016. Data profiling technology of data governance regarding big data: Review and rethinking. In Information Technology: New Generations: 13th International Conference on Information Technology. Springer, 439–450.Google ScholarCross Ref
[34] Dai Wei, Yoshigoe Kenji, and Parsley William. 2018. Improving data quality through deep learning and statistical models. In Information Technology-New Generations: 14th International Conference on Information Technology. 515–522.Google Scholar
[35] Daki Houda, Hannani Asmaa El, Aqqal Abdelhak, Haidine Abdelfattah, and Dahbi Aziz. 2017. Big Data management in smart grid: Concepts, requirements and implementation. J. Big Data 4, 1 (2017), 1–19.Google ScholarCross Ref
[36] Dean Jeffrey and Ghemawat Sanjay. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1, 107–113.Google ScholarDigital Library
[37] Dhayne Houssein, Haque Rafiqul, Kilany Rima, and Taher Yehia. 2019. In search of big medical data integration solutions—A comprehensive survey. IEEE Access 7 (2019), 91265–91290.Google ScholarCross Ref
[38] Dmitriyev Viktor, Mahmoud Tariq, and Marín-Ortega Pablo Michel. 2015. Int. J. Inf. Syst. Proj. Manag. 3, 3 (2015), 49–63.Google Scholar
[39] Dong Xin Luna, Berti-Equille Laure, and Srivastava Divesh. 2013. Data fusion: Resolving conflicts from multiple sources. Handbook of Data Quality: Research and Practice. Springer, 293–318.Google ScholarCross Ref
[40] Dong Xin Luna and Srivastava Divesh. 2013. Big data integration. In IEEE 29th International Conference on Data Engineering (ICDE’13). IEEE, 1245–1248.Google ScholarDigital Library
[41] Dragoni Nicola, Lanese Ivan, Larsen Stephan Thordal, Mazzara Manuel, Mustafin Ruslan, and Safina Larisa. 2018. Microservices: How to make your application scale. In Perspectives of System Informatics: 11th International Andrei P. Ershov Informatics Conference, PSI 2017, Moscow, Russia, June 27–29, 2017, Revised Selected Papers 11. Springer, 95–104.Google ScholarCross Ref
[42] Durairaj M. and Poornappriya T. S.. 2018. Importance of MapReduce for big data applications: A survey. Asian J. Comput. Sci. Technol. 7, 1 (2018), 112–118.Google ScholarCross Ref
[43] Ehrlinger Lisa, Werth Bernhard, and Wöß Wolfram. 2018. Automated continuous data quality measurement with QuaIIe. Int. J. Advanc. Softw. 11, 3 (2018), 400–417.Google Scholar
[44] Ehrlinger Lisa, Werth Bernhard, and Wöß Wolfram. 2018. QuaIIe: A data quality assessment tool for integrated information systems. In 10th International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA’18). 21–31.Google Scholar
[45] Ehrlinger Lisa and Wöß Wolfram. 2017. Automated data quality monitoring. In 22nd MIT International Conference on Information Quality (ICIQ’17). 15–1.Google Scholar
[46] Even Adir and Shankaranarayanan Ganesan. 2005. Value-driven data quality assessment. In International Conference on Information Quality (ICIQ’05).Google Scholar
[47] Even Adir and Shankaranarayanan Ganesan. 2007. Utility-driven assessment of data quality. ACM SIGMIS Datab.: DATAB. Adv. Inf. Syst. 38, 2 (2007), 75–93.Google ScholarDigital Library
[48] Fadlallah Hadi, Taher Yehia, Haque Rafiqul, and Jaber Ali. 2019. ORADIEX: A big data driven smart framework for real-time surveillance and analysis of individual exposure to radioactive pollution. In International Conference on Big Data and Cybersecurity Intelligence (BDCSIntell’19). 52–56.Google Scholar
[49] Fadlallah Hadi, Taher Yehia, and Jaber Ali. 2018. RaDEn: A scalable and efficient radiation data engineering. In International Conference on Big Data and Cybersecurity Intelligence (BDCSIntell’18). 89–93.Google Scholar
[50] Salas Óscar Figuerola, Adzic Velibor, Shah Akash, and Kalva Hari. 2013. Assessing internet video quality using crowdsourcing. In 2nd ACM International Workshop on Crowdsourcing for Multimedia. 23–28.Google Scholar
[51] Finkel Jenny Rose, Grenager Trond, and Manning Christopher D.. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05). 363–370.Google Scholar
[52] Gao Jerry, Xie Chunli, and Tao Chuanqi. 2016. Big data validation and quality assuranceIssues, challenges, and needs. In IEEE symposium on service-oriented system engineering (SOSE16). 433–441.Google Scholar
[53] Ge Mouzhi and Helfert Markus. 2007. A review of information quality research-develop a research agenda. In International Conference on Information Quality (ICIQ’07). 76–91.Google Scholar
[54] Gu Rong, Qi Yang, Wu Tongyu, Wang Zhaokang, Xu Xiaolong, Yuan Chunfeng, and Huang Yihua. 2021. SparkDQ: Efficient generic big data quality management on distributed data-parallel computation. J. ParallelDistrib. Comput. 156 (2021), 132–147.Google ScholarCross Ref
[55] Gudivada Venkat, Apon Amy, and Ding Junhua. 2017. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Advanc. Softw. 10, 1 (2017), 1–20.Google Scholar
[56] Gudivada Venkat N., Rao Dhana, and Grosky William I.. 2016. Data quality centric application framework for big data. In International Conference on Big Data, Small Data, Linked Data and Open Data (ALLDATA’16).Google Scholar
[57] Hariri Reihaneh H., Fredericks Erik M., and Bowers Kate M.. 2019. Uncertainty in big data analytics: Survey, opportunities, and challenges. J. Big Data 6, 1 (2019), 1–16.Google ScholarCross Ref
[58] Hasselbring Wilhelm. 2016. Microservices for scalability: Keynote talk abstract. In Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering. 133–134.Google Scholar
[59] Hay Brian, Nance Kara, and Bishop Matt. 2011. Storm clouds rising: Security challenges for IaaS cloud computing. In 2011 44th Hawaii International Conference on System Sciences. 1–7.Google Scholar
[60] He Qinlu, Li Zhanhuai, and Zhang Xiao. 2010. Data deduplication techniques. In 2010 International Conference on Future Information Technology and Management Engineering 1 (2010), 430–433.Google Scholar
[61] He Qing, Wang Haocheng, Zhuang Fuzhen, Shang Tianfeng, and Shi Zhongzhi. 2015. Parallel sampling from big data with uncertainty distribution. Fuzzy Sets Syst. 258 (2015), 117–133.Google ScholarDigital Library
[62] Helfert Markus and Foley Owen. 2009. A context aware information quality framework. In 2009 Fourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology. 187–193.Google Scholar
[63] Hogan Aidan, Blomqvist Eva, Cochez Michael, d’Amato Claudia, Melo Gerard de, Gutierrez Claudio, Kirrane Sabrina, Gayo José Emilio Labra, Navigli Roberto, Neumaier Sebastian, et al. 2021. Knowledge graphs. ACM Comput. Surv. 54, 4 (2021), 1–37.Google ScholarDigital Library
[64] Hosseini Kasra, Nanni Federico, and Ardanuy Mariona Coll. 2020. DeezyMatch: A flexible deep learning approach to fuzzy string matching. In Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 62–69.Google ScholarCross Ref
[65] Hoßfeld Tobias, Hirth Matthias, Korshunov Pavel, Hanhart Philippe, Gardlo Bruno, Keimel Christian, and Timmerer Christian. 2014. Survey of web-based crowdsourcing frameworks for subjective quality assessment. In IEEE 16th International Workshop on Multimedia Signal Processing (MMSP’14). 1–6.Google Scholar
[66] Ilyas Ihab F. and Chu Xu. 2019. Data Cleaning. ACM New York, NY.Google ScholarDigital Library
[67] Immonen Anne, Pääkkönen Pekka, and Ovaska Eila. 2015. Evaluating the quality of social media data in big data architecture. IEEE Access 3 (2015), 2028–2043.Google ScholarCross Ref
[68] Inc. Talend2022. Data Quality and Machine Learning: What’s the Connection? Retrieved from https://www.talend.com/resources/machine-learning-data-quality/.Google Scholar
[69] Informatica. 2018. Informatica Data Quality Data Sheet. Technical Report. Informatica. Retrieved from https://www.informatica.com/content/dam/informatica-com/en/collateral/data-sheet/en_informatica-data-quality_data-sheet_6710.pdf.Google Scholar
[70] Iqbal Muhammad Hussain, Soomro Tariq Rahim et al. 2015. Big data analysis: Apache Storm perspective. Int. J. Comput. Trends Technol. 19, 1 (2015), 9–14.Google ScholarCross Ref
[71] ISO/IEC. 2001. ISO/IEC 9126-1:2001. Software Engineering – Product Quality – Part 1: Quality Model. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/22749.html.Google Scholar
[72] ISO/IEC. 2008. 25012:2008 Software Engineering – Software Product Quality Requirements and Evaluation (SQuaRE) – Data Quality Model. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/35736.html.Google Scholar
[73] ISO/IEC. 2014. ISO/IEC 25000:2014. Systems and Software Engineering – System and Software Quality Requirements and Evaluation (SQuaRE) – Guide to SQuaRE. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/64764.html.Google Scholar
[74] ISO/IEC. 2015. ISO/IEC 25024:2015 Systems and Software Engineering – Systems and Software Quality Requirements and Evaluation (SQuaRE) – Measurement of Data Quality. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/35749.html.Google Scholar
[75] ISO/IEC. 2017. ISO/IEC 15939:2017 Systems and Software Engineering – Measurement Process. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/71197.html.Google Scholar
[76] ISO/IEC. 2020. ISO/IEC 20547-3:2020 Big Data Reference Architecture - Part 3: Reference Architecture. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/71277.html.Google Scholar
[77] ISO/IEC. 2022. ISO/IEC AWI 5259-1 Artificial Intelligence – Data Quality for Analytics and Machine Learning (ML) – Part 1: Overview, Terminology, and Examples. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/81088.html.Google Scholar
[78] ISO/TS. 2011. ISO/TS 8000-1:2011 - Data Quality - Part 1: Overview. Standard. ISO/TS. Retrieved from https://www.iso.org/standard/50798.html.Google Scholar
[79] Iverson Michael A., Ozguner Fusun, and Potter Lee C.. 1999. Statistical prediction of task execution times through analytic benchmarking for scheduling in a heterogeneous environment. In Proceedings Eighth Heterogeneous Computing Workshop (HCW’99). 99–111.Google Scholar
[80] Ji Changqing, Li Yu, Qiu Wenming, Awada Uchechukwu, and Li Keqiu. 2012. Big data processing in cloud computing environments. In 2012 12th International Symposium on Pervasive Systems, Algorithms and Networks (2012), 17–23.Google Scholar
[81] Kadadi Anirudh, Agrawal Rajeev, Nyamful Christopher, and Atiq Rahman. 2014. Challenges of data integration and interoperability in big data. In 2014 IEEE International Conference on Big Data (big data) (2014), 38–40.Google Scholar
[82] Kaiser Jiří. 2014. Dealing with missing values in data. J. Syst. Integr. 5, 1 (2014) 42–51.Google Scholar
[83] Karami Amir, Gangopadhyay Aryya, Zhou Bin, and Kharrazi Hadi. 2015. A fuzzy approach model for uncovering hidden latent semantic structure in medical text collections. In iConference 2015.Google Scholar
[84] Karmakar Anurag, Raghuthaman Anaswara, Kote Om Sudhakar, and Jayapandian N.. 2022. Cloud computing application: Research challenges and opportunity. In International Conference on Sustainable Computing and Data Communication Systems (ICSCDS’22). IEEE, 1284–1289.Google ScholarCross Ref
[85] Khayyat Zuhair, Ilyas Ihab F., Jindal Alekh, Madden S., Ouzzani M., Papotti Paolo, Quiané-Ruiz Jorge-Arnulfo, Tang Nan, and Yin Si. 2015. BigDansing: A system for big data cleansing. In SIGMOD Conference.Google ScholarDigital Library
[86] Kim Jae Kwang and Wang Zhonglei. 2019. Sampling techniques for big data analysis. Int. Statist. Rev. 87 (2019), S177–S191.Google ScholarCross Ref
[87] Kontokostas Dimitris, Zaveri Amrapali, Auer Sören, and Lehmann Jens. 2013. TripleCheckMate: A tool for crowdsourcing the quality assessment of linked data. In Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7–9, 2013. Proceedings 4. Springer, 265–272.Google ScholarCross Ref
[88] Kumar Pradeep, Bhatnagar Roheet, Gaur Kuntal, and Bhatnagar Anurag. 2021. Classification of imbalanced data: Review of methods and applications. IOP Conference Series: Materials Science and Engineering 1099, 1 (2021), 012077.Google Scholar
[89] Kusumasari Tien Fabrianti et al. 2016. Data profiling for data quality improvement with OpenRefine. In International Conference on Information Technology Systems and Innovation (ICITSI’16). 1–6.Google Scholar
[90] Leung Hareton K. N.. 2001. Quality metrics for intranet applications. Inf. Manag. 38, 3 (2001), 137–152.Google ScholarCross Ref
[91] Liu Zhicheng and Zhang Aoqian. 2020. Sampling for big data profiling: A survey. IEEE Access 8 (2020), 72713–72726.Google ScholarCross Ref
[92] L’Heureux Alexandra, Grolinger Katarina, Elyamany Hany F., and Capretz Miriam A. M.. 2017. Machine learning with big data: Challenges and approaches. IEEE Access 5 (2017), 7776–7797.Google ScholarCross Ref
[93] Malhotra Jyoti and Bakal Jagdish. 2015. A survey and comparative study of data deduplication techniques. In International Conference on Pervasive Computing (ICPC’15). 1–5.Google Scholar
[94] McKelvey Nigel, Curran Kevin, and Toland Luke. 2016. The Challenges of Data Cleansing with Data Warehouses. 77–82. DOI:Google ScholarCross Ref
[95] Mehrtak Mohammad, SeyedAlinaghi SeyedAhmad, MohsseniPour Mehrzad, Noori Tayebeh, Karimi Amirali, Shamsabadi Ahmadreza, Heydari Mohammad, Barzegary Alireza, Mirzapour Pegah, Soleymanzadeh Mahdi, et al. 2021. Security challenges and solutions using healthcare cloud computing. J. Med. Life 14, 4 (2021), 448.Google ScholarCross Ref
[96] Merino Jorge, Caballero Ismael, Rivas Bibiano, Serrano Manuel, and Piattini Mario. 2016. A data quality in use model for big data. Fut. Gen. Comput. Syst. 63 (2016), 123–130.Google ScholarDigital Library
[97] Mihindukulasooriya Nandana, García-Castro Raúl, Priyatna Freddy, Ruckhaus Edna, and Saturno Nelson. 2017. A linked data profiling service for quality assessment. In The Semantic Web: ESWC 2017 Satellite Events: ESWC 2017 Satellite Events, Portorož, Slovenia, May 28–June 1, 2017, Revised Selected Papers 14. Springer, 335–340.Google Scholar
[98] Missier Paolo, Embury Suzanne, Greenwood Mark, Preece Alun, and Jin Binling. 2006. Quality views: Capturing and exploiting the user perspective on data quality. In International Conference on Very Large Data Bases.Google Scholar
[99] Mousannif Hajar, Sabah Hasna, Douiji Yasmina, and Sayad Younes Oulad. 2014. From big data to big projects: A step-by-step roadmap. In 2014 International Conference on Future Internet of Things and Cloud. 373–378.Google Scholar
[100] Munn Zachary, Peters Micah D. J., Stern Cindy, Tufanaru Catalin, McArthur Alexa, and Aromataris Edoardo. 2018. Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med. Res. Methodol. 18 (2018), 1–7.Google ScholarCross Ref
[101] Mylavarapu Goutam, Thomas Johnson P., and Viswanathan K. Ashwin. 2019. An automated big data accuracy assessment tool. In IEEE 4th International Conference on Big Data Analytics (ICBDA’19). 193–197.Google Scholar
[102] Mylavarapu Goutam, Viswanathan K. Ashwin, and Thomas Johnson P.. 2019. Assessing context-aware data consistency. In IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA’19). 1–6.Google Scholar
[103] Najafabadi Maryam M., Villanustre Flavio, Khoshgoftaar Taghi M., Seliya Naeem, Wald Randall, and Muharemagic Edin. 2015. Deep learning applications and challenges in big data analytics. J. Big Data 2, 1 (2015), 1–21.Google ScholarCross Ref
[104] Nargesian Fatemeh, Zhu Erkang, Miller Renée J., Pu Ken Q., and Arocena Patricia C.. 2019. Data lake management: Challenges and opportunities. Proc. VLDB Endow. 12, 12 (2019), 1986–1989.Google ScholarDigital Library
[105] Naumann Felix. 2014. Data profiling revisited. ACM SIGMOD Rec. 42, 4 (2014), 40–49.Google ScholarDigital Library
[106] Niemelä Eila, Evesti Antti, and Savolainen Pekka. 2008. Modeling quality attribute variability. In International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE’08). 169–176.Google Scholar
[107] Nikiforova Anastasija and Bicevskis Janis. 2019. An extended data object-driven approach to data quality evaluation: Contextual data quality analysis. In International Conference on Enterprise Information Systems (ICEIS’19). 274–281.Google ScholarCross Ref
[108] Nikiforova Anastasija, Bicevskis Janis, Bicevska Zane, and Oditis Ivo. 2020. User-oriented approach to data quality evaluation. J. Univers. Comput. Sci. 26, 1 (2020), 107–126.Google ScholarCross Ref
[109] Pääkkönen Pekka and Pakkala Daniel. 2015. Reference architecture and classification of technologies, products and services for big data systems. Big Data Res. 2, 4 (2015), 166–186.Google ScholarDigital Library
[110] Patel-Schneider Peter F.. 2015. Towards large-scale schema and ontology matching. Retrieved from https://www.semanticscholar.org/paper/Towards-Large-scale-Schema-And-Ontology-Matching-Patel-Schneider/ceee2bdaef83a0f09480fa6fb191cf3372137152.Google Scholar
[111] Pérez Beatriz, Rubio Julio, and Sáenz-Adán Carlos. 2018. A systematic review of provenance systems. Knowl. Inf. Syst. 57 (2018), 495–543.Google ScholarDigital Library
[112] Pipino Leo L., Lee Yang W., and Wang Richard Y.. 2002. Data quality assessment. Commun. ACM 45, 4 (2002), 211–218.Google ScholarDigital Library
[113] Price Rosanne, Neiger Dina, and Shanks Graeme. 2008. Developing a measurement instrument for subjective aspects of information quality. Commun. Assoc. Inf. Syst. 22, 1 (2008), 3.Google Scholar
[114] Rahul Kumar and Banyal R. K.. 2019. Data cleaning mechanism for big data and cloud computing. In 6th International Conference on Computing for Sustainable Global Development (INDIACom’19). 195–198.Google Scholar
[115] Ramaswamy Lakshmish, Lawson Victor, and Gogineni Siva Venkat. 2013. Towards a quality-centric big data architecture for federated sensor services. In 2013 IEEE International Congress on Big Data. 86–93.Google Scholar
[116] Rawat R. and Yadav R.. 2021. Big data: Big data analysis, issues and challenges and technologies. IOP Conference Series: Materials Science and Engineering 1022, 1 (2021), 012014.Google Scholar
[117] Sadineni Praveen Kumar. 2020. Sampling based join-aggregate query processing technique for big data. Indian J. Comput. Sci. Eng. 11, 5, 532–546.Google ScholarCross Ref
[118] Saha Barna and Srivastava Divesh. 2014. Data quality: The other face of big data. In 2014 IEEE 30th International Conference on Data Engineering. 1294–1297.Google Scholar
[119] Schelter Sebastian, Lange Dustin, Schmidt Philipp, Celikel Meltem, Biessmann Felix, and Grafberger Andreas. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (2018), 1781–1794.Google ScholarDigital Library
[120] Sharma Gaurav. 2021. Data Quality. Retrieved from https://www.computer.org/publications/tech-news/trends/big-data-and-cloud-computing.Google Scholar
[121] Siegmund Norbert, Rosenmüller Marko, Kuhlemann Martin, Kästner Christian, Apel Sven, Duchateau Fabien, and Fagnan Justin. 2015. Schema matching bibtex. In Proceedings of the VLDB Endowment.Google Scholar
[122] Software Calidad. 2022. ISO/IEC 25012. Retrieved from https://iso25000.com/index.php/en/iso-25000-standards/iso-25012.Google Scholar
[123] Stojanović Dragan, Stojanović Natalija, and Turanjanin Jovan. 2015. Processing big trajectory and Twitter data streams using Apache STORM. (2015), 301–304. Retrieved from https://www.semanticscholar.org/paper/Schema-Matching-Bibtex-Siegmund-Rosenm%C3%BCller/a4d94ddaab429e5874386dd29822e470b57d6ee4.Google Scholar
[124] Strong Diane M., Lee Yang W., and Wang Richard Y. 1997. Data quality in context. Commun. ACM 40, 5 (1997), 103–110.
[125] Taher Yehia, Haque Rafiqul, AlShaer Mohammed, Heuvel Willem Jan van den, Hacid Mohand-Saïd, and Dbouk Mohamed. 2016. A context-aware analytics for processing tweets and analysing sentiment in realtime (short paper). In On the Move to Meaningful Internet Systems: OTM 2016 Conferences: Confederated International Conferences: CoopIS, C&TC, and ODBASE 2016, Rhodes, Greece, October 24–28, 2016, Proceedings. Springer, 910–917.
[126] Taher Yehia, Haque Rafiqul, and Hacid Mohand-Said. 2017. BDLaaS: Big data lab as a service for experimenting big data solution. In IEEE 2nd International Workshops on Foundations and Applications of Self* Systems (FAS* W’17). 155–159.
[127] Taleb Ikbal, Dssouli Rachida, and Serhani Mohamed Adel. 2015. Big data pre-processing: A quality framework. (2015), 191–198.
[128] Taleb Ikbal, Serhani Mohamed Adel, and Dssouli Rachida. 2018. Big data quality assessment model for unstructured data. In International Conference on Innovations in Information Technology (IIT’18). 69–74.
[129] Taleb Ikbal, Serhani Mohamed Adel, and Dssouli Rachida. 2019. Big data quality: A data quality profiling model. In Services–SERVICES 2019: 15th World Congress, Held as Part of the Services Conference Federation, SCF 2019, San Diego, CA, USA, June 25–30, 2019, Proceedings 15. Springer, 61–77.
[130] Talend. 2020. How to Manage Modern Data Quality [White Paper]. Technical Report. Talend. Retrieved from https://www.talend.com/resources/definitive-guide-data-quality-how-to-manage.
[131] Talha Mohamed, Elmarzouqi Nabil, and Kalam Anas Abou El. 2020. Towards a powerful solution for data accuracy assessment in the big data context. Int. J. Advanc. Comput. Sci. Applic. 11, 2 (2020).
[132] Venkataraman Shivaram, Yang Zongheng, Franklin Michael, Recht Benjamin, and Stoica Ion. 2016. Ernest: Efficient performance prediction for large-scale advanced analytics. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI’16). 363–378.
[133] Wang Lidong and Alexander Cheryl Ann. 2016. Machine learning in big data. Int. J. Math., Eng. Manag. Sci. 1, 2 (2016), 52–61.
[134] Wang Richard Y. 1998. A product perspective on total data quality management. Commun. ACM 41, 2 (1998), 58–65.
[135] Wang Richard Y. and Strong Diane. 1996. Beyond accuracy: What data quality means to data consumers. J. Manag. Inf. Syst. 12 (1996), 5–33.
[136] Wang Xinxin, Dang Depeng, and Guo Zixian. 2020. Evaluating the crowd quality for subjective questions based on a Spark computing environment. Fut. Gen. Comput. Syst. 106 (2020), 426–437.
[137] Wei-Liang Chen, Shi-Dong Zhang, and Xiang Gao. 2009. Anchoring the consistency dimension of data quality using ontology in data integration. (2009), 201–205.
[138] Woodall Philip, Oberhofer Martin, and Borek Alexander. 2014. A classification of data quality assessment and improvement methods. Int. J. Inf. Qual. 3, 4 (2014), 298–321.
[139] Zaslavsky Arkady, Perera Charith, and Georgakopoulos Dimitrios. 2013. Sensing as a service and big data. arXiv preprint arXiv:1301.0159 (2013).
[140] Zaveri Amrapali, Kontokostas Dimitris, Sherif Mohamed A., Bühmann Lorenz, Morsey Mohamed, Auer Sören, and Lehmann Jens. 2013. User-driven quality evaluation of DBpedia. In 9th International Conference on Semantic Systems. 97–104.
[141] Zhang Pengcheng, Zhou Xuewu, Li Wenrui, and Gao Jerry. 2017. A survey on quality assurance techniques for big data applications. (2017), 313–319.
[142] Zhang Zhenrong, Zhang Jianshu, Du Jun, and Wang Fengren. 2022. Split, embed and merge: An accurate table structure recognizer. Pattern Recognit. 126 (2022), 108565.
[143] Zhou Lina, Pan Shimei, Wang Jianwu, and Vasilakos Athanasios V. 2017. Machine learning on big data: Opportunities and challenges. Neurocomputing 237 (2017), 350–361.

Index Terms

Context-aware Big Data Quality Assessment: A Scoping Review
1. Information systems
  1. Data management systems

Published in
Journal of Data and Information Quality Volume 15, Issue 3
September 2023
326 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3611329
Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy
Issue’s Table of Contents
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 August 2023
- Online AM: 13 June 2023
- Accepted: 8 May 2023
- Revised: 23 March 2023
- Received: 16 April 2022

Author Tags
Data quality
big data
context awareness
data quality assessment
Qualifiers
- survey