Context-aware Big Data Quality Assessment: A Scoping Review

Published: 22 August 2023

Abstract

Data quality refers to how fit data is for its intended use. Poor data quality leads to inadequate, inconsistent, and erroneous decisions that can escalate computational cost, reduce profits, and cause customer churn. Thus, data quality is crucial for researchers and industry practitioners.

Different factors drive the assessment of data quality. Data context is deemed one of the key factors because real-world use cases vary widely across entities such as people and organizations. Data that is fit for one context (e.g., an organization's policy) may not be fit for another. Hence, implementing a data quality assessment solution that works across different contexts is challenging.

Traditional technologies for data quality assessment have reached maturity, and existing solutions can solve most quality issues. In these solutions, the data context is defined as validation rules applied within the ETL (extract, transform, load) process, i.e., the data warehousing process. For big data, by contrast, it is impossible to specify all the data semantics beforehand. We need context-aware data quality rules to detect semantic errors in massive amounts of heterogeneous data generated at high speed. While many researchers tackle the quality issues of big data, each defines the data context from a specific standpoint. Although data quality is a longstanding research topic in academia and industry, it remains an open issue, and the advent of big data has made data quality assessment more challenging than ever.

This article provides a scoping review of existing context-aware data quality assessment solutions, starting with big data quality solutions in general and then covering context-aware solutions. The strengths and weaknesses of these solutions are outlined and discussed. The survey showed that none of the existing data quality assessment solutions could guarantee context awareness while also handling big data. Notably, each solution dealt with only a partial view of the context. We compared the existing quality models and solutions to reach a comprehensive view covering the aspects of context awareness when assessing data quality. This led us to a set of recommendations framed in a methodological framework that shapes the design and implementation of any context-aware data quality service for big data. Open challenges are then identified and discussed.

REFERENCES

  1. [1] Abedjan Ziawasch, Golab Lukasz, and Naumann Felix. 2017. Data profiling: A tutorial. In Proceedings of the 2017 ACM International Conference on Management of Data (2017), 17471751.Google ScholarGoogle Scholar
  2. [2] Abedjan Ziawasch, Golab Lukasz, Naumann Felix, and Papenbrock Thorsten. 2018. Data profiling. Synthes. Lect. Data Manag. 10, 4 (2018), 1154.Google ScholarGoogle ScholarCross RefCross Ref
  3. [3] Acosta Maribel, Zaveri Amrapali, Simperl Elena, Kontokostas Dimitris, Auer Sören, and Lehmann Jens. 2013. Crowdsourcing linked data quality assessment. In The Semantic Web–ISWC 2013: 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21–25, 2013, Proceedings, Part II 12. Springer, 260276.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. [4] Agrawal Divyakant, Bernstein Philip, Bertino Elisa, Davidson Susan, Dayal Umeshwas, Franklin Michael, Gehrke Johannes, Haas Laura, Halevy Alon, Han Jiawei et al. 2011. Challenges and Opportunities with Big Data [White Paper]. Technical Report. Computing Research Association. Retrieved from http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf.Google ScholarGoogle Scholar
  5. [5] Al-Jaroodi Jameela and Mohamed Nader. 2018. Service-oriented architecture for big data analytics in smart cities. In 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID’18). 633640.Google ScholarGoogle Scholar
  6. [6] AlShaer Mohammed, Taher Yehia, Haque Rafiqul, Hacid Mohand-Saïd, and Dbouk Mohamed. 2019. IBRIDIA: A hybrid solution for processing big logistics data. Fut. Gen. Comput. Syst. 97 (2019), 792804.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. [7] Ardagna Danilo, Cappiello Cinzia, Samá Walter, and Vitali Monica. 2018. Context-aware data quality assessment for big data. Fut. Gen. Comput. Syst. 89 (2018), 548562.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. [8] Azeroual Otmane and Abuosba Mohammad. 2019. Improving the data quality in the research information systems. arXiv preprint arXiv:1901.07388 (2019).Google ScholarGoogle Scholar
  9. [9] Bārzdiņš Jānis, Zariņš Andris, Čerāns Kārlis, Kalniņš Audris, Rencis Edgars, Lāce Lelde, Liepiņš Renārs, and Sprog̀is Artūrs. 2007. GrTP: Transformation based graphical tool building platform. In 10th International Conference on Model-driven Engineering Languages and Systems, Models.Google ScholarGoogle Scholar
  10. [10] Batini Carlo, Cabitza Federico, Cappiello Cinzia, and Francalanci Chiara. 2008. A comprehensive data quality methodology for web and structured data. Int. J. Innov. Comput. Applic. 1, 3 (2008), 205218.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. [11] Batini Carlo, Rula Anisa, Scannapieco Monica, and Viscusi Gianluigi. 2015. From data quality to big data quality. J. Datab. Manag. 26, 1 (2015), 6082.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. [12] Bello Sururah A., Oyedele Lukumon O., Akinade Olugbenga O., Bilal Muhammad, Delgado Juan Manuel Davila, Akanbi Lukman A., Ajayi Anuoluwapo O., and Owolabi Hakeem A.. 2021. Cloud computing in construction industry: Use cases, benefits and challenges. Automat. Construct. 122 (2021), 103441.Google ScholarGoogle ScholarCross RefCross Ref
  13. [13] Bernstein Philip A., Madhavan Jayant, and Rahm Erhard. 2011. Generic schema matching, ten years later. Proc. VLDB Endow. 4, 11 (2011), 695701.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. [14] Bhimani Janki, Mi Ningfang, Leeser Miriam, and Yang Zhengyu. 2017. FiM: Performance prediction for parallel computation in iterative data processing applications. In IEEE 10th International Conference on Cloud Computing (CLOUD’17). 359366.Google ScholarGoogle Scholar
  15. [15] Bhimani Janki, Mi Ningfang, Leeser Miriam, and Yang Zhengyu. 2019. New performance modeling methods for parallel data processing applications. ACM Trans. Model. Comput. Simul. 29, 3 (2019), 124.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. [16] Bicevska Zane, Bicevskis Janis, and Oditis Ivo. 2017. Domain-specific characteristics of data quality. Federated Conference on Computer Science and Information Systems (FedCSIS’17). 9991003.Google ScholarGoogle Scholar
  17. [17] Bicevska Zane, Bicevskis Janis, and Oditis Ivo. 2018. Models of data quality. In Information Technology for Management. Ongoing Research and Development: 15th Conference, AITM 2017, and 12th Conference, ISM 2017, Held as Part of FedCSIS, Prague, Czech Republic, September 3–6, 2017, Extended Selected Papers 15. Springer, 194211.Google ScholarGoogle ScholarCross RefCross Ref
  18. [18] Bicevskis Janis, Bicevska Zane, and Karnitis Girts. 2017. Executable data quality models. Procedia Comput. Sci. 104 (2017), 138145.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. [19] Bicevskis Janis, Bicevska Zane, Nikiforova Anastasija, and Oditis Ivo. 2018. An approach to data quality evaluation. In Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS’18). 196201.Google ScholarGoogle Scholar
  20. [20] Biscobing Jacqueline. 2018. What Is Data Sampling? Retrieved from https://www.techtarget.com/searchbusinessanalytics/definition/data-sampling.Google ScholarGoogle Scholar
  21. [21] Bronselaer Antoon, Nielandt Joachim, Boeckling Toon, and Tré Guy De. 2018. Operational measurement of data quality. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Applications: 17th International Conference, IPMU 2018, Cádiz, Spain, June 11–15, 2018, Proceedings, Part III 17. Springer, 517528.Google ScholarGoogle ScholarCross RefCross Ref
  22. [22] Brüggemann Stefan and Grüning Fabian. 2009. Using ontologies providing domain knowledge for data quality management. Networked Knowledge-Networked Media: Integrating Knowledge Management, New Media Technologies and Semantic Systems. Springer, 187203.Google ScholarGoogle ScholarCross RefCross Ref
  23. [23] Buneman Peter and Davidson Susan B.. 2010. Data provenance–The foundation of data quality. In Workshop: Issues and Opportunities for Improving the Quality and Use of Data within the DoD, Arlington, 2628.Google ScholarGoogle Scholar
  24. [24] Cai Li and Zhu Yangyong. 2015. The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 14 (2015).Google ScholarGoogle ScholarCross RefCross Ref
  25. [25] Carlo Batini, Daniele Barone, Federico Cabitza, and Simone Grega. 2011. A data quality methodology for heterogeneous data. Int. J. Datab. Manag. Syst. 3, 1 (2011), 6079.Google ScholarGoogle Scholar
  26. [26] Choi O.-Hoon, Lim Jun-Eun, Na Hong-Seok, and Baik Doo-Kwon. 2008. An efficient method of data quality using quality evaluation ontology. 2008 Third International Conference on Convergence and Hybrid Information Technology 2 (2008), 10581061.Google ScholarGoogle Scholar
  27. [27] Cichy Corinna and Rass Stefan. 2019. An overview of data quality frameworks. IEEE Access 7 (2019), 2463424648.Google ScholarGoogle ScholarCross RefCross Ref
  28. [28] Clarke Roger. 2014. Quality Factors in Big Data and Big Data Analytics. Xamax Consultancy Pty Ltd.Google ScholarGoogle Scholar
  29. [29] Cormode Graham and Duffield Nick. 2014. Sampling for big data: A tutorial. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 19751975.Google ScholarGoogle Scholar
  30. [30] Corporation Microsoft. 2013. Data Quality Services. Retrieved from https://docs.microsoft.com/en-us/sql/data-quality-services/data-quality-services?view=sql-server-ver15.Google ScholarGoogle Scholar
  31. [31] Corporation Microsoft. 2018. SQL Server Integration Services. Retrieved from https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-ver15.Google ScholarGoogle Scholar
  32. [32] Corporation Oracle. 2013. Comprehensive Data Quality with Oracle Data Integrator and Oracle Enterprise Data Quality [White Paper]. Technical Report. Oracle Corporation. Retrieved from https://www.oracle.com/technetwork/middleware/data-integrator/overview/oracledi-comprehensive-quality-131748.pdf.Google ScholarGoogle Scholar
  33. [33] Dai Wei, Wardlaw Isaac, Cui Yu, Mehdi Kashif, Li Yanyan, and Long Jun. 2016. Data profiling technology of data governance regarding big data: Review and rethinking. In Information Technology: New Generations: 13th International Conference on Information Technology. Springer, 439450.Google ScholarGoogle ScholarCross RefCross Ref
  34. [34] Dai Wei, Yoshigoe Kenji, and Parsley William. 2018. Improving data quality through deep learning and statistical models. In Information Technology-New Generations: 14th International Conference on Information Technology. 515522.Google ScholarGoogle Scholar
  35. [35] Daki Houda, Hannani Asmaa El, Aqqal Abdelhak, Haidine Abdelfattah, and Dahbi Aziz. 2017. Big Data management in smart grid: Concepts, requirements and implementation. J. Big Data 4, 1 (2017), 119.Google ScholarGoogle ScholarCross RefCross Ref
  36. [36] Dean Jeffrey and Ghemawat Sanjay. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1, 107113.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. [37] Dhayne Houssein, Haque Rafiqul, Kilany Rima, and Taher Yehia. 2019. In search of big medical data integration solutions—A comprehensive survey. IEEE Access 7 (2019), 9126591290.Google ScholarGoogle ScholarCross RefCross Ref
  38. [38] Dmitriyev Viktor, Mahmoud Tariq, and Marín-Ortega Pablo Michel. 2015. Int. J. Inf. Syst. Proj. Manag. 3, 3 (2015), 4963.Google ScholarGoogle Scholar
  39. [39] Dong Xin Luna, Berti-Equille Laure, and Srivastava Divesh. 2013. Data fusion: Resolving conflicts from multiple sources. Handbook of Data Quality: Research and Practice. Springer, 293318.Google ScholarGoogle ScholarCross RefCross Ref
  40. [40] Dong Xin Luna and Srivastava Divesh. 2013. Big data integration. In IEEE 29th International Conference on Data Engineering (ICDE’13). IEEE, 12451248.Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. [41] Dragoni Nicola, Lanese Ivan, Larsen Stephan Thordal, Mazzara Manuel, Mustafin Ruslan, and Safina Larisa. 2018. Microservices: How to make your application scale. In Perspectives of System Informatics: 11th International Andrei P. Ershov Informatics Conference, PSI 2017, Moscow, Russia, June 27–29, 2017, Revised Selected Papers 11. Springer, 95104.Google ScholarGoogle ScholarCross RefCross Ref
  42. [42] Durairaj M. and Poornappriya T. S.. 2018. Importance of MapReduce for big data applications: A survey. Asian J. Comput. Sci. Technol. 7, 1 (2018), 112118.Google ScholarGoogle ScholarCross RefCross Ref
  43. [43] Ehrlinger Lisa, Werth Bernhard, and Wöß Wolfram. 2018. Automated continuous data quality measurement with QuaIIe. Int. J. Advanc. Softw. 11, 3 (2018), 400417.Google ScholarGoogle Scholar
  44. [44] Ehrlinger Lisa, Werth Bernhard, and Wöß Wolfram. 2018. QuaIIe: A data quality assessment tool for integrated information systems. In 10th International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA’18). 2131.Google ScholarGoogle Scholar
  45. [45] Ehrlinger Lisa and Wöß Wolfram. 2017. Automated data quality monitoring. In 22nd MIT International Conference on Information Quality (ICIQ’17). 15–1.Google ScholarGoogle Scholar
  46. [46] Even Adir and Shankaranarayanan Ganesan. 2005. Value-driven data quality assessment. In International Conference on Information Quality (ICIQ’05).Google ScholarGoogle Scholar
  47. [47] Even Adir and Shankaranarayanan Ganesan. 2007. Utility-driven assessment of data quality. ACM SIGMIS Datab.: DATAB. Adv. Inf. Syst. 38, 2 (2007), 7593.Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. [48] Fadlallah Hadi, Taher Yehia, Haque Rafiqul, and Jaber Ali. 2019. ORADIEX: A big data driven smart framework for real-time surveillance and analysis of individual exposure to radioactive pollution. In International Conference on Big Data and Cybersecurity Intelligence (BDCSIntell’19). 5256.Google ScholarGoogle Scholar
  49. [49] Fadlallah Hadi, Taher Yehia, and Jaber Ali. 2018. RaDEn: A scalable and efficient radiation data engineering. In International Conference on Big Data and Cybersecurity Intelligence (BDCSIntell’18). 8993.Google ScholarGoogle Scholar
  50. [50] Salas Óscar Figuerola, Adzic Velibor, Shah Akash, and Kalva Hari. 2013. Assessing internet video quality using crowdsourcing. In 2nd ACM International Workshop on Crowdsourcing for Multimedia. 2328.Google ScholarGoogle Scholar
  51. [51] Finkel Jenny Rose, Grenager Trond, and Manning Christopher D.. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05). 363370.Google ScholarGoogle Scholar
  52. [52] Gao Jerry, Xie Chunli, and Tao Chuanqi. 2016. Big data validation and quality assuranceIssues, challenges, and needs. In IEEE symposium on service-oriented system engineering (SOSE16). 433441.Google ScholarGoogle Scholar
  53. [53] Ge Mouzhi and Helfert Markus. 2007. A review of information quality research-develop a research agenda. In International Conference on Information Quality (ICIQ’07). 7691.Google ScholarGoogle Scholar
  54. [54] Gu Rong, Qi Yang, Wu Tongyu, Wang Zhaokang, Xu Xiaolong, Yuan Chunfeng, and Huang Yihua. 2021. SparkDQ: Efficient generic big data quality management on distributed data-parallel computation. J. ParallelDistrib. Comput. 156 (2021), 132147.Google ScholarGoogle ScholarCross RefCross Ref
  55. [55] Gudivada Venkat, Apon Amy, and Ding Junhua. 2017. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Advanc. Softw. 10, 1 (2017), 120.Google ScholarGoogle Scholar
  56. [56] Gudivada Venkat N., Rao Dhana, and Grosky William I.. 2016. Data quality centric application framework for big data. In International Conference on Big Data, Small Data, Linked Data and Open Data (ALLDATA’16).Google ScholarGoogle Scholar
  57. [57] Hariri Reihaneh H., Fredericks Erik M., and Bowers Kate M.. 2019. Uncertainty in big data analytics: Survey, opportunities, and challenges. J. Big Data 6, 1 (2019), 116.Google ScholarGoogle ScholarCross RefCross Ref
  58. [58] Hasselbring Wilhelm. 2016. Microservices for scalability: Keynote talk abstract. In Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering. 133134.Google ScholarGoogle Scholar
  59. [59] Hay Brian, Nance Kara, and Bishop Matt. 2011. Storm clouds rising: Security challenges for IaaS cloud computing. In 2011 44th Hawaii International Conference on System Sciences. 17.Google ScholarGoogle Scholar
  60. [60] He Qinlu, Li Zhanhuai, and Zhang Xiao. 2010. Data deduplication techniques. In 2010 International Conference on Future Information Technology and Management Engineering 1 (2010), 430433.Google ScholarGoogle Scholar
  61. [61] He Qing, Wang Haocheng, Zhuang Fuzhen, Shang Tianfeng, and Shi Zhongzhi. 2015. Parallel sampling from big data with uncertainty distribution. Fuzzy Sets Syst. 258 (2015), 117133.Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. [62] Helfert Markus and Foley Owen. 2009. A context aware information quality framework. In 2009 Fourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology. 187193.Google ScholarGoogle Scholar
  63. [63] Hogan Aidan, Blomqvist Eva, Cochez Michael, d’Amato Claudia, Melo Gerard de, Gutierrez Claudio, Kirrane Sabrina, Gayo José Emilio Labra, Navigli Roberto, Neumaier Sebastian, et al. 2021. Knowledge graphs. ACM Comput. Surv. 54, 4 (2021), 137.Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. [64] Hosseini Kasra, Nanni Federico, and Ardanuy Mariona Coll. 2020. DeezyMatch: A flexible deep learning approach to fuzzy string matching. In Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 6269.Google ScholarGoogle ScholarCross RefCross Ref
  65. [65] Hoßfeld Tobias, Hirth Matthias, Korshunov Pavel, Hanhart Philippe, Gardlo Bruno, Keimel Christian, and Timmerer Christian. 2014. Survey of web-based crowdsourcing frameworks for subjective quality assessment. In IEEE 16th International Workshop on Multimedia Signal Processing (MMSP’14). 16.Google ScholarGoogle Scholar
  66. [66] Ilyas Ihab F. and Chu Xu. 2019. Data Cleaning. ACM New York, NY.Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. [67] Immonen Anne, Pääkkönen Pekka, and Ovaska Eila. 2015. Evaluating the quality of social media data in big data architecture. IEEE Access 3 (2015), 20282043.Google ScholarGoogle ScholarCross RefCross Ref
  68. [68] Inc. Talend2022. Data Quality and Machine Learning: What’s the Connection? Retrieved from https://www.talend.com/resources/machine-learning-data-quality/.Google ScholarGoogle Scholar
  69. [69] Informatica. 2018. Informatica Data Quality Data Sheet. Technical Report. Informatica. Retrieved from https://www.informatica.com/content/dam/informatica-com/en/collateral/data-sheet/en_informatica-data-quality_data-sheet_6710.pdf.Google ScholarGoogle Scholar
  70. [70] Iqbal Muhammad Hussain, Soomro Tariq Rahim et al. 2015. Big data analysis: Apache Storm perspective. Int. J. Comput. Trends Technol. 19, 1 (2015), 914.Google ScholarGoogle ScholarCross RefCross Ref
  71. [71] ISO/IEC. 2001. ISO/IEC 9126-1:2001. Software Engineering – Product Quality – Part 1: Quality Model. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/22749.html.Google ScholarGoogle Scholar
  72. [72] ISO/IEC. 2008. 25012:2008 Software Engineering – Software Product Quality Requirements and Evaluation (SQuaRE) – Data Quality Model. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/35736.html.Google ScholarGoogle Scholar
  73. [73] ISO/IEC. 2014. ISO/IEC 25000:2014. Systems and Software Engineering – System and Software Quality Requirements and Evaluation (SQuaRE) – Guide to SQuaRE. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/64764.html.Google ScholarGoogle Scholar
  74. [74] ISO/IEC. 2015. ISO/IEC 25024:2015 Systems and Software Engineering – Systems and Software Quality Requirements and Evaluation (SQuaRE) – Measurement of Data Quality. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/35749.html.Google ScholarGoogle Scholar
  75. [75] ISO/IEC. 2017. ISO/IEC 15939:2017 Systems and Software Engineering – Measurement Process. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/71197.html.Google ScholarGoogle Scholar
  76. [76] ISO/IEC. 2020. ISO/IEC 20547-3:2020 Big Data Reference Architecture - Part 3: Reference Architecture. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/71277.html.Google ScholarGoogle Scholar
  77. [77] ISO/IEC. 2022. ISO/IEC AWI 5259-1 Artificial Intelligence – Data Quality for Analytics and Machine Learning (ML) – Part 1: Overview, Terminology, and Examples. Standard. ISO/IEC. Retrieved from https://www.iso.org/standard/81088.html.Google ScholarGoogle Scholar
  78. [78] ISO/TS. 2011. ISO/TS 8000-1:2011 - Data Quality - Part 1: Overview. Standard. ISO/TS. Retrieved from https://www.iso.org/standard/50798.html.Google ScholarGoogle Scholar
  79. [79] Iverson Michael A., Ozguner Fusun, and Potter Lee C.. 1999. Statistical prediction of task execution times through analytic benchmarking for scheduling in a heterogeneous environment. In Proceedings Eighth Heterogeneous Computing Workshop (HCW’99). 99111.Google ScholarGoogle Scholar
  80. [80] Ji Changqing, Li Yu, Qiu Wenming, Awada Uchechukwu, and Li Keqiu. 2012. Big data processing in cloud computing environments. In 2012 12th International Symposium on Pervasive Systems, Algorithms and Networks (2012), 1723.Google ScholarGoogle Scholar
  81. [81] Kadadi Anirudh, Agrawal Rajeev, Nyamful Christopher, and Atiq Rahman. 2014. Challenges of data integration and interoperability in big data. In 2014 IEEE International Conference on Big Data (big data) (2014), 3840.Google ScholarGoogle Scholar
  82. [82] Kaiser Jiří. 2014. Dealing with missing values in data. J. Syst. Integr. 5, 1 (2014) 42–51.Google ScholarGoogle Scholar
  83. [83] Karami Amir, Gangopadhyay Aryya, Zhou Bin, and Kharrazi Hadi. 2015. A fuzzy approach model for uncovering hidden latent semantic structure in medical text collections. In iConference 2015.Google ScholarGoogle Scholar
  84. [84] Karmakar Anurag, Raghuthaman Anaswara, Kote Om Sudhakar, and Jayapandian N.. 2022. Cloud computing application: Research challenges and opportunity. In International Conference on Sustainable Computing and Data Communication Systems (ICSCDS’22). IEEE, 12841289.Google ScholarGoogle ScholarCross RefCross Ref
  85. [85] Khayyat Zuhair, Ilyas Ihab F., Jindal Alekh, Madden S., Ouzzani M., Papotti Paolo, Quiané-Ruiz Jorge-Arnulfo, Tang Nan, and Yin Si. 2015. BigDansing: A system for big data cleansing. In SIGMOD Conference.Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. [86] Kim Jae Kwang and Wang Zhonglei. 2019. Sampling techniques for big data analysis. Int. Statist. Rev. 87 (2019), S177–S191.Google ScholarGoogle ScholarCross RefCross Ref
  87. [87] Kontokostas Dimitris, Zaveri Amrapali, Auer Sören, and Lehmann Jens. 2013. TripleCheckMate: A tool for crowdsourcing the quality assessment of linked data. In Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7–9, 2013. Proceedings 4. Springer, 265272.Google ScholarGoogle ScholarCross RefCross Ref
  88. [88] Kumar Pradeep, Bhatnagar Roheet, Gaur Kuntal, and Bhatnagar Anurag. 2021. Classification of imbalanced data: Review of methods and applications. IOP Conference Series: Materials Science and Engineering 1099, 1 (2021), 012077.Google ScholarGoogle Scholar
  89. [89] Kusumasari Tien Fabrianti et al. 2016. Data profiling for data quality improvement with OpenRefine. In International Conference on Information Technology Systems and Innovation (ICITSI’16). 16.Google ScholarGoogle Scholar
  90. [90] Leung Hareton K. N.. 2001. Quality metrics for intranet applications. Inf. Manag. 38, 3 (2001), 137152.Google ScholarGoogle ScholarCross RefCross Ref
  91. [91] Liu Zhicheng and Zhang Aoqian. 2020. Sampling for big data profiling: A survey. IEEE Access 8 (2020), 7271372726.Google ScholarGoogle ScholarCross RefCross Ref
  92. [92] L’Heureux Alexandra, Grolinger Katarina, Elyamany Hany F., and Capretz Miriam A. M.. 2017. Machine learning with big data: Challenges and approaches. IEEE Access 5 (2017), 77767797.Google ScholarGoogle ScholarCross RefCross Ref
  93. [93] Malhotra Jyoti and Bakal Jagdish. 2015. A survey and comparative study of data deduplication techniques. In International Conference on Pervasive Computing (ICPC’15). 15.Google ScholarGoogle Scholar
  94. [94] McKelvey Nigel, Curran Kevin, and Toland Luke. 2016. The Challenges of Data Cleansing with Data Warehouses. 7782. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  95. [95] Mehrtak Mohammad, SeyedAlinaghi SeyedAhmad, MohsseniPour Mehrzad, Noori Tayebeh, Karimi Amirali, Shamsabadi Ahmadreza, Heydari Mohammad, Barzegary Alireza, Mirzapour Pegah, Soleymanzadeh Mahdi, et al. 2021. Security challenges and solutions using healthcare cloud computing. J. Med. Life 14, 4 (2021), 448.Google ScholarGoogle ScholarCross RefCross Ref
  96. [96] Merino Jorge, Caballero Ismael, Rivas Bibiano, Serrano Manuel, and Piattini Mario. 2016. A data quality in use model for big data. Fut. Gen. Comput. Syst. 63 (2016), 123130.Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. [97] Mihindukulasooriya Nandana, García-Castro Raúl, Priyatna Freddy, Ruckhaus Edna, and Saturno Nelson. 2017. A linked data profiling service for quality assessment. In The Semantic Web: ESWC 2017 Satellite Events: ESWC 2017 Satellite Events, Portorož, Slovenia, May 28–June 1, 2017, Revised Selected Papers 14. Springer, 335340.Google ScholarGoogle Scholar
  98. [98] Missier Paolo, Embury Suzanne, Greenwood Mark, Preece Alun, and Jin Binling. 2006. Quality views: Capturing and exploiting the user perspective on data quality. In International Conference on Very Large Data Bases.Google ScholarGoogle Scholar
  99. [99] Mousannif Hajar, Sabah Hasna, Douiji Yasmina, and Sayad Younes Oulad. 2014. From big data to big projects: A step-by-step roadmap. In 2014 International Conference on Future Internet of Things and Cloud. 373378.Google ScholarGoogle Scholar
  100. [100] Munn Zachary, Peters Micah D. J., Stern Cindy, Tufanaru Catalin, McArthur Alexa, and Aromataris Edoardo. 2018. Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med. Res. Methodol. 18 (2018), 17.Google ScholarGoogle ScholarCross RefCross Ref
  101. [101] Mylavarapu Goutam, Thomas Johnson P., and Viswanathan K. Ashwin. 2019. An automated big data accuracy assessment tool. In IEEE 4th International Conference on Big Data Analytics (ICBDA’19). 193197.Google ScholarGoogle Scholar
  102. [102] Mylavarapu Goutam, Viswanathan K. Ashwin, and Thomas Johnson P.. 2019. Assessing context-aware data consistency. In IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA’19). 16.Google ScholarGoogle Scholar
  103. [103] Najafabadi Maryam M., Villanustre Flavio, Khoshgoftaar Taghi M., Seliya Naeem, Wald Randall, and Muharemagic Edin. 2015. Deep learning applications and challenges in big data analytics. J. Big Data 2, 1 (2015), 121.Google ScholarGoogle ScholarCross RefCross Ref
  104. [104] Nargesian Fatemeh, Zhu Erkang, Miller Renée J., Pu Ken Q., and Arocena Patricia C.. 2019. Data lake management: Challenges and opportunities. Proc. VLDB Endow. 12, 12 (2019), 19861989.Google ScholarGoogle ScholarDigital LibraryDigital Library
  105. [105] Naumann Felix. 2014. Data profiling revisited. ACM SIGMOD Rec. 42, 4 (2014), 4049.Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. [106] Niemelä Eila, Evesti Antti, and Savolainen Pekka. 2008. Modeling quality attribute variability. In International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE’08). 169176.Google ScholarGoogle Scholar
  107. [107] Nikiforova Anastasija and Bicevskis Janis. 2019. An extended data object-driven approach to data quality evaluation: Contextual data quality analysis. In International Conference on Enterprise Information Systems (ICEIS’19). 274281.Google ScholarGoogle ScholarCross RefCross Ref
  108. [108] Nikiforova Anastasija, Bicevskis Janis, Bicevska Zane, and Oditis Ivo. 2020. User-oriented approach to data quality evaluation. J. Univers. Comput. Sci. 26, 1 (2020), 107126.Google ScholarGoogle ScholarCross RefCross Ref
  109. [109] Pääkkönen Pekka and Pakkala Daniel. 2015. Reference architecture and classification of technologies, products and services for big data systems. Big Data Res. 2, 4 (2015), 166186.Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. [110] Patel-Schneider Peter F.. 2015. Towards large-scale schema and ontology matching. Retrieved from https://www.semanticscholar.org/paper/Towards-Large-scale-Schema-And-Ontology-Matching-Patel-Schneider/ceee2bdaef83a0f09480fa6fb191cf3372137152.Google ScholarGoogle Scholar
  111. [111] Pérez Beatriz, Rubio Julio, and Sáenz-Adán Carlos. 2018. A systematic review of provenance systems. Knowl. Inf. Syst. 57 (2018), 495543.Google ScholarGoogle ScholarDigital LibraryDigital Library
  112. [112] Pipino Leo L., Lee Yang W., and Wang Richard Y.. 2002. Data quality assessment. Commun. ACM 45, 4 (2002), 211218.Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. [113] Price Rosanne, Neiger Dina, and Shanks Graeme. 2008. Developing a measurement instrument for subjective aspects of information quality. Commun. Assoc. Inf. Syst. 22, 1 (2008), 3.Google ScholarGoogle Scholar
  114. [114] Rahul Kumar and Banyal R. K.. 2019. Data cleaning mechanism for big data and cloud computing. In 6th International Conference on Computing for Sustainable Global Development (INDIACom’19). 195198.Google ScholarGoogle Scholar
  115. [115] Ramaswamy Lakshmish, Lawson Victor, and Gogineni Siva Venkat. 2013. Towards a quality-centric big data architecture for federated sensor services. In 2013 IEEE International Congress on Big Data. 8693.Google ScholarGoogle Scholar
  116. [116] Rawat R. and Yadav R.. 2021. Big data: Big data analysis, issues and challenges and technologies. IOP Conference Series: Materials Science and Engineering 1022, 1 (2021), 012014.Google ScholarGoogle Scholar
  117. [117] Sadineni Praveen Kumar. 2020. Sampling based join-aggregate query processing technique for big data. Indian J. Comput. Sci. Eng. 11, 5, 532546.Google ScholarGoogle ScholarCross RefCross Ref
  118. [118] Saha Barna and Srivastava Divesh. 2014. Data quality: The other face of big data. In 2014 IEEE 30th International Conference on Data Engineering. 1294–1297.
  119. [119] Schelter Sebastian, Lange Dustin, Schmidt Philipp, Celikel Meltem, Biessmann Felix, and Grafberger Andreas. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (2018), 1781–1794.
  120. [120] Sharma Gaurav. 2021. Data Quality. Retrieved from https://www.computer.org/publications/tech-news/trends/big-data-and-cloud-computing.
  121. [121] Siegmund Norbert, Rosenmüller Marko, Kuhlemann Martin, Kästner Christian, Apel Sven, Duchateau Fabien, and Fagnan Justin. 2015. Schema matching bibtex. In Proceedings of the VLDB Endowment.
  122. [122] Software Calidad. 2022. ISO/IEC 25012. Retrieved from https://iso25000.com/index.php/en/iso-25000-standards/iso-25012.
  123. [123] Stojanović Dragan, Stojanović Natalija, and Turanjanin Jovan. 2015. Processing big trajectory and Twitter data streams using Apache STORM. (2015), 301–304. Retrieved from https://www.semanticscholar.org/paper/Schema-Matching-Bibtex-Siegmund-Rosenm%C3%BCller/a4d94ddaab429e5874386dd29822e470b57d6ee4.
  124. [124] Strong Diane M., Lee Yang W., and Wang Richard Y. 1997. Data quality in context. Commun. ACM 40, 5 (1997), 103–110.
  125. [125] Taher Yehia, Haque Rafiqul, AlShaer Mohammed, Heuvel Willem Jan van den, Hacid Mohand-Saïd, and Dbouk Mohamed. 2016. A context-aware analytics for processing tweets and analysing sentiment in realtime (short paper). In On the Move to Meaningful Internet Systems: OTM 2016 Conferences: Confederated International Conferences: CoopIS, C&TC, and ODBASE 2016, Rhodes, Greece, October 24–28, 2016, Proceedings. Springer, 910–917.
  126. [126] Taher Yehia, Haque Rafiqul, and Hacid Mohand-Said. 2017. BDLaaS: Big data lab as a service for experimenting big data solution. In IEEE 2nd International Workshops on Foundations and Applications of Self* Systems (FAS* W’17). 155–159.
  127. [127] Taleb Ikbal, Dssouli Rachida, and Serhani Mohamed Adel. 2015. Big data pre-processing: A quality framework. (2015), 191–198.
  128. [128] Taleb Ikbal, Serhani Mohamed Adel, and Dssouli Rachida. 2018. Big data quality assessment model for unstructured data. In International Conference on Innovations in Information Technology (IIT’18). 69–74.
  129. [129] Taleb Ikbal, Serhani Mohamed Adel, and Dssouli Rachida. 2019. Big data quality: A data quality profiling model. In Services–SERVICES 2019: 15th World Congress, Held as Part of the Services Conference Federation, SCF 2019, San Diego, CA, USA, June 25–30, 2019, Proceedings 15. Springer, 61–77.
  130. [130] Talend. 2020. How to Manage Modern Data Quality [White Paper]. Technical Report. Talend. Retrieved from https://www.talend.com/resources/definitive-guide-data-quality-how-to-manage.
  131. [131] Talha Mohamed, Elmarzouqi Nabil, and Kalam Anas Abou El. 2020. Towards a powerful solution for data accuracy assessment in the big data context. Int. J. Advanc. Comput. Sci. Applic. 11, 2 (2020).
  132. [132] Venkataraman Shivaram, Yang Zongheng, Franklin Michael, Recht Benjamin, and Stoica Ion. 2016. Ernest: Efficient performance prediction for large-scale advanced analytics. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI’16). 363–378.
  133. [133] Wang Lidong and Alexander Cheryl Ann. 2016. Machine learning in big data. Int. J. Math., Eng. Manag. Sci. 1, 2 (2016), 52–61.
  134. [134] Wang Richard Y. 1998. A product perspective on total data quality management. Commun. ACM 41, 2 (1998), 58–65.
  135. [135] Wang Richard Y. and Strong Diane. 1996. Beyond accuracy: What data quality means to data consumers. J. Manag. Inf. Syst. 12 (1996), 5–33.
  136. [136] Wang Xinxin, Dang Depeng, and Guo Zixian. 2020. Evaluating the crowd quality for subjective questions based on a Spark computing environment. Fut. Gen. Comput. Syst. 106 (2020), 426–437.
  137. [137] Wei-Liang Chen, Shi-Dong Zhang, and Xiang Gao. 2009. Anchoring the consistency dimension of data quality using ontology in data integration. (2009), 201–205.
  138. [138] Woodall Philip, Oberhofer Martin, and Borek Alexander. 2014. A classification of data quality assessment and improvement methods. Int. J. Inf. Qual. 3, 4 (2014), 298–321.
  139. [139] Zaslavsky Arkady, Perera Charith, and Georgakopoulos Dimitrios. 2013. Sensing as a service and big data. arXiv preprint arXiv:1301.0159 (2013).
  140. [140] Zaveri Amrapali, Kontokostas Dimitris, Sherif Mohamed A., Bühmann Lorenz, Morsey Mohamed, Auer Sören, and Lehmann Jens. 2013. User-driven quality evaluation of DBpedia. In 9th International Conference on Semantic Systems. 97–104.
  141. [141] Zhang Pengcheng, Zhou Xuewu, Li Wenrui, and Gao Jerry. 2017. A survey on quality assurance techniques for big data applications. (2017), 313–319.
  142. [142] Zhang Zhenrong, Zhang Jianshu, Du Jun, and Wang Fengren. 2022. Split, embed and merge: An accurate table structure recognizer. Pattern Recognit. 126 (2022), 108565.
  143. [143] Zhou Lina, Pan Shimei, Wang Jianwu, and Vasilakos Athanasios V. 2017. Machine learning on big data: Opportunities and challenges. Neurocomputing 237 (2017), 350–361.

    • Published in

      Journal of Data and Information Quality  Volume 15, Issue 3
      September 2023
      326 pages
      ISSN:1936-1955
      EISSN:1936-1963
      DOI:10.1145/3611329

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 August 2023
      • Online AM: 13 June 2023
      • Accepted: 8 May 2023
      • Revised: 23 March 2023
      • Received: 16 April 2022