Abstract
Assessing and improving the quality of data are fundamental challenges in Big-Data applications. These challenges have given rise to numerous solutions targeting transformation, integration, and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between standalone tools in these areas. In this article, we focus on the problem of determining whether the available data-transforming procedures can be used together to bring about the desired quality characteristics of the data in business or analytics processes. For example, to help an organization avoid building a data-quality solution from scratch when facing a new analytics task, we ask whether the data quality can be improved by reusing the tools that are already available, and if so, which tools to apply, and in which order, all without presuming knowledge of the internals of the tools, which may be external or proprietary.
Toward addressing this problem, we conduct a formal study in which individual data cleaning, data migration, or other data-transforming tools are abstracted as black-box procedures with only some of the properties exposed, such as their applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. As a proof of concept, we provide foundational results on sequential applications of procedures abstracted in this way, to achieve prespecified data-quality objectives, for the use case of relational data and for procedures described by standard relational constraints. We show that, while reasoning in this framework may be computationally infeasible in general, there exist well-behaved cases in which these foundational results can be applied in practice for achieving desired data-quality results on Big Data.
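The black-box view described above can be illustrated with a small sketch. This is a deliberately simplified propositional toy, not the formalism of the article, which works with relational data and standard relational constraints. Here each data-quality property is just a label, and every name (`Procedure`, `requires`, `may_break`, `ensures`, `plan`, and the three example tools) is hypothetical. Each tool exposes only its applicability requirements, the properties it may invalidate by modifying the data, and the properties that hold afterward; a breadth-first search then looks for a shortest sequence of tools that reaches the desired properties.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Procedure:
    """A black-box tool, described only by its exposed properties (toy model)."""
    name: str
    requires: frozenset   # properties that must hold before the tool can run
    may_break: frozenset  # properties no longer guaranteed after the tool modifies data
    ensures: frozenset    # properties guaranteed once the tool has run


def run(state, proc):
    """Return the properties known to hold after applying proc in state."""
    return (state - proc.may_break) | proc.ensures


def plan(initial, goal, procedures):
    """Breadth-first search for a shortest sequence of tool names reaching goal.

    Returns a list of names, or None if no sequence of the given tools works.
    """
    start, goal = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for proc in procedures:
            if proc.requires <= state:          # the tool is applicable here
                nxt = run(state, proc)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [proc.name]))
    return None


# Hypothetical tools, for illustration only.
TOOLS = [
    Procedure("migrate", frozenset(), frozenset({"no_duplicates"}),
              frozenset({"schema_valid"})),
    Procedure("dedupe", frozenset({"schema_valid"}), frozenset(),
              frozenset({"no_duplicates"})),
    Procedure("fill_nulls", frozenset({"schema_valid", "no_duplicates"}),
              frozenset(), frozenset({"no_nulls"})),
]

if __name__ == "__main__":
    print(plan(set(), {"no_duplicates", "no_nulls"}, TOOLS))
```

In this toy setting the search space is finite, since there are only finitely many sets of property labels, so the search always terminates. That is an artifact of the simplification: the abstract notes that reasoning in the actual framework may be computationally infeasible in general.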
Index Terms
- Ensuring Data Readiness for Quality Requirements with Help from Procedure Reuse