research-article

Ensuring Data Readiness for Quality Requirements with Help from Procedure Reuse

Published: 27 April 2021

Abstract

Assessing and improving the quality of data are fundamental challenges in Big-Data applications. These challenges have given rise to numerous solutions targeting transformation, integration, and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between standalone tools in these areas. In this article, we focus on the problem of determining whether the available data-transforming procedures can be used together to bring about the desired quality characteristics of the data in business or analytics processes. For example, to help an organization avoid building a data-quality solution from scratch when facing a new analytics task, we ask whether the data quality can be improved by reusing the tools that are already available, and if so, which tools to apply, and in which order, all without presuming knowledge of the internals of the tools, which may be external or proprietary.

Toward addressing this problem, we conduct a formal study in which individual data cleaning, data migration, or other data-transforming tools are abstracted as black-box procedures with only some of the properties exposed, such as their applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. As a proof of concept, we provide foundational results on sequential applications of procedures abstracted in this way, to achieve prespecified data-quality objectives, for the use case of relational data and for procedures described by standard relational constraints. We show that, while reasoning in this framework may be computationally infeasible in general, there exist well-behaved cases in which these foundational results can be applied in practice for achieving desired data-quality results on Big Data.
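To make the black-box abstraction concrete, the following is a minimal sketch and not the article's formalism. It assumes that data-quality properties can be reduced to atomic labels, whereas the article works with standard relational constraints. Each hypothetical procedure exposes only its applicability requirements (`requires`), the properties it may invalidate because it modifies that part of the data (`may_break`), and the properties that hold once it has been applied (`ensures`). A breadth-first search then looks for a shortest sequence of procedures that reaches the quality goal. All names (`Procedure`, `find_plan`, and the property labels) are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Procedure:
    """A black-box tool described only by its exposed properties (hypothetical model)."""
    name: str
    requires: FrozenSet[str]   # applicability requirements
    may_break: FrozenSet[str]  # properties not guaranteed to survive the modification
    ensures: FrozenSet[str]    # properties guaranteed after the procedure runs


def apply(state: FrozenSet[str], proc: Procedure) -> Optional[FrozenSet[str]]:
    """Return the properties known to hold after proc, or None if proc is not applicable."""
    if not proc.requires <= state:
        return None
    return (state - proc.may_break) | proc.ensures


def find_plan(initial: Iterable[str], goal: Iterable[str],
              procedures: List[Procedure]) -> Optional[List[str]]:
    """Breadth-first search for a shortest procedure sequence that establishes the goal."""
    start, target = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if target <= state:
            return plan
        for proc in procedures:
            nxt = apply(state, proc)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [proc.name]))
    return None  # no sequence of the available procedures reaches the goal


# Illustrative tools: normalizing the schema may reintroduce duplicates,
# so the order of application matters.
TOOLS = [
    Procedure("dedupe", frozenset({"schema_normalized"}), frozenset(),
              frozenset({"no_duplicates"})),
    Procedure("normalize", frozenset(), frozenset({"no_duplicates"}),
              frozenset({"schema_normalized"})),
    Procedure("fill_nulls", frozenset({"schema_normalized"}), frozenset(),
              frozenset({"no_nulls"})),
]

plan = find_plan(set(), {"no_duplicates", "no_nulls"}, TOOLS)
print(plan)
```

In this propositional simplification the search space is finite, with at most one state per subset of labels, so the search always terminates. This stands in contrast to the general relational setting, where the abstract notes that reasoning may be computationally infeasible.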

