Incremental discovery of denial constraints

Qian, Chaoqin; Li, Menglu; Tan, Zijing; Ran, Ai; Ma, Shuai

doi:10.1007/s00778-023-00788-y

Regular Paper
Published: 17 March 2023

Volume 32, pages 1289–1313 (2023)

The VLDB Journal

Chaoqin Qian¹,
Menglu Li¹,
Zijing Tan ORCID: orcid.org/0000-0001-6332-780X¹,
Ai Ran¹ &
…
Shuai Ma²


Abstract

We investigate the problem of incremental denial constraint (DC) discovery: discovering DCs in response to a set $\triangle r$ of tuple insertions to a given relational instance $r$, given the known set $\varSigma$ of DCs holding on $r$. The need for this study is evident, since real-life data are frequently updated and it is often prohibitively expensive to perform DC discovery from scratch for every update. We tackle the problem in two steps. We first employ indexing techniques to efficiently identify the incremental evidences caused by $\triangle r$. We present algorithms to build indexes for $\varSigma$ and $r$ in a pre-processing step, and to visit and update the indexes in response to $\triangle r$. In particular, we propose a novel indexing technique for two inequality comparisons, possibly across the attributes of $r$. By leveraging the indexes, we can identify all the tuple pairs incurred by $\triangle r$ that simultaneously satisfy the two comparisons, with a cost dependent on $\log(|r|)$. We then compute the changes $\triangle \varSigma$ to $\varSigma$ based on the incremental evidences, such that $\varSigma \oplus \triangle \varSigma$ is the set of DCs holding on $r + \triangle r$. $\triangle \varSigma$ may contain new DCs that are added to $\varSigma$ and obsolete DCs that are removed from $\varSigma$. Our experimental evaluations show that our incremental approach is faster by orders of magnitude than the two state-of-the-art batch DC discovery approaches that compute from scratch on $r + \triangle r$, even when $\triangle r$ is up to 30% of $r$.
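
To make the notions of incremental evidence and obsolete DC concrete, the following minimal Python sketch may help. It is not the index-based method of the article: it assumes a naive scan over every tuple pair that involves $\triangle r$, so its cost is linear in $|r|$ per inserted tuple rather than dependent on $\log(|r|)$. It also covers only the removal of obsolete DCs, not the derivation of new ones. The predicate encoding `(attribute of t, operator, attribute of s)`, the function names, and the sample data are all hypothetical and chosen for illustration only.

```python
import operator

# Comparison operators that may appear in a DC predicate.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def evidence(t, s, predicates):
    """Set of predicates (a, op, b), read as t[a] op s[b], satisfied by the pair (t, s)."""
    return frozenset(p for p in predicates if OPS[p[1]](t[p[0]], s[p[2]]))

def incremental_evidences(r, delta_r, predicates):
    """Evidences of all ordered pairs of distinct tuples with at least one tuple in delta_r."""
    evs = set()
    for t in delta_r:
        for s in r:  # pairs between a new tuple and an old tuple, both directions
            evs.add(evidence(t, s, predicates))
            evs.add(evidence(s, t, predicates))
    for i, t in enumerate(delta_r):  # pairs inside delta_r
        for j, s in enumerate(delta_r):
            if i != j:
                evs.add(evidence(t, s, predicates))
    return evs

def obsolete_dcs(sigma, evs):
    """A DC, given as a set of predicates, is violated once one evidence contains all of them."""
    return {dc for dc in sigma if any(dc <= ev for ev in evs)}

# Hypothetical example: two DCs that both hold on r.
predicates = [("salary", ">", "salary"), ("tax", "<", "tax"), ("salary", "=", "salary")]
dc1 = frozenset({("salary", ">", "salary"), ("tax", "<", "tax")})  # higher salary, lower tax
dc2 = frozenset({("salary", "=", "salary")})                       # no two equal salaries
sigma = {dc1, dc2}

r = [{"salary": 50, "tax": 5}, {"salary": 80, "tax": 9}]
delta_r = [{"salary": 60, "tax": 4}]

evs = incremental_evidences(r, delta_r, predicates)
removed = obsolete_dcs(sigma, evs)  # dc1 is violated by the new tuple paired with the first tuple of r
```

In this sketch the inserted tuple earns more than the first tuple of `r` but pays less tax, so `dc1` no longer holds on $r + \triangle r$ and would be removed from $\varSigma$, while `dc2` is unaffected.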

Notes

https://github.com/HPI-Information-Systems/metanome-algorithms/tree/hydra
https://github.com/HPI-Information-Systems/metanome-algorithms/tree/master/dcfinder
The predicate pairs are not shown because attribute names of Adult and UCE do not have semantic meaning.

References

Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. VLDB J. 24(4), 557–581 (2015)
Abedjan, Z., Golab, L., Naumann, F.: Data profiling: a tutorial. In SIGMOD, pp. 1747–1751 (2017)
Abedjan, Z., Golab, L., Naumann, F., Papenbrock, T.: Data Profiling. In: Synthesis lectures on data management. Morgan and Claypool Publishers, San Rafael (2018)
Abedjan, Z., Quiané-Ruiz, J. A., Naumann, F.: Detecting unique column combinations on dynamic data. In ICDE, pp. 1036–1047 (2014)
Birnick, J., Bläsius, T., Friedrich, T., Naumann, F., Papenbrock, T., Schirneck, M.: Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13(11), 2270–2283 (2020)
Bleifuß, T., Kruse, S., Naumann, F.: Efficient denial constraint discovery with hydra. PVLDB 11(3), 311–323 (2017)
Caruccio, L., Cirillo, S.: Incremental discovery of imprecise functional dependencies. ACM J. Data Inf. Qual. 12(4), 19:1–19:25 (2020)
Caruccio, L., Cirillo, S., Deufemia, V., Polese, G.: Incremental discovery of functional dependencies with a bit-vector algorithm. In SEBD (2019)
Caruccio, L., Deufemia, V., Naumann, F., Polese, G.: Discovering relaxed functional dependencies based on multi-attribute dominance. IEEE Trans. Knowl. Data Eng. 33(9), 3212–3228 (2021)
Caruccio, L., Deufemia, V., Polese, G.: Mining relaxed functional dependencies from data. Data Min. Knowl. Discov. 34(2), 443–477 (2020)
Cheng, Q., Gryz, J., Koo, F., Leung, T.Y.C., Liu, L., Qian, X., Schiefer, K.B.: Implementation of two semantic query optimization techniques in DB2 universal database. In VLDB, pp. 687–698 (1999)
Chu, X., Ilyas, I.F., Papotti, P.: Discovering denial constraints. PVLDB 6(13), 1498–1509 (2013)
Chu, X., Ilyas, I. F., Papotti, P.: Holistic data cleaning: Putting violations into context. In ICDE, pp. 458–469 (2013)
Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In VLDB, pp. 315–326 (2007)
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A. K., Ilyas, I. F., Ouzzani, M., Tang, N.: Nadeef: a commodity data cleaning system. In SIGMOD, pp. 541–552 (2013)
Fan, W., Geerts, F.: Foundations of Data Quality Management. In Synthesis lectures on data management. Morgan and Claypool Publishers, San Rafael (2012)
Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33(2), 6:1–6:48 (2008)
Fan, W., Hu, C., Liu, X., Lu, P.: Discovering graph functional dependencies. ACM Trans. Database Syst. 45(3), 15:1–15:42 (2020)
Ge, C., Ilyas, I.F., Kerschbaum, F.: Secure multi-party functional dependency discovery. PVLDB 13(2), 184–196 (2019)
Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Cleaning data with llunatic. VLDB J. 29(4), 867–892 (2020)
Giannakopoulou, S., Karpathiotakis, M., Ailamaki, A.: Cleaning denial constraint violations through relaxation. In SIGMOD, pp. 805–815 (2020)
Gilad, A., Deutch, D., Roy, S.: On multiple semantics for declarative database repairs. In SIGMOD, pp. 817–831 (2020)
Ginsburg, S., Hull, R.: Order dependency in the relational model. Theor. Comput. Sci. 26, 149–195 (1983)
Ginsburg, S., Hull, R.: Sort sets in the relational model. J. ACM 33(3), 465–488 (1986)
Heise, A., Quiané-Ruiz, J.-A., Abedjan, Z., Jentzsch, A., Naumann, F.: Scalable discovery of unique column combinations. PVLDB 7(4), 301–312 (2013)
Ilyas, I.F., Chu, X.: Data Cleaning. ACM, New York (2019)
Jin, Y., Tan, Z., Zeng, W., Ma, S.: Approximate order dependency discovery. In ICDE, pp. 25–36 (2021)
Jin, Y., Zhu, L., Tan, Z.: Efficient bidirectional order dependency discovery. In ICDE, pp. 61–72 (2020)
Karegar, R., Godfrey, P., Golab, L., Kargar, M., Srivastava, D., Szlichta, J.: Efficient discovery of approximate order dependencies. In EDBT, pp. 427–432 (2021)
Khayyat, Z., Ilyas, I. F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J. A., Tang, N., Yin, S.: Bigdansing: a system for big data cleansing. In SIGMOD, pp. 1215–1230 (2015)
Khayyat, Z., Lucia, W., Singh, M., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Kalnis, P.: Lightning fast and space efficient inequality joins. PVLDB 8(13), 2074–2085 (2015)
Khayyat, Z., Lucia, W., Singh, M., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Kalnis, P.: Fast and scalable inequality joins. VLDB J. 26(1), 125–150 (2017)
Kossmann, J., Papenbrock, T., Naumann, F.: Data dependencies for query optimization: a survey. VLDB J. 31(1), 1–22 (2022)
Koumarelas, I.K., Naskos, A., Gounaris, A.: Flexible partitioning for selective binary theta-joins in a massively parallel setting. Distributed Parallel Databases 36(2), 301–337 (2018)
Kruse, S., Naumann, F.: Efficient discovery of approximate dependencies. PVLDB 11(7), 759–772 (2018)
Langer, P., Naumann, F.: Efficient order dependency detection. VLDB J. 25(2), 223–241 (2016)
Livshits, E., Heidari, A., Ilyas, I.F., Kimelfeld, B.: Approximate denial constraints. PVLDB 13(10), 1682–1695 (2020)
Ma, S., Fan, W., Bravo, L.: Extending inclusion dependencies with conditions. Theor. Comput. Sci. 515, 64–95 (2014)
Nerone, M. A., Holanda, P., de Almeida, E. C., Manegold, S.: Multidimensional adaptive and progressive indexes. In ICDE, pp. 624–635 (2021)
Okcan, A., Riedewald, M.: Processing theta-joins using MapReduce. In SIGMOD, pp. 949–960 (2011)
Papenbrock, T., Naumann, F.: A hybrid approach to functional dependency discovery. In SIGMOD, pp. 821–833 (2016)
Pena, E. H. M., de Almeida, E. C.: BFASTDC: A bitwise algorithm for mining denial constraints. In DEXA, pp. 53–68 (2018)
Pena, E.H.M., de Almeida, E.C., Naumann, F.: Discovery of approximate (and exact) denial constraints. PVLDB 13(3), 266–278 (2019)
Pena, E.H.M., de Almeida, E.C., Naumann, F.: Fast detection of denial constraint violations. Proc. VLDB Endow. 15(4), 859–871 (2021)
Pena, E. H. M., Filho, E. R. L., de Almeida, E. C., Naumann, F.: Efficient detection of data dependency violations. In CIKM, pp. 1235–1244 (2020)
Pugh, W.: Skip lists: a probabilistic alternative to balanced trees. Commun. ACM 33(6), 668–676 (1990)
Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean: holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10(11), 1190–1201 (2017)
Saxena, H., Golab, L., Ilyas, I. F.: Distributed discovery of functional dependencies. In ICDE, pp. 1590–1593 (2019)
Saxena, H., Golab, L., Ilyas, I.F.: Distributed implementations of dependency discovery algorithms. Proc. VLDB Endow. 12(11), 1624–1636 (2019)
Schirmer, P., Papenbrock, T., Koumarelas, I.K., Naumann, F.: Efficient discovery of matching dependencies. ACM Trans. Database Syst. 45(3), 13:1–13:33 (2020)
Schirmer, P., Papenbrock, T., Kruse, S., Naumann, F., Hempfing, D., Mayer, T., Neuschäfer-Rube, D.: Dynfd: functional dependency discovery in dynamic datasets. In EDBT, pp. 253–264 (2019)
Schmidl, S., Papenbrock, T.: Efficient distributed discovery of bidirectional order dependencies. VLDB J. 31(1), 49–74 (2022)
Shaabani, N., Meinel, C.: Incrementally updating unary inclusion dependencies in dynamic data. Distrib. Parallel Databases 37(1), 133–176 (2019)
Simmen, D. E., Shekita, E. J., Malkemus, T.: Fundamental techniques for order optimization. In SIGMOD, pp. 57–67 (1996)
Song, S., Chen, L.: Discovering matching dependencies. In CIKM, pp. 1421–1424 (2009)
Song, S., Chen, L.: Efficient discovery of similarity constraints for matching dependencies. Data Knowl. Eng. 87, 146–166 (2013)
Song, S., Gao, F., Huang, R., Wang, C.: Data dependencies extended for variety and veracity: A family tree. IEEE Trans. Knowl. Data Eng. 34(10), 4717–4736 (2022)
Szlichta, J., Godfrey, P., Golab, L., Kargar, M., Srivastava, D.: Effective and complete discovery of order dependencies via set-based axiomatization. PVLDB 10(7), 721–732 (2017)
Szlichta, J., Godfrey, P., Golab, L., Kargar, M., Srivastava, D.: Effective and complete discovery of bidirectional order dependencies via set-based axioms. VLDB J. 27(4), 573–591 (2018)
Szlichta, J., Godfrey, P., Gryz, J.: Fundamentals of order dependencies. PVLDB 5(11), 1220–1231 (2012)
Szlichta, J., Godfrey, P., Gryz, J., Ma, W., Qiu, W., Zuzarte, C.: Business-intelligence queries with order dependencies in DB2. In EDBT, pp. 750–761 (2014)
Szlichta, J., Godfrey, P., Gryz, J., Zuzarte, C.: Expressiveness and complexity of order dependencies. PVLDB 6(14), 1858–1869 (2013)
Tan, Z., Ran, A., Ma, S., Qin, S.: Fast incremental discovery of pointwise order dependencies. PVLDB 13(10), 1669–1681 (2020)
Tschirschnitz, F., Papenbrock, T., Naumann, F.: Detecting inclusion dependencies on very many tables. ACM Trans. Database Syst. 42(3), 18:1–18:29 (2017)
Vazirani, V.V.: Approximation algorithms. Springer, Heidelberg (2001)
Wei, Z., Hartmann, S., Link, S.: Algorithms for the discovery of embedded functional dependencies. VLDB J. 30(6), 1069–1093 (2021)
Wei, Z., Link, S.: Discovery and ranking of functional dependencies. In ICDE, pp. 1526–1537 (2019)
Weise, J., Schmidl, S., Papenbrock, T.: Optimized theta-join processing through candidate pruning and workload distribution. In BTW, pp. 59–78 (2021)
Xiao, R., Tan, Z., Wang, H., Ma, S.: Fast approximate denial constraint discovery. Proc. VLDB Endow. 16(2), 269–281 (2022)
Xiao, R., Yuan, Y., Tan, Z., Ma, S., Wang, W.: Dynamic functional dependency discovery with dynamic hitting set enumeration. In ICDE, pp. 286–298 (2022)
Zhu, L., Sun, X., Tan, Z., Yang, K., Yang, W., Zhou, X., Tian, Y.: Incremental discovery of order dependencies on tuple insertions. In DASFAA, pp. 157–174 (2019)

Acknowledgements

This work is supported by the National Natural Science Foundation of China under grants 62172102 and 61925203. We thank the authors of [6, 32, 43] for sharing their code for our experimental evaluation.

Author information

Authors and Affiliations

School of Computer Science, Fudan University, Shanghai, China
Chaoqin Qian, Menglu Li, Zijing Tan & Ai Ran
SKLSDE Lab, Beihang University, Beijing, China
Shuai Ma

Corresponding author

Correspondence to Zijing Tan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qian, C., Li, M., Tan, Z. et al. Incremental discovery of denial constraints. The VLDB Journal 32, 1289–1313 (2023). https://doi.org/10.1007/s00778-023-00788-y

Received: 24 December 2021
Revised: 25 December 2022
Accepted: 17 February 2023
Published: 17 March 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s00778-023-00788-y
