Abstract
We investigate the problem of incremental denial constraint (DC) discovery: given a relational instance \(r\), the known set \(\varSigma \) of DCs holding on \(r\), and a set \(\triangle r\) of tuple insertions to \(r\), discover the DCs that hold on \(r+\triangle r\). The study is motivated by the fact that real-life data are frequently updated, and it is often prohibitively expensive to perform DC discovery from scratch for every update. We tackle this problem in two steps. We first employ indexing techniques to efficiently identify the incremental evidences caused by \(\triangle r\). We present algorithms to build indexes for \(\varSigma \) and \(r\) in a pre-processing step, and to visit and update the indexes in response to \(\triangle r\). In particular, we propose a novel indexing technique for two inequality comparisons, possibly across the attributes of \(r\). By leveraging the indexes, we can identify all the tuple pairs incurred by \(\triangle r\) that simultaneously satisfy the two comparisons, with a cost dependent on \(\log (|r|)\). We then compute the changes \(\triangle \varSigma \) to \(\varSigma \) based on the incremental evidences, such that \(\varSigma \oplus \triangle \varSigma \) is the set of DCs holding on \(r+\triangle r\). \(\triangle \varSigma \) may contain new DCs that are added to \(\varSigma \) and obsolete DCs that are removed from \(\varSigma \). Our experimental evaluations show that our incremental approach is faster, by orders of magnitude, than the two state-of-the-art batch DC discovery approaches that compute from scratch on \(r+\triangle r\), even when \(\triangle r\) is up to 30% of \(r\).
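To make the notion of incremental evidences concrete, the following is a minimal sketch, not the indexed algorithm of the article. It assumes the usual evidence-set formulation of DC discovery: the evidence of an ordered tuple pair is the set of predicates that the pair satisfies, and a DC \(\lnot (p_1 \wedge \dots \wedge p_k)\) holds if and only if no evidence contains all of \(p_1, \dots , p_k\). All names (`predicate_space`, `evidence`, `incremental_evidences`, `still_valid`) and the sample data are hypothetical; the pairs are scanned naively, whereas the article uses indexes to avoid exactly this scan.

```python
import operator

# Hypothetical comparison operators; a predicate is (attr of t, op, attr of s).
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}

def predicate_space(attrs):
    """Same-attribute predicates t.A op s.A (cross-attribute ones omitted for brevity)."""
    return [(a, op, a) for a in attrs for op in OPS]

def evidence(t, s, preds):
    """Set of predicates satisfied by the ordered tuple pair (t, s)."""
    return frozenset(p for p in preds if OPS[p[1]](t[p[0]], s[p[2]]))

def incremental_evidences(r, delta_r, preds):
    """Evidences of all ordered pairs with at least one tuple from delta_r (naive scan)."""
    out = set()
    for t in delta_r:
        for s in r:
            out.add(evidence(t, s, preds))
            out.add(evidence(s, t, preds))
    for i, t in enumerate(delta_r):
        for j, s in enumerate(delta_r):
            if i != j:
                out.add(evidence(t, s, preds))
    return out

def still_valid(dc, evidences):
    """A DC (a set of predicates) holds iff no evidence contains all of its predicates."""
    dc = frozenset(dc)
    return not any(dc <= e for e in evidences)

# Made-up instance r and insertion set delta_r.
r = [{"salary": 50, "tax": 5}, {"salary": 80, "tax": 9}]
delta_r = [{"salary": 90, "tax": 7}]
preds = predicate_space(["salary", "tax"])
inc = incremental_evidences(r, delta_r, preds)

# not(t.salary > s.salary and t.tax < s.tax): holds on r, violated after the insertion.
dc_order = {("salary", ">", "salary"), ("tax", "<", "tax")}
# not(t.salary = s.salary and t.tax != s.tax): unaffected by the insertion.
dc_fd = {("salary", "=", "salary"), ("tax", "!=", "tax")}

obsolete = [dc for dc in (dc_order, dc_fd) if not still_valid(dc, inc)]
```

In this sketch only the obsolete part of \(\triangle \varSigma \) is identified: `dc_order` is violated by the pair formed by the inserted tuple and the second tuple of `r`. Deriving the new DCs that replace an obsolete one requires an additional enumeration step that is not shown here.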

















Notes
The predicate pairs are not shown because attribute names of Adult and UCE do not have semantic meaning.
About this article
Cite this article
Qian, C., Li, M., Tan, Z. et al. Incremental discovery of denial constraints. The VLDB Journal 32, 1289–1313 (2023). https://doi.org/10.1007/s00778-023-00788-y
DOI: https://doi.org/10.1007/s00778-023-00788-y