Consistent Query Answering for Primary Keys in Datalog

Theory of Computing Systems

Abstract

We study the complexity of consistent query answering on databases that may violate primary key constraints. A repair of such a database is any consistent database that can be obtained by deleting a minimal set of tuples. For every Boolean query q, CERTAINTY(q) is the problem that takes a database as input and asks whether q evaluates to true on every repair. In Koutris and Wijsen (ACM Trans. Database Syst. 42(2), 9:1–9:45, 2017), the authors show that for every self-join-free Boolean conjunctive query q, the problem CERTAINTY(q) is either in P or coNP-complete, and it is decidable which of the two cases applies. In this article, we sharpen this result by showing that for every self-join-free Boolean conjunctive query q, the problem CERTAINTY(q) is either expressible in symmetric stratified Datalog (with some aggregation operator) or coNP-complete. Since symmetric stratified Datalog is in L, we thus obtain a complexity-theoretic dichotomy between L and coNP-complete. Another new finding of practical importance is that CERTAINTY(q) is on the logspace side of the dichotomy for queries q where all join conditions express foreign-to-primary key matches, which is undoubtedly the most common type of join condition.

Notes

  1. The quotient graph of a directed graph G = (V,E) with respect to an equivalence relation ≡ on V is a directed graph whose vertices are the equivalence classes of ≡; there is a directed edge from class A to class B if E has a directed edge from some vertex in A to some vertex in B.

  2. Here, α[Z ∪{w}] is the restriction of α to Z ∪{w}.
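The quotient-graph construction of Note 1 can be sketched in a few lines of Python. This is only an illustration: the function name `quotient_graph`, the encoding of facts as `(relation, key, value)` triples, and the use of `(relation, key)` as the label of an equivalence class are assumptions made for the example and do not come from the article.

```python
def quotient_graph(vertices, edges, class_of):
    """Quotient of a directed graph with respect to an equivalence relation.

    `class_of` maps each vertex to a hashable label of its equivalence class
    (an assumption of this sketch). There is an edge from class A to class B
    as soon as some vertex of A has an edge to some vertex of B.
    """
    classes = {class_of(v) for v in vertices}
    class_edges = {(class_of(u), class_of(v)) for (u, v) in edges}
    return classes, class_edges


# Hypothetical toy data: two key-equal R-facts, both pointing to one S-fact.
facts = [("R", "a", 1), ("R", "a", 2), ("S", "b", 1)]
edges = [(("R", "a", 1), ("S", "b", 1)), (("R", "a", 2), ("S", "b", 1))]

# Two facts are equivalent when they agree on relation name and key.
classes, class_edges = quotient_graph(facts, edges, lambda f: f[:2])
print(classes)
print(class_edges)
```

Taking "is key-equal to" as the equivalence relation, as in this toy example, gives the kind of quotient that Appendix A calls the block-quotient graph.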

References

  1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995). http://webdam.inria.fr/Alice/

  2. Arenas, M., Bertossi, L. E., Chomicki, J.: Consistent query answers in inconsistent databases. In: ACM PODS, pp. 68–79. https://doi.org/10.1145/303976.303983 (1999)

  3. Arenas, M., Bertossi, L. E., Chomicki, J., He, X., Raghavan, V., Spinrad, J. P.: Scalar aggregation in inconsistent databases. Theor. Comput. Sci. 296(3), 405–434 (2003). https://doi.org/10.1016/S0304-3975(02)00737-5

  4. Aspvall, B., Plass, M. F., Tarjan, R. E.: A linear-time algorithm for testing the truth of certain quantified boolean formulas. Inf. Process. Lett. 8 (3), 121–123 (1979). https://doi.org/10.1016/0020-0190(79)90002-4

  5. Baader, F., Horrocks, I., Lutz, C., Sattler, U.: An introduction to description logic. Cambridge University Press, Cambridge (2017). http://www.cambridge.org/de/academic/subjects/computer-science/knowledge-management-databases-and-data-mining/introduction-description-logic?format=PB#17zVGeWD2TZUeu6s.97

  6. Barceló, P., Fontaine, G.: On the data complexity of consistent query answering over graph databases. J. Comput. Syst. Sci. 88, 164–194 (2017). https://doi.org/10.1016/j.jcss.2017.03.015

  7. Bertossi, L. E.: Database repairing and consistent query answering. Synthesis lectures on data management. Morgan & Claypool Publishers, San Rafael (2011)

  8. Bertossi, L. E.: Database repairs and consistent query answering: Origins and further developments. In: Suciu, D., Skritek, S., Koch, C. (eds.) Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019. https://doi.org/10.1145/3294052.3322190, pp 48–58. ACM (2019)

  9. Bienvenu, M., Bourgaux, C.: Inconsistency-tolerant querying of description logic knowledge bases. In: Pan, J.Z., Calvanese, D., Eiter, T., Horrocks, I., Kifer, M., Lin, F., Zhao, Y. (eds.) Reasoning Web: Logical foundation of knowledge graph construction and query answering - 12th International Summer School 2016, Aberdeen, UK, September 5-9, 2016, Tutorial lectures, Lecture notes in computer science. https://doi.org/10.1007/978-3-319-49493-7_5, vol. 9885, pp 156–202. Springer (2016)

  10. Bulatov, A. A.: Complexity of conservative constraint satisfaction problems. ACM Trans. Comput. Log. 12(4), 24:1–24:66 (2011). https://doi.org/10.1145/1970398.1970400

  11. Dixit, A. A., Kolaitis, P. G.: A SAT-based system for consistent query answering. In: Janota, M., Lynce, I. (eds.) Theory and Applications of Satisfiability Testing - SAT 2019 - 22nd International Conference, SAT 2019, Lisbon, Portugal, July 9-12, 2019, Proceedings, Lecture Notes in Computer Science, vol. 11628, pp 117–135. Springer (2019), https://doi.org/10.1007/978-3-030-24258-9_8

  12. Egri, L., Larose, B., Tesson, P.: Symmetric Datalog and constraint satisfaction problems in Logspace. In: LICS, pp. 193–202. https://doi.org/10.1109/LICS.2007.47 (2007)

  13. Fontaine, G.: Why is it hard to obtain a dichotomy for consistent query answering? ACM Trans. Comput. Log. 16 (1), 7:1–7:24 (2015). https://doi.org/10.1145/2699912

  14. Fuxman, A., Miller, R. J.: First-order query rewriting for inconsistent databases. In: ICDT, pp 337–351 (2005), https://doi.org/10.1007/978-3-540-30570-5_23

  15. Fuxman, A., Miller, R. J.: First-order query rewriting for inconsistent databases. J. Comput. Syst. Sci. 73(4), 610–635 (2007). https://doi.org/10.1016/j.jcss.2006.10.013

  16. Grädel, E., Kolaitis, P. G., Libkin, L., Marx, M., Spencer, J., Vardi, M. Y., Venema, Y., Weinstein, S.: Finite model theory and its applications. Texts in Theoretical Computer Science. An EATCS Series. Springer (2007). https://doi.org/10.1007/3-540-68804-8

  17. Greco, S., Pijcke, F., Wijsen, J.: Certain query answering in partially consistent databases. PVLDB 7(5), 353–364 (2014). http://www.vldb.org/pvldb/vol7/p353-greco.pdf

  18. Grohe, M., Schwentick, T.: Locality of order-invariant first-order formulas. ACM Trans. Comput. Log. 1(1), 112–130 (2000). https://doi.org/10.1145/343369.343386

  19. Kolaitis, P.G., Pema, E., Tan, W.: Efficient querying of inconsistent databases with binary integer programming. PVLDB 6(6), 397–408 (2013). http://www.vldb.org/pvldb/vol6/p397-tan.pdf

  20. Koutris, P., Wijsen, J.: The data complexity of consistent query answering for self-join-free conjunctive queries under primary key constraints. In: PODS. https://doi.org/10.1145/2745754.2745769, pp 17–29 (2015)

  21. Koutris, P., Wijsen, J.: Consistent query answering for self-join-free conjunctive queries under primary key constraints. ACM Trans. Database Syst. 42 (2), 9:1–9:45 (2017). https://doi.org/10.1145/3068334

  22. Koutris, P., Wijsen, J.: Consistent query answering for primary keys and conjunctive queries with negated atoms. In: PODS, pp 209–224 (2018), https://doi.org/10.1145/3196959.3196982

  23. Koutris, P., Wijsen, J.: Consistent query answering for primary keys in logspace. In: Barceló, P., Calautti, M. (eds.) 22nd International Conference on Database Theory, ICDT 2019, March 26-28, 2019, Lisbon, Portugal, LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, vol. 127, pp 23:1–23:19 (2019), https://doi.org/10.4230/LIPIcs.ICDT.2019.23

  24. Lembo, D., Lenzerini, M., Rosati, R., Ruzzi, M., Savo, D. F.: Inconsistency-tolerant query answering in ontology-based data access. J. Web Sem. 33, 3–29 (2015). https://doi.org/10.1016/j.websem.2015.04.002

  25. Libkin, L.: Elements of finite model theory. Texts in Theoretical Computer Science. An EATCS Series. Springer (2004). https://doi.org/10.1007/978-3-662-07003-1

  26. Lincoln, A., Williams, V. V., Williams, R. R.: Tight hardness for shortest cycles and paths in sparse graphs. In: ACM-SIAM SODA. https://doi.org/10.1137/1.9781611975031.80, pp 1236–1252 (2018)

  27. Lutz, C., Wolter, F.: On the relationship between consistent query answering and constraint satisfaction problems. In: ICDT. https://doi.org/10.4230/LIPIcs.ICDT.2015.363, pp 363–379 (2015)

  28. Marileo, M. C., Bertossi, L. E.: The consistency extractor system: Answer set programs for consistent query answering in databases. Data Knowl. Eng. 69(6), 545–572 (2010). https://doi.org/10.1016/j.datak.2010.01.005

  29. Maslowski, D., Wijsen, J.: A dichotomy in the complexity of counting database repairs. J. Comput. Syst. Sci. 79(6), 958–983 (2013). https://doi.org/10.1016/j.jcss.2013.01.011

  30. Maslowski, D., Wijsen, J.: Counting database repairs that satisfy conjunctive queries with self-joins. In: ICDT, pp 155–164 (2014), https://doi.org/10.5441/002/icdt.2014.18

  31. Pijcke, F.: Theoretical and practical methods for consistent query answering in the relational data model. Ph.D. thesis, University of Mons (2018)

  32. Przymus, P., Boniewicz, A., Burzanska, M., Stencel, K.: Recursive query facilities in relational databases: a survey. In: FGIT. https://doi.org/10.1007/978-3-642-17622-7_10, pp 89–99 (2010)

  33. Reingold, O.: Undirected connectivity in log-space. J. ACM 55 (4), 17:1–17:24 (2008). https://doi.org/10.1145/1391289.1391291

  34. Wijsen, J.: On the First-order expressibility of computing certain answers to conjunctive queries over uncertain databases. In: PODS. https://doi.org/10.1145/1807085.1807111, pp 179–190 (2010)

  35. Wijsen, J.: Certain conjunctive query answering in first-order logic. ACM Trans. Database Syst. 37(2), 9:1–9:35 (2012). https://doi.org/10.1145/2188349.2188351

  36. Wijsen, J.: A survey of the data complexity of consistent query answering under key constraints. In: FoIKS. https://doi.org/10.1007/978-3-319-04939-7_2, pp 62–78 (2014)

  37. Wijsen, J.: Foundations of query answering on inconsistent databases. SIGMOD Rec. 48(3), 6–16 (2019). https://doi.org/10.1145/3377391.3377393

Author information

Corresponding author

Correspondence to Jef Wijsen.

Additional information

Appendix E: Proofs of Section 9

We will use the following helping lemma.

Lemma 19

Let q be a query in sjfBCQ that has the key-join property. Then, for all \(F,G\in q\), if \(F\overset{q}{\rightsquigarrow}G\), then there exists a sequence \(F_{0},F_{1},\dots,F_{\ell}\) such that \(F_{0}=F\), \(F_{\ell}=G\), and for all \(i\in\{1,2,\dots,\ell\}\), \({\mathsf{key}}({F_{i}})\subseteq{\mathsf{vars}}({F_{i-1}})\).

Proof

Assume \(F\overset {q}{\rightsquigarrow }G\). We can assume a shortest sequence

$$ F_{0}\stackrel{x_{1}}{\smallfrown}F_{1}\stackrel{x_{2}}{\smallfrown}F_{2}\dotsm\stackrel{x_{\ell-1}}{\smallfrown}F_{\ell-1}\stackrel{x_{\ell}}{\smallfrown}F_{\ell} $$
(7)

that is a witness for \(F\overset{q}{\rightsquigarrow}G\). Clearly, for all \(i\in\{0,1,\dots,\ell-1\}\), \({\mathsf{vars}}({F_{i}})\cap{\mathsf{vars}}({F_{i+1}})\neq\emptyset\). Then, since q has the key-join property, for all \(i\in\{0,1,\dots,\ell-1\}\), either

  1. \({\mathsf{vars}}({F_{i}})\cap{\mathsf{vars}}({F_{i+1}})\in\{{\mathsf{key}}({F_{i}}),{\mathsf{key}}({F_{i+1}})\}\), or

  2. \({\mathsf{vars}}({F_{i}})\cap{\mathsf{vars}}({F_{i+1}})\supseteq{\mathsf{key}}({F_{i}})\cup{\mathsf{key}}({F_{i+1}})\).

We show by induction on increasing i that for all \(i\in\{1,\dots,\ell\}\), \({\mathsf{key}}({F_{i}})\subseteq{\mathsf{vars}}({F_{i-1}})\).

Induction Basis i = 1. From \(x_{1}\notin{F_{0}}^{+,{q}}\), it follows \(x_{1}\notin{\mathsf{key}}({F_{0}})\). It follows that \({\mathsf{vars}}({F_{0}})\cap{\mathsf{vars}}({F_{1}})\neq{\mathsf{key}}({F_{0}})\). Consequently, \({\mathsf{vars}}({F_{0}})\cap{\mathsf{vars}}({F_{1}})\) includes \({\mathsf{key}}({F_{1}})\).

Induction Step \(i\rightarrow i+1\). The induction hypothesis is that \({\mathsf{key}}({F_{i}})\subseteq{\mathsf{vars}}({F_{i-1}})\). Assume, towards a contradiction, \({\mathsf{vars}}({F_{i}})\cap{\mathsf{vars}}({F_{i+1}})={\mathsf{key}}({F_{i}})\). It follows \(x_{i+1}\in{\mathsf{vars}}({F_{i-1}})\). Then the witness (7) can be shortened by replacing the subsequence \(F_{i-1}\stackrel{x_{i}}{\smallfrown}F_{i}\stackrel{x_{i+1}}{\smallfrown}F_{i+1}\) with \(F_{i-1}\stackrel{x_{i+1}}{\smallfrown}F_{i+1}\), contradicting our assumption that no witness for \(F\overset{q}{\rightsquigarrow}G\) is shorter than (7). We conclude by contradiction that \({\mathsf{vars}}({F_{i}})\cap{\mathsf{vars}}({F_{i+1}})\neq{\mathsf{key}}({F_{i}})\). Consequently, \({\mathsf{vars}}({F_{i}})\cap{\mathsf{vars}}({F_{i+1}})\) includes \({\mathsf{key}}({F_{i+1}})\). □
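The case distinction used in the proof of Lemma 19 suggests a direct syntactic test for the key-join property. The sketch below assumes that the property requires, for every pair of atoms sharing at least one variable, that one of the two cases listed in the proof holds; this reading, the function name `has_key_join_property`, and the dictionary encoding of a query are assumptions of the example, not definitions taken from the article.

```python
from itertools import combinations


def has_key_join_property(query):
    """Check the key-join property on a self-join-free query.

    `query` maps an atom name to a pair (key_vars, all_vars) of sets.
    Assumed reading: for every two atoms F, G with a shared variable,
    vars(F) & vars(G) equals key(F) or key(G), or contains key(F) | key(G).
    """
    for (_, (key_f, vars_f)), (_, (key_g, vars_g)) in combinations(query.items(), 2):
        shared = vars_f & vars_g
        if not shared:
            continue  # atoms that share no variable impose no constraint
        if shared == key_f or shared == key_g:
            continue  # case 1 of the proof
        if shared >= key_f | key_g:
            continue  # case 2 of the proof
        return False
    return True


# R(x; y) joined with S(y; z): the shared variable y is exactly key(S).
foreign_key_join = {"R": ({"x"}, {"x", "y"}), "S": ({"y"}, {"y", "z"})}
# R(x; y) joined with S(z; y): the join is on two non-key positions.
non_key_join = {"R": ({"x"}, {"x", "y"}), "S": ({"z"}, {"y", "z"})}

print(has_key_join_property(foreign_key_join))
print(has_key_join_property(non_key_join))
```

The first toy query is a foreign-to-primary key join of the kind singled out in the abstract; the second is not.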

The proof of Theorem 4 can now be given.

Proof of Theorem 4

Assume that q has the key-join property. We show that the attack graph of q contains no strong attacks. To this end, assume \(F\stackrel {q}{\rightsquigarrow }G\). The sequence \(F_{0},F_{1},\dots ,F_{\ell -1}\) in the statement of Lemma 19 is a sequential proof for \({\mathcal {K}}({q})\models {{\mathsf {key}}({F_{0}})}\rightarrow {{\mathsf {key}}({F_{\ell }})}\), and therefore the attack \(F\overset {q}{\rightsquigarrow }G\) is weak. The result then follows from Theorem 3. □
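Whether \({\mathcal{K}}({q})\models{\mathsf{key}}({F})\rightarrow{\mathsf{key}}({G})\) holds can be tested with the textbook attribute-closure algorithm for functional dependencies. The following sketch is offered under that assumption; the helper names `key_fds`, `closure` and `implies`, and the encoding of atoms as pairs of variable sets, are hypothetical and chosen only for this illustration.

```python
def key_fds(query):
    """Build K(q): one dependency key(F) -> vars(F) per atom F of q."""
    return [(set(key), set(all_vars)) for key, all_vars in query.values()]


def closure(attrs, fds):
    """Standard attribute closure: saturate `attrs` under the given FDs."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed


def implies(fds, lhs, rhs):
    """True when the FDs logically imply lhs -> rhs."""
    return set(rhs) <= closure(lhs, fds)


# Hypothetical chain query R(x; y), S(y; z), T(z; w).
q = {
    "R": ({"x"}, {"x", "y"}),
    "S": ({"y"}, {"y", "z"}),
    "T": ({"z"}, {"z", "w"}),
}
fds = key_fds(q)
print(implies(fds, {"x"}, {"z"}))  # key(R) -> key(T)
print(implies(fds, {"z"}, {"x"}))  # key(T) -> key(R)
```

In the chain query, each key is contained in the variables of the preceding atom, which is the shape of sequence that Lemma 19 provides.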

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Database Theory (ICDT 2019)

Guest Editor: Pablo Barceló

This article extends an earlier, shorter version entitled “Consistent Query Answering for Primary Keys in Logspace” which was presented at the 22nd International Conference on Database Theory (ICDT 2019) [23] .

Appendices

Appendix A: Overview of Different Graphs and Notations

| Graph | Vertices | Edge Notation | Short Description |
|---|---|---|---|
| attack graph | query atoms | \(F\overset{q}{\rightsquigarrow}G\) | See Section 3. Informally, \(F\overset{q}{\rightsquigarrow}G\) means that there exists a “yes”-instance of CERTAINTY(q) in which two key-equal F-facts join with (and only with) two G-facts that are not key-equal (cf. [35, Proposition 6.4]). |
| M-graph | query atoms | \(F\stackrel{\mathsf{{~}_{M}}}{\longrightarrow}G\) | Definition 3. Informally, \(F\stackrel{\mathsf{{~}_{M}}}{\longrightarrow}G\) states that the functional dependency \({{\mathsf{vars}}({F})}\rightarrow{{\mathsf{key}}({G})}\) is a logical consequence of the primary keys in atoms of mode c. |
| ↪-graph | database facts | \(A\hookrightarrow B\) | Definition 4, data-level instantiation of the M-graph |
| ↪ C-graph | database facts | \(A\stackrel{{~}_{C}}{\hookrightarrow}B\) | Definition 5, subgraph of the ↪-graph induced by an M-cycle C |
| block-quotient graph | database blocks | \(({\mathbf{b}},{\mathbf{b}}^{\prime})\) | Definition 6, quotient graph of the ↪ C-graph relative to the equivalence relation “is key-equal to” |

| Notation | Meaning |
|---|---|
| key(F) | the set of all variables occurring in the primary key of atom F |
| vars(F) | the set of all variables occurring in atom F |
| vars(q) | the set of all variables occurring in query q |
| \(\sim\) | the equivalence relation “is key-equal to”, e.g., \(R(\underline{a},1)\sim R(\underline{a},2)\) |
| rset(db) | the set of all repairs of a database db |
| block(A,db) | the set of all facts in db that are key-equal to the fact A |
| \(R(\underline{\vec{a}},\ast)\) | the set of all database facts of the form \(R(\underline{\vec{a}},\vec{b})\), for some \(\vec{b}\) |
| sjfBCQ | the class of self-join-free Boolean conjunctive queries |
| UCQ | the class of unions of conjunctive queries |
| \(R^{c}\) | a relation name of mode c, which must be interpreted by a consistent relation |
| \({q}^{\mathsf{cons}}\) | the set of all atoms of query q having a relation name of mode c |
| \({\mathcal{K}}({q})\) | the set containing \({{\mathsf{key}}({F})}\rightarrow{{\mathsf{vars}}({F})}\) for every \(F\in q\) |
| \(F^{+,q}\) | the closure of key(F) with respect to the FDs in \({\mathcal{K}}({q\setminus\{F\}})\cup{\mathcal{K}}({{q}^{\mathsf{cons}}})\) |
| \({\mathsf{genre}}_{q}({A})\) | the atom of q with the same relation name as the fact A |
| V(G) | the vertex set of a graph G |
| E(G) | the edge set of a graph G |
| ⊎ | a set union that happens to be disjoint |
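The notions block(A,db), rset(db) and CERTAINTY(q) can be illustrated by a brute-force sketch that enumerates every repair. It is exponential in the number of blocks and is meant only to make the definitions concrete; it is not the logspace procedure studied in the article. The encoding of a fact as a `(relation, key, rest)` triple and the function names are assumptions of this example.

```python
from collections import defaultdict
from itertools import product


def blocks(db):
    """Group facts into blocks of key-equal facts (same relation, same key)."""
    grouped = defaultdict(list)
    for fact in db:
        relation, key, _rest = fact
        grouped[(relation, key)].append(fact)
    return list(grouped.values())


def rset(db):
    """Yield every repair of db: exactly one fact from each block."""
    for choice in product(*blocks(db)):
        yield set(choice)


def certainty(db, query_holds):
    """Brute-force CERTAINTY(q): does the query hold in every repair?"""
    return all(query_holds(repair) for repair in rset(db))


def q(repair):
    """Hypothetical query: some R-fact whose non-key value is the key of an S-fact."""
    return any(f[0] == "R" and g[0] == "S" and f[2] == g[1]
               for f in repair for g in repair)


r1, r2 = ("R", ("a",), (1,)), ("R", ("a",), (2,))   # key-equal, inconsistent
s1, s2 = ("S", (1,), ("c",)), ("S", (2,), ("c",))

print(certainty({r1, r2, s1, s2}, q))  # both repairs find a join partner
print(certainty({r1, r2, s1}, q))      # the repair keeping r2 has no partner
```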

Appendix B: Proofs of Section 5

B.1 Proofs of Lemmas 1 and 2

Proof of Lemma 1

Let \({\mathbf{o}}_{1}\) and \({\mathbf{o}}_{2}\) be garbage sets for \(q_{0}\) in db. For every \(i\in\{1,2\}\), we can assume a repair \({\mathbf{r}}_{i}\) of \({\mathbf{o}}_{i}\) such that

Garbage Condition: for every valuation \(\theta\) over vars(q) such that \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}_{i}})\cup{\mathbf{r}}_{i}\), we have \(\theta(q_{0})\cap{\mathbf{r}}_{i}=\emptyset\).

Let \({\mathbf {o}}_{2}^{-} = {\mathbf {o}}_{2}\setminus {\mathbf {o}}_{1}\) and \({\mathbf {r}}_{2}^{-} = {\mathbf {r}}_{2}\setminus {\mathbf {o}}_{1}\). Then, \({\mathbf {r}}_{1}\uplus {\mathbf {r}}_{2}^{-}\) is a repair of \({\mathbf {o}}_{1}\uplus {\mathbf {o}}_{2}^{-}\), where the use of ⊎ (rather than ∪) indicates that the operands of the union are disjoint. Let 𝜃 be an arbitrary valuation over vars(q) such that

$$\theta(q)\subseteq\left({\mathbf{db}\setminus({{\mathbf{o}}_{1}\uplus{\mathbf{o}}_{2}^{-}})}\right)\cup({{\mathbf{r}}_{1}\uplus{\mathbf{r}}_{2}^{-}}).$$

Then, \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}_{1}})\cup{\mathbf{r}}_{1}\). Consequently, by the Garbage Condition for i = 1, \(\theta(q_{0})\cap{\mathbf{r}}_{1}=\emptyset\), and therefore \(\theta(q_{0})\cap{\mathbf{o}}_{1}=\emptyset\). It follows \(\theta(q)\subseteq\left({\mathbf{db}\setminus({{\mathbf{o}}_{1}\cup{\mathbf{o}}_{2}})}\right)\cup{\mathbf{r}}_{2}^{-}\), hence \(\theta(q)\subseteq\left({\mathbf{db}\setminus{\mathbf{o}}_{2}}\right)\cup{\mathbf{r}}_{2}^{-}\). Consequently, by the Garbage Condition for i = 2, \(\theta(q_{0})\cap{\mathbf{r}}_{2}^{-}=\emptyset\). It follows that \({\mathbf{o}}_{1}\uplus{\mathbf{o}}_{2}^{-}={\mathbf{o}}_{1}\cup{\mathbf{o}}_{2}\) is a garbage set for \(q_{0}\) in db. □

Proof of Lemma 2

The ⇐-direction is trivial. For the ⇒-direction, assume that every repair of db satisfies q. We can assume a repair \({\mathbf{r}}_{0}\) of o such that for every valuation \(\theta\) over vars(q), if \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}})\cup{\mathbf{r}}_{0}\), then \(\theta(q_{0})\cap{\mathbf{r}}_{0}=\emptyset\). Let r be an arbitrary repair of \(\mathbf{db}\setminus{\mathbf{o}}\). It suffices to show \({\mathbf{r}}\models q\). Since \({\mathbf{r}}\cup{\mathbf{r}}_{0}\) is a repair of db, we can assume a valuation \(\theta\) over vars(q) such that \(\theta(q)\subseteq{\mathbf{r}}\cup{\mathbf{r}}_{0}\). Since \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}})\cup{\mathbf{r}}_{0}\) is obvious, it follows \(\theta(q)\cap{\mathbf{r}}_{0}=\emptyset\). Consequently, \(\theta(q)\subseteq{\mathbf{r}}\), hence \({\mathbf{r}}\models q\). This concludes the proof. □
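The Garbage Condition used in the proofs of Lemmas 1 and 2 can be checked by brute force on small instances. The sketch below assumes the condition exactly as phrased in those proofs: a set o qualifies if some repair r of o satisfies that every valuation embedding q into (db ∖ o) ∪ r maps no atom of q0 to a fact of r. The encoding of atoms as `(relation, key_vars, other_vars)`, of facts as `(relation, key, rest)`, the absence of constants in the query, and all function names are assumptions of this illustration.

```python
from collections import defaultdict
from itertools import product


def rset(facts):
    """Yield every repair: one fact from each block of key-equal facts."""
    grouped = defaultdict(list)
    for relation, key, rest in facts:
        grouped[(relation, key)].append((relation, key, rest))
    for choice in product(*grouped.values()):
        yield set(choice)


def embeddings(query, facts):
    """Yield theta(query), as a tuple of facts aligned with the atoms of
    `query`, for every valuation theta with theta(query) contained in `facts`."""
    def extend(i, theta, image):
        if i == len(query):
            yield tuple(image)
            return
        relation, key_vars, other_vars = query[i]
        for fact in facts:
            if fact[0] != relation:
                continue
            candidate = dict(theta)
            values = fact[1] + fact[2]
            # Bind each variable on first use, then require agreement.
            if all(candidate.setdefault(var, val) == val
                   for var, val in zip(key_vars + other_vars, values)):
                yield from extend(i + 1, candidate, image + [fact])
    yield from extend(0, {}, [])


def is_garbage_set(db, query, q0_relations, o):
    """Brute-force test of the Garbage Condition for the subquery q0,
    given by the relation names of its atoms."""
    for r in rset(o):
        available = (db - o) | r
        if all(fact not in r
               for image in embeddings(query, available)
               for atom, fact in zip(query, image)
               if atom[0] in q0_relations):
            return True
    return False


# Hypothetical instance: q = R(x; y), S(y; z) and q0 = q.
q = [("R", ("x",), ("y",)), ("S", ("y",), ("z",))]
db = {("R", ("a",), (1,)), ("R", ("a",), (2,)), ("S", (1,), ("c",))}
r_block = {fact for fact in db if fact[0] == "R"}
print(is_garbage_set(db, q, {"R", "S"}, r_block))
print(is_garbage_set(db, q, {"R", "S"}, {("S", (1,), ("c",))}))
```

In this toy instance the R-block qualifies because the repair keeping R(a,2) admits no embedding of q at all, which matches Lemma 2: neither db nor db without that block satisfies q in every repair.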

B.2 Proof of Lemma 3

We will use two helping lemmas.

Lemma 13

Let q be a query in sjfBCQ, and let \(q_{0}\subseteq q\). Let o be a garbage set for \(q_{0}\) in db. If p is the union of one or more blocks of o, then \({\mathbf{o}}\setminus{\mathbf{p}}\) is a garbage set for \(q_{0}\) in \(\mathbf{db}\setminus{\mathbf{p}}\).

Proof

Let p be the union of one or more blocks of o. We can assume a repair r of o such that for every valuation \(\theta\) over vars(q), if \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\), then \(\theta(q)\cap{\mathbf{r}}=\emptyset\). Let \({\mathbf{s}}={\mathbf{r}}\setminus{\mathbf{p}}\). Obviously, s is a repair of \({\mathbf{o}}\setminus{\mathbf{p}}\).

Let \(\theta\) be a valuation over vars(q) such that \(\theta(q)\subseteq\left({({\mathbf{db}\setminus{\mathbf{p}}})\setminus({{\mathbf{o}}\setminus{\mathbf{p}}})}\right)\cup{\mathbf{s}}\). It suffices to show \(\theta(q)\cap{\mathbf{s}}=\emptyset\). Since \(\left({\mathbf{db}\setminus{\mathbf{p}}}\right)\setminus\left({{\mathbf{o}}\setminus{\mathbf{p}}}\right)\subseteq\mathbf{db}\setminus{\mathbf{o}}\) and \({\mathbf{s}}\subseteq{\mathbf{r}}\), it follows \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\), hence \(\theta(q)\cap{\mathbf{r}}=\emptyset\). It follows \(\theta(q)\cap{\mathbf{s}}=\emptyset\). □

Corollary 1

Let q be a query in sjfBCQ, and let \(q_{0}\subseteq q\). Let o be a garbage set for \(q_{0}\) in db. If every garbage set for \(q_{0}\) in \(\mathbf{db}\setminus{\mathbf{o}}\) is empty, then o is the maximum garbage set for \(q_{0}\) in db.

Proof

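Proof by contraposition. Assume that o is not the maximum garbage set for \(q_{0}\) in db. Let \({\mathbf{o}}_{0}\) be the maximum garbage set for \(q_{0}\) in db. By Lemma 13, \({\mathbf{o}}_{0}\setminus{\mathbf{o}}\) is a nonempty garbage set for \(q_{0}\) in \(\mathbf{db}\setminus{\mathbf{o}}\). □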

Lemma 14

Let q be a query in sjfBCQ, and let \(q_{0}\subseteq q\). Let db be a database. If o is a garbage set for \(q_{0}\) in db, and p is a garbage set for \(q_{0}\) in \(\mathbf{db}\setminus{\mathbf{o}}\), then \({\mathbf{o}}\cup{\mathbf{p}}\) is a garbage set for \(q_{0}\) in db.

Proof

Assume the hypothesis holds. Note that \({\mathbf{o}}\cap{\mathbf{p}}=\emptyset\). We can assume a repair r of o such that for every valuation \(\theta\) over vars(q), if \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\), then \(\theta(q)\cap{\mathbf{r}}=\emptyset\). Likewise, we can assume a repair s of p such that for every valuation \(\theta\) over vars(q), if \(\theta(q)\subseteq\left({({\mathbf{db}\setminus{\mathbf{o}}})\setminus{\mathbf{p}}}\right)\cup{\mathbf{s}}\), then \(\theta(q)\cap{\mathbf{s}}=\emptyset\). Obviously, \({\mathbf{r}}\cup{\mathbf{s}}\) is a repair of \({\mathbf{o}}\cup{\mathbf{p}}\).

Let \(\theta\) be a valuation over vars(q) such that \(\theta(q)\subseteq\left({\mathbf{db}\setminus({{\mathbf{o}}\cup{\mathbf{p}}})}\right)\cup({{\mathbf{r}}\cup{\mathbf{s}}})\). From the set inclusion \(\left({\mathbf{db}\setminus({{\mathbf{o}}\cup{\mathbf{p}}})}\right)\cup({{\mathbf{r}}\cup{\mathbf{s}}})\subseteq({\mathbf{db}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\), it follows \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\), hence \(\theta(q)\cap{\mathbf{r}}=\emptyset\). Then, \(\theta(q)\subseteq\left({\mathbf{db}\setminus({{\mathbf{o}}\cup{\mathbf{p}}})}\right)\cup{\mathbf{s}}=\left({({\mathbf{db}\setminus{\mathbf{o}}})\setminus{\mathbf{p}}}\right)\cup{\mathbf{s}}\), hence \(\theta(q)\cap{\mathbf{s}}=\emptyset\). It follows \(\theta(q)\cap({\mathbf{r}}\cup{\mathbf{s}})=\emptyset\). □

Corollary 2

Let q be a query in sjfBCQ, and let \(q_{0}\subseteq q\). Let db be a database, and let o be the maximum garbage set for \(q_{0}\) in db. Then, every garbage set for \(q_{0}\) in \(\mathbf{db}\setminus{\mathbf{o}}\) is empty.

Proof

Immediate from Lemma 14. □

The proof of Lemma 3 can now be given.

Proof of Lemma 3

Immediate from Corollaries 1 and 2. □

Appendix C: Appendix to Section 7

C.1 Proofs of Lemmas 5 and 6

Proof of Lemma 5

We will write ⊕ for addition modulo k. We first consider garbage sets respecting the first three conditions.

  • Let A be a fact of db such that \({\mathsf {genre}}_{q}({A})\in \{F_{0},\dots ,F_{k-1}\}\) and A has zero outdegree in the ↪ C-graph. Then, there exists no valuation 𝜃 over vars(q) such that \(A\in \theta (q)\subseteq \mathbf {db}\). It is obvious that block(A,db) is a garbage set for C in db.

  • Let \(A_{0}\stackrel{{~}_{C}}{\hookrightarrow}A_{1}\stackrel{{~}_{C}}{\hookrightarrow}\dotsm\stackrel{{~}_{C}}{\hookrightarrow}A_{k-1}\stackrel{{~}_{C}}{\hookrightarrow}A_{0}\) be an irrelevant 1-embedding of C in db. Assume without loss of generality that for every \(i\in\{0,\dots,k-1\}\), \({\mathsf{genre}}_{q}({A_{i}})=F_{i}\). Let \({\mathbf{o}}=\bigcup_{i=0}^{k-1}{\mathsf{block}}({A_{i}},{\mathbf{db}})\). Let \({\mathbf{r}}=\{A_{0},\dots,A_{k-1}\}\), which is obviously a repair of o. We show that o is a garbage set for C in db. Assume, toward a contradiction, the existence of a valuation \(\theta\) over vars(q) such that for some \(i\in\{0,\dots,k-1\}\), \(A_{i}\in\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\). Then, \(\theta(F_{i})\stackrel{{~}_{C}}{\hookrightarrow}\theta(F_{i\oplus 1})\). Since \(\theta(F_{i})=A_{i}\), we have \(A_{i}\stackrel{{~}_{C}}{\hookrightarrow}\theta(F_{i\oplus 1})\). From \(A_{i}\stackrel{{~}_{C}}{\hookrightarrow}\theta(F_{i\oplus 1})\) and \(A_{i}\stackrel{{~}_{C}}{\hookrightarrow}A_{i\oplus 1}\), it follows \(\theta(F_{i\oplus 1})\sim A_{i\oplus 1}\) by Lemma 4. Since \(\theta(F_{i\oplus 1})\in({\mathbf{db}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\), it follows \(\theta(F_{i\oplus 1})=A_{i\oplus 1}\). By repeated application of the same reasoning, for every \(j\in\{0,\dots,k-1\}\), \(\theta(F_{j})=A_{j}\). But then \(A_{0}\stackrel{{~}_{C}}{\hookrightarrow}A_{1}\stackrel{{~}_{C}}{\hookrightarrow}\dotsm\stackrel{{~}_{C}}{\hookrightarrow}A_{k-1}\stackrel{{~}_{C}}{\hookrightarrow}A_{0}\) is a relevant 1-embedding of C in db, a contradiction.

  • Let r be a set containing all (and only) the facts of some n-embedding of C in db with n ≥ 2. Let \({\mathbf {o}}=\bigcup _{A\in {\mathbf {r}}}{\mathsf {block}}({A},{\mathbf {db}})\). It can be shown that o is a garbage set for C in db; the argumentation is analogous to the reasoning in the previous paragraph.

Let o0 be the minimal subset of db that satisfies all conditions in the statement of the lemma except the recursive Condition 4. By Lemma 1 and our reasoning in the previous items, it follows that o0 is a garbage set for C in db.

Note that the first three conditions do not recursively depend on o0. Starting with o0, construct a maximal sequence

$${\mathbf{o}}_{0},\mu_{0},{\mathbf{o}}_{1},\mu_{1},{\mathbf{o}}_{2},\mu_{2},\dots,{\mathbf{o}}_{m},\mu_{m},{\mathbf{o}}_{m+1}$$

such that \({\mathbf {o}}_{0}\subsetneq {\mathbf {o}}_{1}\subsetneq {\mathbf {o}}_{2}\subsetneq \dotsm \subsetneq {\mathbf {o}}_{m+1}\) and for every \(h\in \{0,1,\dots ,m\}\),

  1. \(\mu_{h}\) is a valuation over vars(q) such that \(\mu_{h}(q)\subseteq\mathbf{db}\) and \(\mu_{h}(q)\cap{\mathbf{o}}_{h}\neq\emptyset\). Therefore, \(\mu_{h}(F_{0})\stackrel{{~}_{C}}{\hookrightarrow}\mu_{h}(F_{1})\stackrel{{~}_{C}}{\hookrightarrow}\dotsm\stackrel{{~}_{C}}{\hookrightarrow}\mu_{h}(F_{k-1})\stackrel{{~}_{C}}{\hookrightarrow}\mu_{h}(F_{0})\) is a relevant 1-embedding of C in db; and

  2. \({\mathbf{o}}_{h+1}={\mathbf{o}}_{h}\cup\left({\bigcup_{i=0}^{k-1}{\mathsf{block}}({\mu_{h}(F_{i})},{\mathbf{db}})}\right)\).

It is clear that the final set \({\mathbf{o}}_{m+1}\) is a minimal set satisfying all conditions in the statement of the lemma. We show by induction on increasing h that for all \(h\in\{0,1,\dots,m,m+1\}\), \({\mathbf{o}}_{h}\) is a garbage set for C in db. We have already shown that \({\mathbf{o}}_{0}\) is a garbage set for C in db. For the induction step, \(h\rightarrow h+1\), the induction hypothesis is that \({\mathbf{o}}_{h}\) is a garbage set for C in db. Then, there exists a repair r of \({\mathbf{o}}_{h}\) such that for every valuation \(\theta\) over vars(q), if \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}_{h}})\cup{\mathbf{r}}\), then \(\theta(q)\cap{\mathbf{r}}=\emptyset\). For every \(i\in\{0,\dots,k-1\}\), define \(A_{i}\mathrel{\mathop:}=\mu_{h}(F_{i})\). Let \({\mathbf{s}}=\{A_{0},\dots,A_{k-1}\}\setminus{\mathbf{o}}_{h}\). We have \({\mathbf{o}}_{h+1}={\mathbf{o}}_{h}\uplus\left({\bigcup_{A_{j}\in{\mathbf{s}}}{\mathsf{block}}({A_{j}},{\mathbf{db}})}\right)\). Let \({\mathbf{r}}^{\prime}={\mathbf{r}}\uplus{\mathbf{s}}\). Obviously, \({\mathbf{r}}^{\prime}\) is a repair of \({\mathbf{o}}_{h+1}\). Here, we use ⊎, rather than ∪, to make clear that the operands of the union are disjoint.

Assume, toward a contradiction, the existence of a valuation \(\theta\) over vars(q) such that \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}_{h+1}})\cup{\mathbf{r}}^{\prime}\) and \(\theta(q)\cap{\mathbf{r}}^{\prime}\neq\emptyset\). Since \(({\mathbf{db}\setminus{\mathbf{o}}_{h+1}})\cup{\mathbf{r}}^{\prime}\subseteq({\mathbf{db}\setminus{\mathbf{o}}_{h}})\cup{\mathbf{r}}\), it follows \(\theta(q)\subseteq({\mathbf{db}\setminus{\mathbf{o}}_{h}})\cup{\mathbf{r}}\), hence \(\theta(q)\cap{\mathbf{r}}=\emptyset\) by our initial hypothesis. It must be the case that \(\theta(q)\cap{\mathbf{s}}\neq\emptyset\). We can assume \(i\in\{0,\dots,k-1\}\) such that \(A_{i}\in\theta(q)\cap{\mathbf{s}}\). We have \(\theta(F_{i})\stackrel{{~}_{C}}{\hookrightarrow}\theta(F_{i\oplus 1})\). Since \(\theta(F_{i})=A_{i}\), we have \(A_{i}\stackrel{{~}_{C}}{\hookrightarrow}\theta(F_{i\oplus 1})\). From \(A_{i}\stackrel{{~}_{C}}{\hookrightarrow}\theta(F_{i\oplus 1})\) and \(A_{i}\stackrel{{~}_{C}}{\hookrightarrow}A_{i\oplus 1}\), it follows \(\theta(F_{i\oplus 1})\sim A_{i\oplus 1}\) by Lemma 4. Therefore, \(\theta(F_{i\oplus 1})\in{\mathsf{block}}({A_{i\oplus 1}},{\mathbf{db}})\). Two cases are possible:

Case that \({\mathsf{block}}({A_{i\oplus 1}},{\mathbf{db}})\subseteq{\mathbf{o}}_{h}\):

Since \(\theta(F_{i\oplus 1})\in({\mathbf{db}\setminus{\mathbf{o}}_{h}})\cup{\mathbf{r}}\), it must be the case that \(\theta(F_{i\oplus 1})\in{\mathbf{r}}\). However, since we have previously argued that \(\theta(q)\cap{\mathbf{r}}=\emptyset\), we conclude that this case cannot occur.

Case that \({\mathsf{block}}({A_{i\oplus 1}},{\mathbf{db}})\not\subseteq{\mathbf{o}}_{h}\):

By our definition of s, we have \(A_{i\oplus 1}\in{\mathbf{s}}\). Since \(\theta(F_{i\oplus 1})\in({\mathbf{db}\setminus{\mathbf{o}}_{h+1}})\cup{\mathbf{r}}^{\prime}\), it must be the case that \(\theta(F_{i\oplus 1})\in{\mathbf{s}}\), and therefore \(\theta(F_{i\oplus 1})=A_{i\oplus 1}\).

From the above cases, it follows that \(A_{i\oplus 1}\in\theta(q)\cap{\mathbf{s}}\). By repeating the same reasoning, we obtain that \(A_{j}\in\theta(q)\cap{\mathbf{s}}\) for all \(j\in\{0,\dots,k-1\}\). Since \(\mu_{h}(q)\cap{\mathbf{o}}_{h}\neq\emptyset\) by our construction, we can assume the existence of \(\ell\in\{0,\dots,k-1\}\) such that \(A_{\ell}\in{\mathbf{o}}_{h}\), hence \(A_{\ell}\notin{\mathbf{s}}\), which contradicts our earlier finding that each \(A_{j}\) belongs to \(\theta(q)\cap{\mathbf{s}}\). This concludes the induction step. It is correct to conclude that \({\mathbf{o}}_{m+1}\) is a garbage set for C in db.

Let \(\mathbf{db}^{\prime}=\mathbf{db}\setminus{\mathbf{o}}_{m+1}\). We show that every garbage set for C in \(\mathbf{db}^{\prime}\) is empty. Assume, toward a contradiction, that o is a nonempty garbage set for C in \(\mathbf{db}^{\prime}\). We can assume a repair r of o such that for every valuation \(\theta\) over vars(q), if \(\theta(q)\subseteq({\mathbf{db}^{\prime}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\), then \(\theta(q)\cap{\mathbf{r}}=\emptyset\).

We show that for any \(A\in{\mathbf{r}}\), the ↪ C-graph contains an infinite path that starts from A such that any vertex on the path belongs to \(({\mathbf{db}^{\prime}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\) and any (contiguous) subpath of length k contains some fact from r. To this end, let A be a fact of r. By our construction, there exists a valuation μ over vars(q) such that \(A\in\mu(q)\subseteq\mathbf{db}^{\prime}\) (otherwise A would belong to \({\mathbf{o}}_{m+1}\)). Hence, \(\mu(F_{0})\stackrel{{~}_{C}}{\hookrightarrow}\mu(F_{1})\stackrel{{~}_{C}}{\hookrightarrow}\dotsm\stackrel{{~}_{C}}{\hookrightarrow}\mu(F_{k-1})\stackrel{{~}_{C}}{\hookrightarrow}\mu(F_{0})\) is a relevant 1-embedding of C in \(\mathbf{db}^{\prime}\) that contains A. Then, for some \(i\in\{0,\dots,k-1\}\), it must be the case that \(\mu(F_{i})\not\in({\mathbf{db}^{\prime}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\) (or else \(\mu(q)\subseteq({\mathbf{db}^{\prime}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\) and \(\mu(q)\cap{\mathbf{r}}\neq\emptyset\), a contradiction). Therefore, the ↪ C-graph contains a shortest path π of length < k from A to some fact \(B\in{\mathbf{o}}\setminus{\mathbf{r}}\). Then, there exists \(B^{\prime}\in{\mathbf{r}}\) such that \(B^{\prime}\sim B\) and the ↪ C-graph contains a path of length < k from A to \(B^{\prime}\). This path is obtained by substituting \(B^{\prime}\) for B in π. Since \(B^{\prime}\in{\mathbf{r}}\), we can continue the path by applying the same reasoning as for A. The path is illustrated by Fig. 10. Since the directed path is infinite, it has a shortest finite subpath of length ≥ k whose first vertex is key-equal to its last vertex. Let D be the last but one vertex on this subpath. Since the ↪ C-graph contains a directed edge from D to the first vertex of the subpath, it contains a cycle of some length nk with n ≥ 1. Since this cycle is obviously an n-embedding of C in \(\mathbf{db}^{\prime}=\mathbf{db}\setminus{\mathbf{o}}_{m+1}\), it must be a relevant 1-embedding of C in \(\mathbf{db}^{\prime}\) which, moreover, contains some fact of r. Therefore, there exists a valuation μ over vars(q) such that \(\mu(q)\subseteq({\mathbf{db}^{\prime}\setminus{\mathbf{o}}})\cup{\mathbf{r}}\) and \(\mu(q)\cap{\mathbf{r}}\neq\emptyset\), a contradiction.

Fig. 10 Illustration of the ↪ C-graph in the proof of Lemma 5. Every vertex is a fact, and the vertex labels indicate the set to which each vertex belongs. Vertices on the same horizontal line are key-equal. Dashed arrows represent (possibly empty) directed paths

Since every garbage set for C in \(\mathbf{db}\setminus{\mathbf{o}}_{m+1}\) is empty, it follows by Lemma 3 that \({\mathbf{o}}_{m+1}\) is the maximum garbage set for C in db. This concludes the proof. □

Proof of Lemma 6

For the first item, let \(A\stackrel {{~}_{C}}{\hookrightarrow }A^{\prime }\) be any edge of the n-embedding. We can assume \(F,F^{\prime }\in C\) such that \(F\stackrel {\mathsf {{~}_{M}}}{\longrightarrow } F^{\prime }\), \({\mathsf {genre}}_{q}({A})=F\), and \({\mathsf {genre}}_{q}({A^{\prime }})=F^{\prime }\). Then, the block-quotient graph will contain a directed edge from \({\mathsf {block}}({A},{\mathbf {db}})\) to \({\mathsf {block}}({A^{\prime }},{\mathbf {db}})\). It is then obvious that \(({\mathbf {b}}_{0},{\mathbf {b}}_{1},\dots ,{\mathbf {b}}_{nk-1},{\mathbf {b}}_{0})\) is a directed cycle in the block-quotient graph; this cycle is elementary because no two distinct facts of an n-embedding are key-equal.

For the second item, let \(i\in \{0,\dots ,nk-1\}\). Since \(({\mathbf {b}}_{i},{\mathbf {b}}_{i+1\mod nk})\) is an edge in the block-quotient graph, we can assume \(A_{i}\in {\mathbf {b}}_{i}\) and \(A^{\prime }\in {\mathbf {b}}_{i+1\mod nk}\) such that \(A_{i}\stackrel {{~}_{C}}{\hookrightarrow }A^{\prime }\). By Lemma 4, it will be the case that \(A_{0}\stackrel {{~}_{C}}{\hookrightarrow }A_{1}\stackrel {{~}_{C}}{\hookrightarrow }\dotsm \stackrel {{~}_{C}}{\hookrightarrow }{A_{nk-1}}\stackrel {{~}_{C}}{\hookrightarrow }A_{0}\). Furthermore, the latter ↪ C-cycle is an n-embedding. Indeed, since the cycle \(({\mathbf {b}}_{0},{\mathbf {b}}_{1},\dots ,{\mathbf {b}}_{nk-1},{\mathbf {b}}_{0})\) is elementary, no two distinct facts \(A_{i}\) are key-equal. This concludes the proof. □

C.2 Proof of Lemma 8

We will use the following auxiliary lemma. If G is a directed graph, then a directed cycle in G of length k is called a k-cycle.

Lemma 15

Let G = (V,E) be an instance of LONGCYCLE(k). Let \(\widehat {G}=(\widehat {V},\widehat {E})\) be the undirected graph whose vertices are the k-cycles of G. There is an undirected edge between any two distinct k-cycles \(P_{1}\) and \(P_{2}\) if \(V(P_{1})\cap V(P_{2})\neq \emptyset \). Then, the following are equivalent:

  1. \(\widehat {G}\) has a chordless cycle of length ≥ 2k or G has an elementary directed cycle of length nk with 2 ≤ n ≤ 2k − 3.

  2. G contains an elementary directed cycle of length ≥ 2k.

Proof

Since the graph G is k-partite, every k-cycle is elementary.

Assume that 1 holds true. The result is obvious if there exists n such that 2 ≤ n ≤ 2k − 3 and G has an elementary cycle of length nk. Assume next that \(\widehat {G}\) has a chordless elementary cycle \((P_{0}, P_{1}, \dots , P_{m-1}, P_{0})\) of length m ≥ 2k. We construct a cycle C in G using the following procedure. The construction will define a labeling function \(\ell \) from the vertices in C to \(\{0,1,\dots ,m-1\}\). It will be the case that \(w\in V(P_{\ell (w)})\) for every vertex w in C. We start with any vertex \(v_{0}\in V(P_{m-1})\cap V(P_{0})\) and define its label as \(\ell (v_{0})\mathrel {\mathop :}= 0\). At any point of the procedure, if we are at vertex u with label \(\ell (u)\), we choose the next vertex w in C to be the next vertex in the k-cycle \(P_{\ell (u)}\). If \(\ell (u)<m-1\) and w also belongs to \(P_{\ell (u)+1}\), we let \(\ell (w)\mathrel {\mathop :}=\ell (u)+1\); otherwise \(\ell (w)\mathrel {\mathop :}=\ell (u)\). The procedure terminates when we attempt to add a vertex that already exists in C, and therefore C will be elementary.

We first show that the termination condition will not be met for any vertex distinct from \(v_{0}\). Suppose, toward a contradiction, that the sequence constructed so far is \(C = \langle {v_{0}, v_{1}, \dots , v_{n}}\rangle \), \(\ell (v_{n})=i\leq m-1\), and the next vertex in \(P_{i}\) is some \(v_{j}\) with \(j\in \{1, \dots , n-1\}\). Since \(v_{j}\) belongs to both \(P_{i}\) and \(P_{\ell (v_{j})}\), it must be the case that \(\ell (v_{j})\geq i-1\), because otherwise \(\{P_{i},P_{\ell (v_{j})}\}\) is a chord in \((P_{0}, P_{1}, \dots , P_{m-1}, P_{0})\), a contradiction. We now distinguish two cases:

Case \(\ell (v_{j})=i-1\):

Then, \(v_{j}\in V(P_{i-1})\cap V(P_{i})\). By the procedure, this means that \(\ell (v_{j-1})=i-2\). Indeed, if \(\ell (v_{j-1})=i-1\), then the procedure would have set \(\ell (v_{j})\) to i, because \(v_{j}\) also belongs to \(P_{i}\). But then this also implies that \(v_{j}\in V(P_{i-2})\), a contradiction to the fact that the cycle is chordless.

Case \(\ell (v_{j})=i\):

Then the procedure reaches a vertex on \(P_{i}\) that has been visited before. Therefore, starting with this previously visited vertex on \(P_{i}\), the procedure has entirely traversed \(P_{i}\) without ever reaching a vertex of \(P_{i+1\mod m}\), contradicting that \(P_{i}\) and \(P_{i+1\mod m}\) have a vertex in common.

It is now clear that at some point we will reach \(v_{0}\). Indeed, when the label becomes m − 1, the procedure will follow the edges of \(P_{m-1}\) until it reaches \(v_{0}\). We have that \(\ell (v_{0})=0\), and the procedure is such that if some vertex has label i with i < m − 1, then there is a vertex with label i + 1. Therefore, for every \(i\in \{0,1,\dots ,m-1\}\), there exists at least one vertex u in C such that \(\ell (u)=i\). Therefore, C has at least m vertices. Since m ≥ 2k, the cycle C has length ≥ 2k.
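The labeling procedure above can be sketched in Python. This is only an illustrative sketch under stated assumptions: the function name `trace_cycle` and the list-based encoding of k-cycles are hypothetical, the input is assumed to be a chordless cycle \((P_{0},\dots ,P_{m-1},P_{0})\) of \(\widehat {G}\) given as vertex lists in cyclic order, and `min` is used merely as one arbitrary way of picking the start vertex.

```python
def trace_cycle(k_cycles):
    """Follow the labeling procedure on k-cycles P_0, ..., P_{m-1}.

    Each k-cycle is a list of vertices of G in cyclic order. Assumption:
    cyclically consecutive k-cycles share a vertex and the cycle they
    form in G-hat is chordless. Returns (path, label).
    """
    m = len(k_cycles)
    # succ[i][v] is the vertex that follows v on the k-cycle P_i.
    succ = [{c[j]: c[(j + 1) % len(c)] for j in range(len(c))} for c in k_cycles]
    members = [set(c) for c in k_cycles]
    # Start at some vertex common to P_{m-1} and P_0 (arbitrary choice: min).
    v0 = min(members[m - 1] & members[0])
    path, label = [v0], {v0: 0}
    u = v0
    while True:
        w = succ[label[u]][u]          # next vertex on the current k-cycle
        if w in label:                 # w was already added: terminate
            break
        if label[u] < m - 1 and w in members[label[u] + 1]:
            label[w] = label[u] + 1    # switch to the next k-cycle
        else:
            label[w] = label[u]        # stay on the current k-cycle
        path.append(w)
        u = w
    return path, label


# Four 2-cycles forming a chordless 4-cycle in G-hat (k = 2, m = 4).
cycles = [["v0", "v1"], ["v1", "v2"], ["v2", "v3"], ["v3", "v0"]]
path, label = trace_cycle(cycles)
```

On this small input the traced path visits one vertex per label, so it has m = 4 = 2k vertices, in line with the counting argument in the text.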

Assume that

  • G contains an elementary directed cycle of length ≥ 2k, and

  • for all 2 ≤ n ≤ 2k − 3, G contains no elementary directed cycle of length nk.

We will show that \(\widehat {G}\) contains a chordless cycle of length ≥ 2k.

We first introduce some notions that will be useful in the proof. A subpath of a directed path is a consecutive subsequence of edges of that path. Every path is a subpath of itself. We write start(π) and end(π) to denote, respectively, the first and the last vertex of a directed path π. If \(\mathsf {end}({\pi })=\mathsf {start}({\pi ^{\prime }})\), then \(\pi \cdot \pi ^{\prime }\) denotes the concatenation of paths π and \(\pi ^{\prime }\). The length of a (possibly closed) elementary path π is the number of edges it contains, and is denoted length(π).

Covering. Let O be an elementary cycle in G of length ≥ 2k. A seam in O is a subpath of O that is also a subpath of some k-cycle. Obviously, every seam in O has length < k. A covering of O is a set of edge-disjoint seams in O such that every edge of O is an edge of some seam in the set. Since every edge of G belongs to some k-cycle by our hypothesis, O has a covering. We define \({\mathit {seamlength}}({O})\mathrel {\mathop :}=\ell \) if O has a covering of cardinality \(\ell \) and every covering of O has cardinality ≥ \(\ell \).

Cyclic Ordering of the Seams in a Covering. Let \(C=\{S_{0},S_{1},\dots ,S_{\ell -1}\}\) be a covering of O. From here on, we will assume that the seams are listed such that a traversal of O that starts with \(\mathsf {start}({S_{0}})\) traverses the seams of C in the order \(S_{0},S_{1},\dots ,S_{\ell -1}\).

Let O be a directed cycle of length ≥ 2k that minimizes seamlength(⋅). From here on, \(\ell \) denotes seamlength(O). Thus, every elementary cycle \(O^{\prime }\) in G of length ≥ 2k satisfies \({\mathit {seamlength}}({O^{\prime }})\geq \ell \). Let \(\{S_{0},S_{1},\dots ,S_{\ell -1}\}\) be a covering of O.

Our hypothesis is that for every directed cycle of length nk in G such that n ≥ 2, we have n > 2k − 3. Consequently, length(O) ≥ (2k − 2)k. For every \(i\in \{0,\dots ,\ell -1\}\), we have \({\mathit {length}}({S_{i}})\leq k-1\) (because O is elementary with length(O) ≥ 2k). Therefore, \((2k-2)k\leq {\mathit {length}}({O})={\sum }_{i=0}^{\ell -1}{\mathit {length}}({S_{i}})\leq \ell (k-1)\), which implies \(\ell \geq 2k\).
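The final implication is a routine rearrangement. Writing \(\ell \) for seamlength(O) and assuming k ≥ 2 (so that k − 1 > 0, an assumption made explicit here only for the division), the two ends of the chain of inequalities give:

$$ \ell \;\geq\; \frac{(2k-2)k}{k-1} \;=\; \frac{2(k-1)k}{k-1} \;=\; 2k $$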

For every \(i\in \{0,\dots ,\ell -1\}\), let \(P_{i}\) be a k-cycle of which \(S_{i}\) is a subpath. We define the fitness of \(P_{i}\) as \({\mathit {length}}({S_{i}^{\prime }})\) if \(S_{i}^{\prime }\) is the longest subpath of \(P_{i}\) that has \(S_{i}\) as a subpath and that is still a seam in O. Note that the fitness of \(P_{i}\) is at least \({\mathit {length}}({S_{i}})\). For a reason that will become apparent shortly, if multiple choices for the k-cycle \(P_{i}\) are possible, we will choose a k-cycle with the greatest fitness. Assume, toward a contradiction, that the subgraph of \(\widehat {G}\) induced by \(\{P_{0},P_{1},\dots ,P_{\ell -1}\}\) has a cycle chord. We can assume without loss of generality \(m\in \{2,\dots ,\ell -2\}\) and a path \((P_{0},P_{1},\dots ,P_{m-1},P_{m})\) in \(\widehat {G}\) such that \(\{P_{0},P_{m}\}\in E(\widehat {G})\), while the paths \((P_{0},P_{1},\dots ,P_{m-1})\) and \((P_{1},\dots ,P_{m-1},P_{m})\) are chordless. From \(\{P_{0},P_{m}\}\in E(\widehat {G})\), it follows that \(V(P_{0})\cap V(P_{m})\neq \emptyset \). We have \(V(S_{0})\cap V(S_{m})=\emptyset \). Let π be the closed directed path in G that, starting from \(\mathsf {start}({S_{m}})\), traverses \(P_{m}\) until a vertex (call it x) of \(P_{0}\) is reached. From x on, the path π follows \(P_{0}\) until \(\mathsf {end}({S_{0}})\) is reached, and then traverses \(S_{1},S_{2},\dots ,S_{m-1}\). Note that it is possible that \(x\in V(S_{m})\) or \(x\in V(S_{0})\) (but not both). We argue next that π is an elementary cycle.

The edges of π that are not in O belong either to the subpath (call it \(\pi _{m}\)) of \(P_{m}\) that goes from \(\mathsf {end}({S_{m}})\) to x, or to the subpath (call it \(\pi _{0}\)) of \(P_{0}\) that goes from x to \(\mathsf {start}({S_{0}})\). Note that \(\pi _{m}\) exists only if \(x\notin V(S_{m})\), and \(\pi _{0}\) exists only if \(x\notin V(S_{0})\). Assume toward a contradiction that π is not elementary. From our hypotheses and construction, it must be the case that \(\pi _{m}\) intersects \(S_{m-1}\) in some vertex y, or that \(\pi _{0}\) intersects \(S_{1}\) in some vertex z. These possibilities are depicted in Fig. 11. If this happens, however, \(P_{m}^{\prime }\) and \(P_{0}^{\prime }\) have a strictly greater fitness than \(P_{m}\) and \(P_{0}\), contradicting that we chose k-cycles with the greatest fitness. Here, \(P_{m}^{\prime }\) is the k-cycle that, starting from \(\mathsf {end}({S_{m-1}})=\mathsf {start}({S_{m}})\), traverses \(P_{m}\) until y, and then follows \(P_{m-1}\) from y until \(\mathsf {end}({S_{m-1}})\). Similarly, \(P_{0}^{\prime }\) is the k-cycle that, starting from \(\mathsf {end}({S_{0}})=\mathsf {start}({S_{1}})\), traverses \(P_{1}\) until z, and then follows \(P_{0}\) from z until \(\mathsf {end}({S_{0}})\). To see that \(P_{m}^{\prime }\) has a strictly greater fitness than \(P_{m}\), note that the subpath of \(P_{m}^{\prime }\) from y to \(\mathsf {end}({S_{m}})\) is a seam of O. Since \(x\notin V(S_{m-1})\), \(P_{m}\) will cover a strictly smaller suffix of \(S_{m-1}\) than \(P_{m}^{\prime }\) does.

Fig. 11 Two possible configurations. \(P_{0}\) is the left outermost closed curve containing \(S_{0}\), x, and z; \(P_{m}\) is the right outermost closed curve containing \(S_{m}\), y, and x

We show that both length(π) = k and length(π) ≥ 2k lead to a contradiction.

  • Assume that π is a k-cycle. Then either \(S_{0}\cdot S_{1}\cdot \dotsm \cdot S_{m-1}\) is a seam of O or \(S_{1}\cdot S_{2}\cdot \dotsm \cdot S_{m}\) is a seam of O. Since m ≥ 2, we can use π to construct a covering of O of cardinality < \(\ell \), a contradiction.

  • Assume that length(π) ≥ 2k. It can be easily seen that π has a covering of cardinality m + 1 < \(\ell \), which contradicts our assumption about O.

The proof of Lemma 8 can now be given.

Proof of Lemma 8

Let G = (V,E) be an instance of LONGCYCLE(k). Let \(\widehat {G}=(\widehat {V},\widehat {E})\) be the undirected graph defined in the statement of Lemma 15. Obviously, it suffices to show that Condition 1 in the statement of Lemma 15 can be expressed in SymStratDatalog.

All elementary cycles in G of length nk for 2 ≤ n ≤ 2k − 3 can obviously be found in FO. We now outline a program in SymStratDatalog that tests for the existence of chordless cycles in \(\widehat {G}\) of length ≥ 2k. The graph \(\widehat {G}\) can be constructed in SymStratDatalog. Then, the existence of a chordless cycle of length ≥ 2k can be tested as follows: Check whether there exists a path \((P_{0},P_{1},P_{2},\dots ,P_{2k-2},P_{2k-1},P_{2k})\) such that (i) the subpath \((P_{1},\dots ,P_{2k-1})\) is elementary and chordless, and (ii) the endpoints \(P_{0}\) and \(P_{2k}\) are also connected by another (possibly single-vertex) path that uses no vertex that is equal or adjacent to a vertex in \(\{P_{2},\dots ,P_{2k-2}\}\). In particular, \(P_{0}\) and \(P_{2k}\) themselves must then be distinct from and not adjacent to the vertices in \(\{P_{2},\dots ,P_{2k-2}\}\), and, consequently, \(P_{0}\neq P_{1}\) and \(P_{2k}\neq P_{2k-1}\). The single-vertex path occurs if \(P_{0}=P_{2k}\).

We now give the details of the SymStratDatalog program. The following rule states that the vertices of \(\widehat {G}\) are the k-cycles of G.

$$ \widehat{V}(x_{0},\dots,x_{k-1}) \leftarrow E(x_{0},x_{1}),E(x_{1},x_{2}),\dots,E(x_{k-2},x_{k-1}),E(x_{k-1},x_{0}) $$

Note incidentally that every k-cycle is stored k times in this way. Since the graph G is k-circle-layered (see Definition 7), we can assume some fixed partition \(V_{0},V_{1},\dots ,V_{k-1}\) of the vertex set V. We will say that the IDB fact \(\widehat {V}(a_{0},\dots ,a_{k-1})\) is of class \(V_{i}\) if \(a_{0}\in V_{i}\). Thus, if \(\widehat {V}(a_{0},a_{1},\dots ,a_{k-1})\) is of class \(V_{i}\), then \(\widehat {V}(a_{1},\dots ,a_{k-1},a_{0})\) is of class \(V_{i+1\mod k}\). If one partition class were given as part of the input, for example as EDB facts \(V_{0}(a)\), then an optimization would consist in adding \(V_{0}(x_{0})\) to the body of the previous rule.

We will need an equality test on vertices of \(\widehat {G}\):

$$ \mathit{Eq}(x_{0},\dots,x_{k-1};x_{0},\dots,x_{k-1}) \leftarrow \widehat{V}(x_{0},\dots,x_{k-1}) $$

The use of the semicolon is for readability only. The following rules compute edges in \(\widehat {G}\). For every \(\ell \in \{0,\dots ,k-1\}\), add the rules:

$$ \mathit{\widehat{E}}(x_{0},\dots,x_{k-1};y_{0},\dots,y_{k-1}) \leftarrow \left\{ \begin{array}{l} \widehat{V}(x_{0},\dots,x_{k-1}),\widehat{V}(y_{0},\dots,y_{k-1}),\\[1.0ex] \neg\mathit{Eq}(x_{0},\dots,x_{k-1};y_{0},\dots,y_{k-1}),\\[1.0ex] x_{\ell}=y_{\ell} \end{array} \right. $$
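To make the rules for \(\widehat {V}\) and \(\widehat {E}\) concrete, here is a small Python sketch that evaluates them directly on an edge list. The function names `k_cycles` and `hat_edges` are hypothetical; the code is a naive evaluation for illustration, not the logspace evaluation that the paper relies on, and the example graph is an assumption chosen to be 2-circle-layered.

```python
def k_cycles(edges, k):
    """All tuples (x_0, ..., x_{k-1}) with E(x_0,x_1), ..., E(x_{k-1},x_0).

    Mirrors the rule for V-hat literally: each k-cycle appears once per
    rotation, that is, k times.
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    result = set()

    def extend(path):
        if len(path) == k:
            if path[0] in succ.get(path[-1], ()):   # closing edge E(x_{k-1}, x_0)
                result.add(tuple(path))
            return
        for w in succ.get(path[-1], ()):
            extend(path + [w])

    for u in succ:
        extend([u])
    return result


def hat_edges(vhat, k):
    """Pairs of distinct V-hat tuples that agree at some position l."""
    return {
        (p, q)
        for p in vhat
        for q in vhat
        if p != q and any(p[l] == q[l] for l in range(k))
    }


# Example with k = 2: layers V_0 = {0, 2} and V_1 = {1}.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
vhat = k_cycles(edges, 2)
ehat = hat_edges(vhat, 2)
```

In this example the two 2-cycles are each stored twice, and only tuples of the same class are linked: (1,0) with (1,2), and (0,1) with (2,1).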

Note that whenever \(\mathit {\widehat {E}}(a_{0},\dots ,a_{k-1};b_{0},\dots ,b_{k-1})\) holds true, then \(\widehat {V}(a_{0},\dots ,a_{k-1})\) and \(\widehat {V}(b_{0},\dots ,b_{k-1})\) will be IDB \(\widehat {V}\)-facts of the same class. In fact, it is sufficient to compute chordless cycles all of whose \(\widehat {V}\)-facts are of the same class. From here on, we write \(\vec {x}\) for the sequence \(\langle {x_{0},\dots ,x_{k-1}}\rangle \). Superscripts are used to create new variables: \(x^{(i)}\) and \(x^{(j)}\) are distinct variables unless i = j. Finally, \({\vec {x}}^{(i)}\) is the sequence \({x_{0}}^{(i)},\dots ,{x_{k-1}}^{(i)}\). Likewise for \(\vec {y}=\langle {y_{0},\dots ,y_{k-1}}\rangle \), \(\vec {z}=\langle {z_{0},\dots ,z_{k-1}}\rangle \), and \(\vec {w}=\langle {w_{0},\dots ,w_{k-1}}\rangle \). Add the following rule, as well as its symmetric rule:

$$ \mathit{UCon}(\vec{x},\vec{y},{\vec{z}}^{(1)},\dots,{\vec{z}}^{(2k-3)}) \leftarrow \left\{ \begin{array}{l} \mathit{UCon}(\vec{x},\vec{w},{\vec{z}}^{(1)},\dots,{\vec{z}}^{(2k-3)}), \widehat{E}(\vec{w},\vec{y}),\\ \\ \left\{\neg\mathit{Eq}(\vec{w},{\vec{z}}^{(i)})\right\}_{i=1}^{2k-3}, \left\{\neg\widehat{E}(\vec{w},{\vec{z}}^{(i)})\right\}_{i=1}^{2k-3}\\ \\ \left\{\neg\mathit{Eq}(\vec{y},{\vec{z}}^{(i)})\right\}_{i=1}^{2k-3}, \left\{\neg\widehat{E}(\vec{y},{\vec{z}}^{(i)})\right\}_{i=1}^{2k-3} \end{array} \right. $$

\(\mathit {UCon}(\vec {a},\vec {b},\vec {c}_{1},\dots ,\vec {c}_{2k-3})\) holds true if \(\widehat {G}\) contains an undirected path between \(\vec {a}\) and \(\vec {b}\) such that no vertex on the path is equal or adjacent to some \(\vec {c}_{i}\). The basis of the recursion is the following rule:

$$ \mathit{UCon}(\vec{x},\vec{x},{\vec{z}}^{(1)},\dots,{\vec{z}}^{(2k-3)}) \leftarrow \left\{ \begin{array}{l} \widehat{V}(\vec{x}),\widehat{V}({\vec{z}}^{(1)}),\dots,\widehat{V}({\vec{z}}^{(2k-3)}),\\[1.0ex] \left\{\neg\mathit{Eq}(\vec{x},{\vec{z}}^{(i)})\right\}_{i=1}^{2k-3}, \left\{\neg\widehat{E}(\vec{x},{\vec{z}}^{(i)})\right\}_{i=1}^{2k-3} \end{array} \right. $$
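The meaning of UCon can be illustrated with a breadth-first search that never enters a vertex equal or adjacent to a forbidden vertex. The sketch below is an assumption-laden illustration: the name `ucon` and the adjacency-dictionary encoding of \(\widehat {G}\) are hypothetical, and a plain search is used in place of the symmetric recursion of the Datalog rules.

```python
from collections import deque


def ucon(adj, a, b, forbidden):
    """True if adj has an undirected path from a to b on which no vertex
    is equal or adjacent to a vertex in `forbidden`.

    adj maps each vertex to the set of its neighbours (symmetric).
    """
    blocked = set(forbidden)
    for c in forbidden:
        blocked |= adj.get(c, set())      # vertices adjacent to some c_i
    if a in blocked or b in blocked:      # endpoints are on the path too
        return False
    seen, queue = {a}, deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return True
        for w in adj.get(u, ()):
            if w not in blocked and w not in seen:
                seen.add(w)
                queue.append(w)
    return False


# Undirected 6-cycle 0 - 1 - 2 - 3 - 4 - 5 - 0.
adj = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
```

With vertex 2 forbidden, vertices 1, 2, and 3 are blocked, so 0 can reach 4 only via 5; with vertex 5 forbidden, both 0 and 4 are blocked.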

Finally, the following rule tests for the existence of a chordless cycle in \(\widehat {G}\) of length ≥ 2k.

$$ \mathit{Chordless}() \leftarrow \left\{ \begin{array}{l} \widehat{E}({\vec{x}}^{(0)},{\vec{x}}^{(1)}),\widehat{E}({\vec{x}}^{(1)},{\vec{x}}^{(2)}),\dots,\widehat{E}({\vec{x}}^{(2k-1)},{\vec{x}}^{(2k)}),\\[1.0ex] \left\{\neg\mathit{Eq}({\vec{x}}^{(i)},{\vec{x}}^{(j)})\right\}_{1\leq i<j\leq 2k-1},\\ \\ \left\{\neg\widehat{E}({\vec{x}}^{(i)},{\vec{x}}^{(j)})\right\}_{1\leq i<i+1<j\leq 2k-1},\\ \\ \mathit{UCon}({\vec{x}}^{(0)},{\vec{x}}^{(2k)},{\vec{x}}^{(2)},\dots,{\vec{x}}^{(2k-2)}) \end{array} \right. $$

This concludes the proof. □

C.3 Illustration of the Datalog Program in the Proof of Lemma 9

The following example illustrates the Datalog program in the proof of Lemma 9.

Example 5

Let \(q=\{R(\underline {x},y,z), S(\underline {y},x,z), U(\underline {z},a)\}\), where a is a constant. We show a program in symmetric stratified Datalog that computes the garbage set for the M-cycle \(C=R(\underline {x},y,z)\stackrel {\mathsf {{~}_{M}}}{\longrightarrow } S(\underline {y},x,z)\stackrel {\mathsf {{~}_{M}}}{\longrightarrow } R(\underline {x},y,z)\). In this example, k = 2. The program is constructed as in the proof of Lemma 9.

R-facts and S-facts belong to the maximum garbage set if they do not belong to a relevant 1-embedding. This is expressed by the following rules.

$$ \begin{array}{@{}rcl@{}} \mathsf{Rlvant{R}}(x,y,z) &\leftarrow& R(x,y,z), S(y,x,z), U(z,a)\\ \mathsf{Garbage{R}}(x) &\leftarrow& R(x,y,z), \neg\mathsf{Rlvant{R}}(x,y,z)\\ \mathsf{Rlvant{S}}(y,x,z) &\leftarrow& R(x,y,z), S(y,x,z), U(z,a)\\ \mathsf{Garbage{S}}(y) &\leftarrow& S(y,x,z), \neg\mathsf{Rlvant{S}}(y,x,z) \end{array} $$

If some R-fact or S-fact of a relevant 1-embedding belongs to the maximum garbage set, then every fact of that 1-embedding belongs to the maximum garbage set. This is expressed by the following rules.

$$ \begin{array}{@{}rcl@{}} \mathsf{Garbage{R}}(x) &\leftarrow& R(x,y,z), S(y,x,z), U(z,a), \mathsf{Garbage{S}}(y)\\ \mathsf{Garbage{S}}(y) &\leftarrow& R(x,y,z), S(y,x,z), U(z,a), \mathsf{Garbage{R}}(x) \end{array} $$
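The RlvantR, RlvantS, GarbageR, and GarbageS rules given so far can be evaluated by a simple fixpoint loop. The following Python sketch covers only those rules (not the rules for irrelevant 1-embeddings or long cycles that come later); the function name `garbage_blocks`, the tuple encoding of facts, and the sample database are assumptions made for illustration.

```python
def garbage_blocks(R, S, U, a):
    """Evaluate the Rlvant/Garbage rules of Example 5 by naive iteration.

    R holds triples (x, y, z), S holds triples (y, x, z), U holds pairs
    (z, w), and a is the constant of the query. Returns the key values
    (blocks) in GarbageR and GarbageS.
    """
    # Valuations satisfying R(x,y,z), S(y,x,z), U(z,a).
    joins = {(x, y, z) for (x, y, z) in R if (y, x, z) in S and (z, a) in U}
    # Facts outside every such valuation put their whole block in the garbage.
    garbage_r = {x for (x, y, z) in R if (x, y, z) not in joins}
    garbage_s = {y for (y, x, z) in S if (x, y, z) not in joins}
    changed = True
    while changed:                        # propagate along relevant 1-embeddings
        changed = False
        for (x, y, z) in joins:
            if y in garbage_s and x not in garbage_r:
                garbage_r.add(x)
                changed = True
            if x in garbage_r and y not in garbage_s:
                garbage_s.add(y)
                changed = True
    return garbage_r, garbage_s


R = {(1, "b", 7), (1, "c", 8), (2, "d", 7)}
S = {("b", 1, 7), ("d", 2, 7)}
U = {(7, "a")}
result = garbage_blocks(R, S, U, "a")
```

Here R(1,c,8) has no matching S-fact, so block 1 of R becomes garbage, which in turn drags in block b of S through the 1-embedding formed by R(1,b,7) and S(b,1,7).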

Note that the predicates GarbageR and GarbageS refer to blocks: whenever a fact is added to the garbage set, its entire block is added. The following rules compute irrelevant 1-embeddings.

$$ \begin{array}{@{}rcl@{}} \mathsf{Any1Emb}(x,y,z,y^{\prime},x^{\prime},z^{\prime}) &\leftarrow& \left\{ \begin{array}{l} R(x,y,z), S(y,x,z), U(z,a),\\ R(x^{\prime},y^{\prime},z^{\prime}), S(y^{\prime},x^{\prime},z^{\prime}), U(z^{\prime},a),\\ x=x^{\prime}, y=y^{\prime} \end{array} \right.\\ \mathsf{Rel1Emb}(x,y,z,y,x,z) &\leftarrow& R(x,y,z), S(y,x,z), U(z,a)\\ \mathsf{Irr1Emb}(x,y^{\prime}) &\leftarrow& \mathsf{Any1Emb}(x,y,z,y^{\prime},x^{\prime},z^{\prime}),\\ &&\neg\mathsf{Rel1Emb}(x,y,z,y^{\prime},x^{\prime},z^{\prime}) \end{array} $$

The predicate \(\mathsf {\widehat {E}}\) is used for edges between vertices; each vertex is a (x,y)-value. The predicate Eq expresses equality of vertices.

$$ \mathsf{Eq}({x},{y},{x},{y}) \leftarrow R(x,y,z), S(y,x,z), U(z,a) $$
$$ \begin{array}{@{}rcl@{}} \mathsf{\widehat{E}}({x},{y},{x^{\prime}},{y^{\prime}}) \leftarrow \left\{ \begin{array}{l} R(x,y,z), S(y,x,z), U(z,a),\\ R(x^{\prime},y^{\prime},z^{\prime}), S(y^{\prime},x^{\prime},z^{\prime}), U(z^{\prime},a),\\ \neg\mathsf{Eq}({x},{y},{x^{\prime}},{y^{\prime}}), x=x^{\prime} \end{array} \right.\\ \mathsf{\widehat{E}}({x},{y},{x^{\prime}},{y^{\prime}}) \leftarrow \left\{ \begin{array}{l} R(x,y,z), S(y,x,z), U(z,a),\\ R(x^{\prime},y^{\prime},z^{\prime}), S(y^{\prime},x^{\prime},z^{\prime}), U(z^{\prime},a),\\ \neg\mathsf{Eq}({x},{y},{x^{\prime}},{y^{\prime}}), y=y^{\prime} \end{array} \right. \end{array} $$

The predicate UCon is used for undirected connectivity of the \(\mathsf {\widehat {E}}\)-predicate. In particular, it will be the case that UCon(a1,b1,a2,b2,a3,b3) holds true if there exists a path between vertices (a1,b1) and (a2,b2) such that no vertex on the path is equal or adjacent to (a3,b3). Recall that each vertex is itself a pair.

$$ \begin{array}{@{}rcl@{}} \mathsf{UCon}({x}_{1},{y}_{1},{x}_{1},{y}_{1},{x}_{3},{y}_{3}) &\leftarrow& \left\{ \begin{array}{l} R(x_{1},y_{1},z_{1}), S(y_{1},x_{1},z_{1}), U(z_{1},a),\\ R(x_{3},y_{3},z_{3}), S(y_{3},x_{3},z_{3}), U(z_{3},a),\\ \neg\mathsf{Eq}({x}_{1},{y}_{1},{x}_{3},{y}_{3}), \neg\mathsf{\widehat{E}}({x}_{1},{y}_{1},{x}_{3},{y}_{3}) \end{array} \right.\\ \mathsf{UCon}({x}_{1},{y}_{1},{x}_{2},{y}_{2},{x}_{3},{y}_{3}) &\leftarrow& \left\{ \begin{array}{l} \mathsf{UCon}({x}_{1},{y}_{1},{x}_{\dagger},{y}_{\dagger},{x}_{3},{y}_{3}), \mathsf{\widehat{E}}({x}_{\dagger},{y}_{\dagger},{x}_{2},{y}_{2}),\\ \neg\mathsf{Eq}({x}_{\dagger},{y}_{\dagger},{x}_{3},{y}_{3}),\neg\mathsf{\widehat{E}}({x}_{\dagger},{y}_{\dagger},{x}_{3},{y}_{3}),\\ \neg\mathsf{Eq}({x}_{2},{y}_{2},{x}_{3},{y}_{3}),\neg\mathsf{\widehat{E}}({x}_{2},{y}_{2},{x}_{3},{y}_{3}) \end{array} \right.\\ \mathsf{UCon}({x}_{1},{y}_{1},{x}_{\dagger},{y}_{\dagger},{x}_{3},{y}_{3}) &\leftarrow& \left\{ \begin{array}{l} \mathsf{UCon}({x}_{1},{y}_{1},{x}_{2},{y}_{2},{x}_{3},{y}_{3}), \mathsf{\widehat{E}}({x}_{\dagger},{y}_{\dagger},{x}_{2},{y}_{2}),\\ \neg\mathsf{Eq}({x}_{\dagger},{y}_{\dagger},{x}_{3},{y}_{3}),\neg\mathsf{\widehat{E}}({x}_{\dagger},{y}_{\dagger},{x}_{3},{y}_{3}),\\ \neg\mathsf{Eq}({x}_{2},{y}_{2},{x}_{3},{y}_{3}),\neg\mathsf{\widehat{E}}({x}_{2},{y}_{2},{x}_{3},{y}_{3}) \end{array} \right. \end{array} $$

The latter two rules are each other’s symmetric version. The following rule checks whether a vertex (a1,b1) belongs to a chordless \(\mathsf {\widehat {E}}\)-cycle of length ≥ 2k.

$$ \begin{array}{@{}rcl@{}} \mathsf{InLongUCycle}({x}_{1},{y}_{1}) \leftarrow \left\{ \begin{array}{l} \mathsf{\widehat{E}}({x}_{0},{y}_{0},{x}_{1},{y}_{1}), \mathsf{\widehat{E}}({x}_{1},{y}_{1},{x}_{2},{y}_{2}),\\ \mathsf{\widehat{E}}({x}_{2},{y}_{2},{x}_{3},{y}_{3}), \mathsf{\widehat{E}}({x}_{3},{y}_{3},{x}_{4},{y}_{4}),\\ \neg\mathsf{\widehat{E}}({x}_{1},{y}_{1},{x}_{3},{y}_{3}),\\ \neg\mathsf{Eq}({x}_{1},{y}_{1},{x}_{2},{y}_{2}), \neg\mathsf{Eq}({x}_{1},{y}_{1},{x}_{3},{y}_{3}), \neg\mathsf{Eq}({x}_{2},{y}_{2},{x}_{3},{y}_{3}),\\ \mathsf{UCon}({x}_{0},{y}_{0},{x}_{4},{y}_{4},{x}_{2},{y}_{2}) \end{array} \right. \end{array} $$

The following rules add to the maximum garbage sets all R-facts and S-facts that belong to an irrelevant 1-embedding or to a strong component of the ↪ C-graph that contains an elementary ↪ C-cycle of length ≥ 2k. Whenever a fact is added, all facts of its block are added.

$$ \begin{array}{@{}rcl@{}} \mathsf{Garbage{R}}(x) &\leftarrow& \mathsf{InLongUCycle}(x,y)\\ \mathsf{Garbage{S}}(y) &\leftarrow& \mathsf{InLongUCycle}(x,y)\\ \mathsf{Garbage{R}}(x) &\leftarrow& \mathsf{Irr1Emb}(x,y)\\ \mathsf{Garbage{S}}(y) &\leftarrow& \mathsf{Irr1Emb}(x,y) \end{array} $$

This terminates the computation of the garbage set. In general, we have to check the existence of elementary ↪ C-cycles of length nk with 2 ≤ n ≤ 2k − 3. However, for k = 2, no such n exists.

C.4 Proof of Lemma 10

Proof of Lemma 10

Let \(q^{\prime }=({q\setminus C})\cup \{T\}\). For every \(i\in \{0,1,\dots ,k-1\}\), let Fi = \(R_{i}(\underline {\vec {x}_{i}},\vec {y}_{i})\). Here is an informal visual representation of the different queries involved:

figure c

Proof of the First Item We show the existence of a reduction from CERTAINTY(q) to the problem \({\mathsf {CERTAINTY}}({q^{\prime }\cup p})\) that is expressible in \({\mathit {SymStratDatalog}}^{\min \limits }\). We first describe the reduction, and then show that it can be expressed in \({\mathit {SymStratDatalog}}^{\min \limits }\).

Let \(\mathbf {db}_{0}\) be a database that is input to CERTAINTY(q). By Lemma 9, we can compute in symmetric stratified Datalog the maximum garbage set o for C in \(\mathbf {db}_{0}\). Let \(\mathbf {db}=\mathbf {db}_{0}\setminus {\mathbf {o}}\). We know, by Lemma 2, that the problem CERTAINTY(q) has the same answer on instances \(\mathbf {db}_{0}\) and db. Moreover, by Lemma 3, every garbage set for C in db is empty, which implies, by Lemma 5, that (i) every n-embedding of C in db must be a relevant 1-embedding, and (ii) every fact A with \({\mathsf {genre}}_{q}({A})\in C\) belongs to a 1-embedding. The reduction will now encode all these 1-embeddings as T-facts.

We show that every directed edge of the ↪ C-graph belongs to a directed cycle. To this end, take any edge \(A\stackrel {{~}_{C}}{\hookrightarrow }B\). Since every garbage set for C in db is empty, the ↪ C-graph contains a relevant 1-embedding containing A, and a relevant 1-embedding containing B. Let \(A^{\prime }\) be the fact such that \(A^{\prime }\stackrel {{~}_{C}}{\hookrightarrow }B\) is a directed edge in the 1-embedding containing B. Let \(B^{\prime }\) be the fact such that \(A\stackrel {{~}_{C}}{\hookrightarrow }B^{\prime }\) is a directed edge in the 1-embedding containing A. Since \(A\stackrel {{~}_{C}}{\hookrightarrow }B\) and \(A\stackrel {{~}_{C}}{\hookrightarrow }B^{\prime }\), it follows \(B\sim B^{\prime }\) by Lemma 4. From \(A^{\prime }\stackrel {{~}_{C}}{\hookrightarrow }B\) and \(B\sim B^{\prime }\), it follows \(A^{\prime }\stackrel {{~}_{C}}{\hookrightarrow }B^{\prime }\). Thus, the ↪ C-graph contains a directed path from B to \(A^{\prime }\), an edge from \(A^{\prime }\) to \(B^{\prime }\), and a directed path from \(B^{\prime }\) to A. Consequently, the ↪ C-graph contains a directed path from B to A.

It follows that every strong component of the ↪ C-graph is initial. It can be easily seen that if an initial strong component contains some fact A, then it contains every fact that is key-equal to A. Let r be a repair of db. For every fact \(A\in {\mathbf {r}}\), there exists a unique fact \(B\in {\mathbf {r}}\) such that \(A\stackrel {{~}_{C}}{\hookrightarrow }B\). It follows that r must contain an elementary ↪ C-cycle, which must be a relevant 1-embedding (because every garbage set for C in db is empty) belonging to the same initial strong component as A. It can also be seen that there exists a repair that contains exactly one such 1-embedding for every strong component of the ↪ C-graph.

We define an undirected graph G as follows: for each valuation μ over vars(q) such that \(\mu (q)\subseteq \mathbf {db}\), we introduce a vertex 𝜃 with 𝜃 = μ[vars(C)]. We add an edge between two vertices 𝜃 and \(\theta ^{\prime }\) if for some \(i\in \{0,\dots ,k-1\}\), \(\theta (\vec {x}_{i})=\theta ^{\prime }(\vec {x}_{i})\). The graph G can clearly be constructed in logarithmic space (and even in FO). We define a set dbT of T-facts and, for every \(i\in \{0,\dots ,k-1\}\), a set dbi as follows: for all two vertices 𝜃, \(\theta ^{\prime }\) of G, if

$$ \theta^{\prime}(\vec{x}_{0})=\min\left\{\theta^{\prime\prime}(\vec{x}_{0})\mid \theta^{\prime\prime}\in V(G) \text{ belongs to the same strong component as } \theta \right\}, $$

then we add to dbT the fact \({\theta }_{[{{u}\mapsto {\theta ^{\prime }(\vec {x}_{0})}}]}(T)\), and we add to dbi the fact \({\theta }_{[{{u}\mapsto {\theta ^{\prime }(\vec {x}_{0})}}]}(N_{i})\). In this way, every dbi is consistent. Informally, if T is the atom \(T(\underline {u},\vec {w})\), then we add to dbT the T-fact \(T(\underline {\theta ^{\prime }(\vec {x}_{0})},\theta (\vec {w}))\), where \(\theta ^{\prime }(\vec {x}_{0})\) is treated as a single value. This fact represents that 𝜃 belongs to the strong component that is identified by \(\theta ^{\prime }(\vec {x}_{0})\). Since undirected connectivity can be computed in logarithmic space [33], dbT and each dbi can be constructed in logarithmic space.
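The identification of each component by a minimum value can be sketched with a union-find structure. This is only an illustration under assumptions: the name `component_min_labels` is hypothetical, vertices 𝜃 are encoded as pairs whose first entry plays the role of \(\theta (\vec {x}_{0})\) (as for k = 2 in Example 5), and union-find stands in for the logspace connectivity computation cited in the text.

```python
def component_min_labels(vertices, edges, key):
    """Map every vertex to the minimum key(v) over its connected component."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    best = {}
    for v in vertices:
        r = find(v)
        if r not in best or key(v) < best[r]:
            best[r] = key(v)
    return {v: best[find(v)] for v in vertices}


# Vertices are (x, y) pairs; two vertices are adjacent if they agree on x or on y.
vertices = [(3, "p"), (1, "p"), (1, "q"), (5, "r")]
edges = [
    (s, t)
    for s in vertices
    for t in vertices
    if s != t and (s[0] == t[0] or s[1] == t[1])
]
labels = component_min_labels(vertices, edges, key=lambda v: v[0])
# Each T-fact pairs the component identifier u with the vertex it encodes.
t_facts = {(labels[v],) + v for v in vertices}
```

In this example the first three vertices form one component identified by the value 1, while (5, r) is alone and identified by 5.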

Let \(\mathbf {db}_{C}\) be the set of all \(F_{i}\)-facts in db (0 ≤ i ≤ k − 1), and let \(\mathbf {db}_{{\mathsf {shared}}}\mathrel {\mathop :}=\mathbf {db}\setminus \mathbf {db}_{C}\), the part of the database db that is preserved by the reduction. Let \(\mathbf {db}_{N}=\bigcup _{i=0}^{k-1}\mathbf {db}_{i}\). Since \(\mathbf {db}_{N}\) is consistent, \(\mathbf {db}_{{\mathsf {shared}}}\uplus \mathbf {db}_{T}\uplus \mathbf {db}_{N}\) is a legal input to \({\mathsf {CERTAINTY}}({q^{\prime }\cup p})\), where the use of ⊎ (rather than ∪) indicates that the operands of the union are disjoint. Here is an informal visual representation of the reduction:

figure d

We show that the following are equivalent:

  1. Every repair of db satisfies q.

  2. For every \({\mathbf {s}}\in {\mathsf {rset}}({\mathbf {db}_{{\mathsf {shared}}}})\), for every repair \({\mathbf {r}}_{T}\) of \(\mathbf {db}_{T}\), \({\mathbf {s}}\uplus {\mathbf {r}}_{T}\uplus \mathbf {db}_{N}\models q^{\prime }\cup p\).

  3. Every repair of \(\mathbf {db}_{{\mathsf {shared}}}\uplus \mathbf {db}_{T}\uplus \mathbf {db}_{N}\) satisfies \(q^{\prime }\cup p\).

The equivalence 2 ⇔ 3 is straightforward. We show next the equivalence 1 ⇔ 2.

1 ⇒ 2. Let \({\mathbf {s}}\in {\mathsf {rset}}({\mathbf {db}_{{\mathsf {shared}}}})\) and let \({\mathbf {r}}_{T}\) be a repair of \(\mathbf {db}_{T}\). By our construction of \(\mathbf {db}_{T}\), there exists a repair \({\mathbf {r}}_{C}\) of \(\mathbf {db}_{C}\) such that for every valuation 𝜃 over vars(q), if \(\theta (q)\subseteq {\mathbf {s}}\cup {\mathbf {r}}_{C}\), then for some value c, \({\theta }_{[{{u}\mapsto {c}}]}(q^{\prime }\cup p)\subseteq {\mathbf {s}}\cup {\mathbf {r}}_{T}\cup \mathbf {db}_{N}\). Informally, \({\mathbf {r}}_{C}\) contains all (and only) the relevant 1-embeddings of C in \({\mathbf {s}}\cup {\mathbf {r}}_{C}\) that are encoded by the T-facts of \({\mathbf {r}}_{T}\). Since \({\mathbf {s}}\uplus {\mathbf {r}}_{C}\) is a repair of db, by the hypothesis 1, we can assume a valuation 𝜃 over vars(C) such that \(\theta (q)\subseteq {\mathbf {s}}\cup {\mathbf {r}}_{C}\). Consequently, for some value c, \({\theta }_{[{{u}\mapsto {c}}]}(q^{\prime }\cup p)\subseteq {\mathbf {s}}\cup {\mathbf {r}}_{T}\cup \mathbf {db}_{N}\).

2 ⇒ 1. Let r be a repair of db. There exist \({\mathbf {s}}\in {\mathsf {rset}}({\mathbf {db}_{{\mathsf {shared}}}})\) and \({\mathbf {r}}_{C}\in {\mathsf {rset}}({\mathbf {db}_{C}})\) such that \({\mathbf {r}}={\mathbf {s}}\uplus {\mathbf {r}}_{C}\). By the construction of \(\mathbf {db}_{T}\), there exists a repair \({\mathbf {r}}_{T}\) of \(\mathbf {db}_{T}\) such that for every valuation 𝜃 over vars(q), if \({\theta }_{[{{u}\mapsto {c}}]}(q^{\prime }\cup p)\subseteq {\mathbf {s}}\cup {\mathbf {r}}_{T}\cup \mathbf {db}_{N}\) for some c, then \(\theta (q)\subseteq {\mathbf {s}}\cup {\mathbf {r}}_{C}\) (note incidentally that the converse does not generally hold). Informally, for every strong component \(\mathcal {S}\) of the ↪ C-graph of db such that \({\mathbf {s}}\cup ({{\mathbf {r}}_{C}\cap V(\mathcal {S})})\models q\), the set \({\mathbf {r}}_{T}\) encodes one 1-embedding of C in \({\mathbf {s}}\cup ({{\mathbf {r}}_{C}\cap V(\mathcal {S})})\). Here, \(V(\mathcal {S})\) denotes the vertex set of the strong component \(\mathcal {S}\); thus \(V(\mathcal {S})\subseteq \mathbf {db}_{C}\). Since \({\mathbf {s}}\uplus {\mathbf {r}}_{T}\uplus \mathbf {db}_{N}\) is a repair of \(\mathbf {db}_{{\mathsf {shared}}}\uplus \mathbf {db}_{T}\uplus \mathbf {db}_{N}\), it follows by the hypothesis 2 that there exists a valuation 𝜃 over vars(q) such that \({\theta }_{[{{u}\mapsto {c}}]}(q^{\prime }\cup p)\subseteq {\mathbf {s}}\cup {\mathbf {r}}_{T}\cup \mathbf {db}_{N}\) for some c. Consequently, \(\theta (q)\subseteq {\mathbf {s}}\cup {\mathbf {r}}_{C}\).

In the main body of this article, we have shown a program in \({\mathit {SymStratDatalog}}^{\min \limits }\) that computes the reduction.

Proof of the Second Item Assume that the attack graph of q contains no strong cycle and that some initial strong component of the attack graph contains every atom of \(\{F_{0},F_{1},\dots ,F_{k-1}\}\). Since all \(N_{i}\)-facts have mode c, they have no outgoing attacks in the attack graph of \(q^{\prime }\cup p\). Since \({\mathsf {vars}}({N_{i}})\subseteq {\mathsf {vars}}({T})\) for every atom \(N_{i}\in p\), we can limit our analysis to witnesses for attacks that do not contain any \(N_{i}\). Indeed, if \(N_{i}\) occurred in a witness, it could be replaced with T. Let \(\mathcal {S}\) be an initial strong component of the attack graph of q that contains every atom of \(\{F_{0},F_{1},\dots ,F_{k-1}\}\). We will use the following properties:

  (a) For all \(X,Y\subseteq \mathsf {vars}({q})\), if \({\mathcal {K}}({q})\models {X}\rightarrow {Y}\), then \({\mathcal {K}}({q^{\prime }\cup p})\models {X}\rightarrow {Y}\). This holds true because \({\mathcal {K}}({q^{\prime }\cup p})\models {\mathcal {K}}({q})\). To prove the latter claim, note that \({\mathcal {K}}({q})\setminus {\mathcal {K}}({q^{\prime }\cup p})=\{{{\mathsf {key}}({F_{i}})}\rightarrow {{\mathsf {vars}}({F_{i}})}\}_{i=0}^{k-1}\). For all \(i\in \{0,1,\dots ,k-1\}\), we have that \({\mathcal {K}}({\{T,N_{i}\}})\equiv \{{u}\rightarrow {\mathsf {vars}({C})},{{\mathsf {key}}({F_{i}})}\rightarrow {u}\}\) with \({\mathsf {vars}}({F_{i}})\subseteq \mathsf {vars}({C})\). Consequently, \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({F_{i}})}\rightarrow {{\mathsf {vars}}({F_{i}})}\).

  (b) As an immediate consequence of (a), we have \({H}^{+,{q}}\subseteq {H}^{+,{q^{\prime }\cup p}}\) for every \(H\in q\setminus C\).

  (c) For every \(H\in q\setminus C\), if \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }T\), then \(H\in \mathcal {S}\). To show this result, let \(H\in q\setminus C\) such that \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }T\). We can assume without loss of generality the existence of a witness for \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }T\) of the form \(\omega \stackrel {v}{\smallfrown }T\) with \(v\neq u\), where the sequence ω starts with H. We can assume the existence of \(j\in \{0,\dots ,k-1\}\) such that \(v\in {\mathsf {vars}}({F_{j}})\). From the preceding property (b), it follows that the sequence \(\omega \stackrel {v}{\smallfrown }F_{j}\) is a witness for \(H\overset {q}{\rightsquigarrow }F_{j}\). Since \(F_{j}\in \mathcal {S}\), we conclude \(H\in \mathcal {S}\).

  (d) For all \(G,H\in \mathcal {S}\), we have \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({G})}\rightarrow {{\mathsf {key}}({H})}\). To show this result, let \(G,H\in \mathcal {S}\). Since \(\mathcal {S}\) is an initial strong component of the attack graph of q, there exists an elementary attack cycle that contains both G and H. Since the attack graph of q contains no strong cycle, for every edge \(J\overset {q}{\rightsquigarrow }J^{\prime }\) on this attack cycle, we have \({\mathcal {K}}({q})\models {{\mathsf {key}}({J})}\rightarrow {{\mathsf {key}}({J^{\prime }})}\). It can now be easily seen that \({\mathcal {K}}({q})\models {{\mathsf {key}}({G})}\rightarrow {{\mathsf {key}}({H})}\). Finally, by property (a), \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({G})}\rightarrow {{\mathsf {key}}({H})}\).
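Entailments of the form \({\mathcal {K}}(\cdot )\models X\rightarrow Y\) used in these properties can be checked with the standard attribute-closure test for functional dependencies. The sketch below is illustrative only: the names `closure` and `implies` are hypothetical, and the dependency set is an assumed instance of \({\mathcal {K}}({\{T,N_{0},N_{1}\}})\) for the cycle of Example 5, where key(R) = {x}, key(S) = {y}, and vars(C) = {x, y, z}.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under functional dependencies `fds`,
    each given as a pair (lhs, rhs) of sets of variable names."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, X, Y):
    """True if the dependencies in `fds` entail X -> Y."""
    return set(Y) <= closure(X, fds)


# Assumed dependencies: u -> vars(C), key(F_0) -> u, key(F_1) -> u.
fds = [({"u"}, {"x", "y", "z"}), ({"x"}, {"u"}), ({"y"}, {"u"})]
```

With these dependencies, key(R) → vars(R) is recovered via u, as claimed in property (a), and the keys of R and S determine each other, as in property (d).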

We know by [21, Lemma 3.6] that if the attack graph contains a strong cycle, then it contains a strong cycle of length 2. Therefore, to conclude the proof, it suffices to show that every cycle of length 2 in the attack graph of \(q^{\prime }\cup p\) is weak. To this end, assume that the attack graph of \(q^{\prime }\cup p\) contains an attack cycle \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }H\). Then, either \(H\neq T\) or \(J\neq T\) (or both). We assume without loss of generality that \(H\neq T\). We show that the attack cycle \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }H\) is weak. We distinguish three cases.

Case that \(H\stackrel {q^{\prime }\cup p}{\not \rightsquigarrow }T\) (therefore \(J\neq T\)) and \(J\stackrel {q^{\prime }\cup p}{\not \rightsquigarrow }T\):

Then no witness for \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }J\) or \(J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }H\) can contain T. By property (b), \(H\stackrel {q}{\rightsquigarrow }J\stackrel {q}{\rightsquigarrow }H\). Since the attack graph of q contains no strong attack cycle, \({\mathcal {K}}({q})\models {{\mathsf {key}}({H})}\rightarrow {{\mathsf {key}}({J})}\) and \({\mathcal {K}}({q})\models {{\mathsf {key}}({J})}\rightarrow {{\mathsf {key}}({H})}\). Then, by property (a), \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({H})}\rightarrow {{\mathsf {key}}({J})}\) and \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({J})}\rightarrow {{\mathsf {key}}({H})}\). It follows that the attack cycle \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }H\) is weak.

Case that \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }T\):

By property (c), \(H\in \mathcal {S}\). We distinguish two cases.

Case that \(J=T\):

By property (d), \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({H})}\rightarrow {{\mathsf {key}}({F_{0}})}\) and \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({F_{0}})}\rightarrow {{\mathsf {key}}({H})}\). In the following, recall that \(\{u\}=\mathsf {key}(T)\). Since \(\mathcal {K}({q^{\prime }\cup p})\models \mathsf {key}(F_{0}) \rightarrow u\) and \(\mathcal {K}({q^{\prime }\cup p})\models u \rightarrow \mathsf {key}(F_{0})\) hold by the construction of \(q^{\prime }\cup p\), we conclude \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({H})}\rightarrow {u}\) and \({\mathcal {K}}({q^{\prime }\cup p})\models {u}\rightarrow {{\mathsf {key}}({H})}\). It follows that the attack cycle \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }H\) is weak.

Case that \(J\neq T\):

We show that \(J\in \mathcal {S}\) by distinguishing two cases:

  • If \(J\stackrel {q^{\prime }\cup p}{\not \rightsquigarrow }T\), then no witness for \(J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }H\) contains T. Then, by property (b), any witness for \(J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }H\) is also a witness for \(J\overset {q}{\rightsquigarrow }H\), and therefore \(J\in \mathcal {S}\).

  • If \(J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }T\), then \(J\in \mathcal {S}\) by property (c).

From \(H,J\in \mathcal {S}\), it follows \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({H})}\rightarrow {{\mathsf {key}}({J})}\) and \({\mathcal {K}}({q^{\prime }\cup p})\models {{\mathsf {key}}({J})}\rightarrow {{\mathsf {key}}({H})}\) by property (d). It follows that the attack cycle \(H\stackrel {q^{\prime }\cup p}{\rightsquigarrow }J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }H\) is weak.

Case that \(J\stackrel {q^{\prime }\cup p}{\rightsquigarrow }T\) (therefore \(J\neq T\)):

This case is symmetrical to a case that has already been treated.

Appendix D: Proofs of Section 8.1

D.1 Proof of Lemma 11

We will use two helping lemmas.

Lemma 16

[35, Lemma 4.3] Let q be a self-join-free Boolean conjunctive query, and r a consistent database. If α1,α2 are valuations over vars(q) such that \(\alpha _{1}(q)\subseteq {\mathbf {r}}\) and \(\alpha _{2}(q)\subseteq {\mathbf {r}}\), then {α1,α2} satisfies every functional dependency in \({\mathcal {K}}({q})\).

Lemma 17

Let q be a query in sjfBCQ. Let \({Z}\rightarrow {w}\) be a functional dependency that is internal to q. Let \(\vec {z}\) be a sequence of distinct variables such that \(\mathsf {vars}({\vec {z}})=Z\). Let \(q^{\prime }=q\cup \{N^{\mathsf {c}}(\underline {\vec {z}},w)\}\) where N is a fresh relation name of mode c. Then,

  1. there exists a first-order reduction from CERTAINTY(q) to \({\mathsf {CERTAINTY}}({q^{\prime }})\); and

  2. if the attack graph of q contains no strong cycle, then the attack graph of \(q^{\prime }\) contains no strong cycle.

Proof of the first item

By the second condition in Definition 8, we can assume an atom \(F\in q\) such that \(Z\subseteq {\mathsf {vars}}({F})\). Let \(F_{1},F_{2},\dots ,F_{\ell }\) be a sequential proof for \({\mathcal {K}}({q})\models {Z}\rightarrow {w}\) such that for every \(i\in \{1,\dots ,\ell \}\), for every \(u\in Z\cup \{w\}\), \(F_{i}\stackrel {q}{\not \rightsquigarrow }u\). It can be easily seen that for every \(i\in \{0,\dots ,\ell -1\}\), we have

$$ {\mathcal{K}}({\{F_{j}\}_{j=1}^{i}})\models{Z}\rightarrow{{\mathsf{key}}({F_{i+1}})}. \tag{4} $$
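Property (4) can be checked mechanically on small examples. The sketch below assumes the usual notion of a sequential proof, in which the key of each atom must already be determined by Z together with the atoms that precede it, and w must be reached at the end; this definition is not restated in this appendix, so it is an assumption here. The atoms R and S and all function names are hypothetical.

```python
def closure(attrs, fds):
    """Closure of the variable set `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_sequential_proof(atoms, Z, w):
    """`atoms` is a list of (key_vars, all_vars) pairs, in proof order (assumed definition)."""
    known = set(Z)
    for key_vars, all_vars in atoms:
        if not set(key_vars) <= known:
            return False  # the key of this atom is not yet determined
        known |= set(all_vars)
    return w in known


def satisfies_eq4(atoms, Z):
    """Check K({F_1..F_i}) |= Z -> key(F_{i+1}) for every i in 0..l-1."""
    for i, (next_key, _) in enumerate(atoms):
        # atoms[:i] plays the role of F_1..F_i, and atoms[i] the role of F_{i+1}.
        prefix_fds = [(set(k), set(v)) for k, v in atoms[:i]]
        if not set(next_key) <= closure(Z, prefix_fds):
            return False
    return True


# Hypothetical query {R(x; y), S(y; w)} with Z = {x}.
R = ({"x"}, {"x", "y"})
S = ({"y"}, {"y", "w"})
Z = {"x"}
```

With these inputs, the order R, S is accepted by both checks, whereas the order S, R fails already at the first atom because y is not determined by Z alone.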

Let db be a database that is the input to CERTAINTY(q). We repeat the following “purification” step: If for two valuations over vars(q), denoted β1 and β2, we have \(\beta _{1}(q),\beta _{2}(q)\subseteq \mathbf {db}\) and \(\{\beta _{1},\beta _{2}\}\not \models {Z}\rightarrow {w}\), then we remove both the F-block containing β1(F) and the F-block containing β2(F). Note that β1(F) and β2(F) may be key-equal, and hence belong to the same F-block.
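To make the purification step concrete, here is a minimal brute-force Python sketch. It assumes that every term in the query is a variable, represents a fact as a pair of a relation name and a tuple of constants, and represents an atom as a triple of relation name, key length, and variable tuple; these encodings and all function names are hypothetical. The sketch does not verify that \({Z}\rightarrow {w}\) is internal to q, and it repeats the step until it no longer applies rather than using a single first-order query.

```python
def embeddings(query, db):
    """Return every valuation beta over vars(q) with beta(q) contained in db."""
    def extend(i, beta):
        if i == len(query):
            yield dict(beta)
            return
        rel, _key_len, variables = query[i]
        for fact_rel, values in db:
            if fact_rel != rel or len(values) != len(variables):
                continue
            new = dict(beta)
            # Bind each variable, or check consistency with an earlier binding.
            if all(new.setdefault(v, c) == c for v, c in zip(variables, values)):
                yield from extend(i + 1, new)
    return list(extend(0, {}))


def block(db, atom, beta):
    """All facts of db that are key-equal to beta(atom)."""
    rel, key_len, variables = atom
    key = tuple(beta[v] for v in variables[:key_len])
    return {f for f in db if f[0] == rel and f[1][:key_len] == key}


def purify(query, db, f_atom, Z, w):
    """Repeat the purification step for Z -> w until it no longer applies."""
    db = set(db)
    while True:
        betas = embeddings(query, db)
        bad = next(((b1, b2) for b1 in betas for b2 in betas
                    if all(b1[z] == b2[z] for z in Z) and b1[w] != b2[w]), None)
        if bad is None:
            return db
        # Remove the F-block of beta1(F) and the F-block of beta2(F).
        db -= block(db, f_atom, bad[0]) | block(db, f_atom, bad[1])


# Hypothetical query {R(x; y), S(y; w)} with F = R and Z = {x}.
F = ("R", 1, ("x", "y"))
q = [F, ("S", 1, ("y", "w"))]
db = {("R", ("a", "b")), ("R", ("a", "c")), ("R", ("d", "b")),
      ("S", ("b", 1)), ("S", ("c", 2))}
purified = purify(q, db, F, ["x"], "w")
```

In this example the two valuations that map x to a disagree on w, so the R-block with key a is removed, and the remaining database admits a single valuation.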

Assume that we apply this step on \(\mathbf {db}^{\prime }\) and obtain \(\mathbf {db}^{\prime \prime }\). We show that some repair of \(\mathbf {db}^{\prime }\) falsifies q if and only if some repair of \(\mathbf {db}^{\prime \prime }\) falsifies q. The ⇒-direction trivially holds true. For the ⇐-direction, let \({\mathbf {r}}^{\prime \prime }\) be a repair of \(\mathbf {db}^{\prime \prime }\) that falsifies q. Assume, toward a contradiction, that every repair of \(\mathbf {db}^{\prime }\) satisfies q. For every repair r, define Reify(r) as the set of valuations over \(Z\cup \{w\}\) containing \(\theta \) if \(\mathbf {r}\models \theta (q)\). Let

$$ {\mathbf{r}}^{\prime}= \left\{ \begin{array}{ll} {\mathbf{r}}^{\prime\prime}\cup\{\beta_{j}(F)\} \text{ for some } j\in\{1,2\} & \text{if } \beta_{1}(F) \text{ and } \beta_{2}(F) \text{ are key-equal}\\ {\mathbf{r}}^{\prime\prime}\cup\{\beta_{1}(F),\beta_{2}(F)\} & \text{otherwise} \end{array} \right. $$

Note that if \(\beta _{1}(F)\) and \(\beta _{2}(F)\) are key-equal, then we can choose either \({\mathbf {r}}^{\prime }={\mathbf {r}}^{\prime \prime }\cup \{\beta _{1}(F)\}\) or \({\mathbf {r}}^{\prime }={\mathbf {r}}^{\prime \prime }\cup \{\beta _{2}(F)\}\); the actual choice does not matter. Obviously, \({\mathbf {r}}^{\prime }\) is a repair of \(\mathbf {db}^{\prime }\).

Since we assumed that every repair of \(\mathbf {db}^{\prime }\) satisfies q, we can assume a valuation \(\alpha \) over vars(q) such that \(\alpha (q)\subseteq {\mathbf {r}}^{\prime }\). Since \(\alpha (q)\nsubseteq {\mathbf {r}}^{\prime \prime }\) (because \({\mathbf {r}}^{\prime \prime }\not \models q\)), it must be the case that for some \(j\in \{1,2\}\), \(\alpha (F)=\beta _{j}(F)\). From \(\mathsf {vars}({\vec {z}})=Z\subseteq {\mathsf {vars}}({F})\), it follows that \(\alpha (\vec {z})=\beta _{j}(\vec {z})\). From \(\beta _{1}(\vec {z})=\beta _{2}(\vec {z})\), it follows \(\alpha (\vec {z})=\beta _{1}(\vec {z})\) and \(\alpha (\vec {z})=\beta _{2}(\vec {z})\). Since \(\beta _{1}(w)\neq \beta _{2}(w)\), either \(\alpha (w)\neq \beta _{1}(w)\) or \(\alpha (w)\neq \beta _{2}(w)\) (or both). Therefore, we can assume \(b\in \{1,2\}\) such that \(\alpha (w)\neq \beta _{b}(w)\). It will be the case that \({\textsf {Reify}}({{\mathbf {r}}^{\prime }})=\{\alpha [Z\cup \{w\}]\}\) (see footnote 2). Indeed, since \(\alpha \) is an arbitrary valuation over vars(q) such that \(\alpha (q)\subseteq {\mathbf {r}}^{\prime }\), it follows that for all valuations \(\alpha _{1},\alpha _{2}\) over vars(q), if \(\alpha _{1}(q),\alpha _{2}(q)\subseteq {\mathbf {r}}^{\prime }\), then \(\alpha _{1}(\vec {z})=\alpha _{2}(\vec {z})\) and therefore, by Lemma 16 and using that \({\mathcal {K}}({q})\models {Z}\rightarrow {w}\), we have \(\alpha _{1}(w)=\alpha _{2}(w)\).

We now claim that for all \(i\in \{0,1,\dots ,\ell \}\), there exists a pair \(({\mathbf {r}}^{\prime i},\alpha ^{i})\) such that

  1. \({\mathbf {r}}^{\prime i}\) is a repair of \(\mathbf {db}^{\prime }\);

  2. \(\alpha ^{i}\) is a valuation over vars(q) such that \(\alpha ^{i}(q)\subseteq {\mathbf {r}}^{\prime i}\);

  3. \(\alpha ^{i}(\{F_{j}\}_{j=1}^{i})=\beta _{b}(\{F_{j}\}_{j=1}^{i})\) and \(\alpha ^{i}(\vec {z})=\beta _{b}(\vec {z})\) (and therefore \(\alpha ^{i}(\vec {z})=\alpha (\vec {z})\));

  4. \(\alpha ^{i}(w)=\alpha (w)\); and

  5. \({\textsf {Reify}}({{\mathbf {r}}^{\prime i}})=\{\alpha [Z\cup \{w\}]\}\).

The third condition entails \(\{\alpha ^{i},\beta _{b}\}\models {\mathcal {K}}({\{F_{j}\}_{j=1}^{i}})\) for all \(i\in \{0,1,\dots ,\ell \}\). From (4), it follows that for \(i<\ell \), \(\{\alpha ^{i},\beta _{b}\}\models {Z}\rightarrow {{\mathsf {key}}({F_{i+1}})}\). Then, from \(\alpha ^{i}(\vec {z})=\beta _{b}(\vec {z})\), it follows that \(\alpha ^{i}\) and \(\beta _{b}\) agree on all variables of \(\mathsf {key}({F_{i+1}})\).

The proof of the above claim runs by induction on increasing i. For the basis of the induction, \(i=0\), the desired result holds by choosing \({\mathbf {r}}^{\prime 0}={\mathbf {r}}^{\prime }\) and \(\alpha ^{0}=\alpha \).

For the induction step, \(i\rightarrow i+1\), the induction hypothesis is that the desired pair \(({\mathbf {r}}^{\prime i},\alpha ^{i})\) exists. Since \(\alpha ^{i}\) and \(\beta _{b}\) agree on all variables of \(\mathsf {key}({F_{i+1}})\), we have that \(\alpha ^{i}(F_{i+1})\) and \(\beta _{b}(F_{i+1})\) are key-equal. From \(\beta _{b}(q)\subseteq \mathbf {db}^{\prime }\), it follows that \(\beta _{b}(F_{i+1})\in \mathbf {db}^{\prime }\). Let \({\mathbf {r}}^{\prime i+1}=\left ({{\mathbf {r}}^{\prime i}\setminus \{\alpha ^{i}(F_{i+1})\}}\right )\cup \{\beta _{b}(F_{i+1})\}\), which is obviously a repair of \(\mathbf {db}^{\prime }\). Since \(F_{i+1}\stackrel {q}{\not \rightsquigarrow }u\) for all \(u\in Z\cup \{w\}\), \({\textsf {Reify}}({{\mathbf {r}}^{\prime i+1}})\subseteq {\textsf {Reify}}({{\mathbf {r}}^{\prime i}})\) by [21, Lemma B.1]. Since we assumed that every repair of \(\mathbf {db}^{\prime }\) satisfies q, we have that \({\textsf {Reify}}({{\mathbf {r}}^{\prime i+1}})\neq \emptyset \), and therefore \(\textsf {Reify}({{\mathbf {r}}^{\prime i+1}})=\{\alpha [Z\cup \{w\}]\}\). Hence, there exists a valuation \(\alpha ^{i+1}\) over vars(q) such that \(\alpha ^{i+1}(q)\subseteq {\mathbf {r}}^{\prime i+1}\) and \(\alpha ^{i+1}[Z\cup \{w\}]=\alpha [Z\cup \{w\}]\), that is, \(\alpha ^{i+1}(\vec {z})=\alpha (\vec {z})\) and \(\alpha ^{i+1}(w)=\alpha (w)\). Since \(\alpha (\vec {z})=\beta _{b}(\vec {z})\), we have \(\alpha ^{i+1}(\vec {z})=\beta _{b}(\vec {z})\). We have thus shown that the pair \(({\mathbf {r}}^{\prime i+1},\alpha ^{i+1})\) satisfies items 1, 2, 4, and 5 in the above five-item list; we also have shown the second conjunct of item 3. In the next paragraph, we show that \(\alpha ^{i+1}(\{F_{j}\}_{j=1}^{i+1})=\beta _{b}(\{F_{j}\}_{j=1}^{i+1})\), i.e., the first conjunct of item 3.

By the induction hypothesis, \(\alpha ^{i}(\{F_{j}\}_{j=1}^{i})=\beta _{b}(\{F_{j}\}_{j=1}^{i})\) and \(\alpha ^{i}(q)\subseteq {\mathbf {r}}^{\prime i}\), which implies \(\beta _{b}(\{F_{j}\}_{j=1}^{i})\subseteq {\mathbf {r}}^{\prime i}\). Since \({\mathbf {r}}^{\prime i}\) and \({\mathbf {r}}^{\prime i+1}\) include the same set of \(F_{j}\)-facts for every \(j\in \{1,\dots ,i\}\), we have \(\beta _{b}(\{F_{j}\}_{j=1}^{i})\subseteq {\mathbf {r}}^{\prime i+1}\). Since \(\beta _{b}(F_{i+1})\in {\mathbf {r}}^{\prime i+1}\) by construction, we obtain \(\beta _{b}(\{F_{j}\}_{j=1}^{i+1})\subseteq {\mathbf {r}}^{\prime i+1}\). Since also \(\alpha ^{i+1}(\{F_{j}\}_{j=1}^{i+1})\subseteq {\mathbf {r}}^{\prime i+1}\) (because \(\alpha ^{i+1}(q)\subseteq {\mathbf {r}}^{\prime i+1}\)), it is correct to conclude that \(\{\beta _{b},\alpha ^{i+1}\}\models {\mathcal {K}}({\{F_{j}\}_{j=1}^{i+1}})\) by Lemma 16. We are now ready to show that \(\alpha ^{i+1}(F_{j})=\beta _{b}(F_{j})\) for all \(j\in \{1,\dots ,i+1\}\). To this end, pick any \(k\in \{1,\dots ,i+1\}\). We have \({\mathcal {K}}({\{F_{j}\}_{j=1}^{k-1}})\models {Z}\rightarrow {{\mathsf {key}}({F_{k}})}\) by (4). Since \(\{F_{j}\}_{j=1}^{k-1}\) is a subset of \(\{F_{j}\}_{j=1}^{i+1}\), we have \(\{\beta _{b},\alpha ^{i+1}\}\models {\mathcal {K}}({\{F_{j}\}_{j=1}^{k-1}})\), and therefore \(\{\beta _{b},\alpha ^{i+1}\}\models {Z}\rightarrow {{\mathsf {key}}({F_{k}})}\). Then, from \(\alpha ^{i+1}(\vec {z})=\beta _{b}(\vec {z})\) (the second conjunct of item 3), it follows that \(\alpha ^{i+1}\) and \(\beta _{b}\) agree on all variables of \(\mathsf {key}({F_{k}})\). Since \(\alpha ^{i+1}(F_{k}),\beta _{b}(F_{k})\in {\mathbf {r}}^{\prime i+1}\), it must be the case that \(\alpha ^{i+1}(F_{k})=\beta _{b}(F_{k})\). This concludes the induction step.

For the pair \(({\mathbf {r}}^{\prime \ell },\alpha ^{\ell })\), we have that \(\alpha ^{\ell }(\{F_{j}\}_{j=1}^{\ell })=\beta _{b}(\{F_{j}\}_{j=1}^{\ell })\), and therefore, since w occurs in some \(F_{j}\), \(\alpha ^{\ell }(w)=\beta _{b}(w)\). Since also \(\alpha ^{\ell }(w)=\alpha (w)\), we obtain \(\alpha (w)=\beta _{b}(w)\), a contradiction. We conclude by contradiction that some repair of \(\mathbf {db}^{\prime }\) falsifies q. Thus, the purification step described in the paragraph immediately following (4) does not change the answer to CERTAINTY(q).

We repeat the “purification” step until it can no longer be applied. Let the final database be \(\widehat {\mathbf {db}}\). By the above reasoning, we have that every repair of \(\widehat {\mathbf {db}}\) satisfies q if and only if every repair of db satisfies q. Let s be the smallest set of N-facts containing \(N(\underline {\beta (\vec {z})},\beta (w))\) for every valuation \(\beta \) over vars(q) such that \(\beta (q)\subseteq \widehat {\mathbf {db}}\). We show that s is consistent. To this end, let \(\beta _{1},\beta _{2}\) be valuations over vars(q) such that \(\beta _{1}(q),\beta _{2}(q)\subseteq \widehat {\mathbf {db}}\) and \(\beta _{1}(\vec {z})=\beta _{2}(\vec {z})\). If \(\beta _{1}(w)\neq \beta _{2}(w)\), then a purification step can remove the block containing \(\beta _{1}(F)\), contradicting our assumption that no purification step is applicable on \(\widehat {\mathbf {db}}\). We conclude by contradiction that \(\beta _{1}(w)=\beta _{2}(w)\).

Since N has mode c and s is consistent, we have that \(\widehat {\mathbf {db}}\cup {\mathbf {s}}\) is a legal database. It can now be easily seen that every repair of db satisfies q if and only if every repair of \(\widehat {\mathbf {db}}\cup {\mathbf {s}}\) satisfies \(q^{\prime }=q\cup \{N^{\mathsf {c}}(\underline {\vec {z}},w)\}\).

It remains to be argued that the reduction is in FO, i.e., that the result of the repeated “purification” step can be obtained by a single first-order query. Let \(\mathsf {vars}({q})=\{x_{1},\dots ,x_{n}\}\). Let \(q^{*}(x_{1},\dots ,x_{n})\mathrel {\mathop :}=\bigwedge _{G\in q}G\) be the quantifier-free part of the first-order formula expressing the Boolean query q. For every \(i\in \{1,\dots ,n\}\), let \(x_{i}^{\prime }\) be a fresh variable. Let \(\vec {u}\) be a sequence of distinct variables such that \(\mathsf {vars}({\vec {u}})={\mathsf {vars}}({F})\). The following query finds all F-facts whose blocks can be removed:

$$\left \{\vec{u}\mid\exists^{*}\left({q^{*}(x_{1},\dots,x_{n})\land q^{*}(x_{1}^{\prime},\dots,x_{n}^{\prime})\land\left({\bigwedge_{z\in Z}z=z^{\prime}}\right)\land w\neq w^{\prime}}\right)\right\},$$

where the existential quantification ranges over all variables not in \(\vec {u}\). The F-facts that are to be preserved are not key-equal to a fact returned by the preceding query and can obviously be computed in FO. This concludes the proof of the first item.

Proof of the second item

Assume that the attack graph of q contains no strong cycle. We will show that the attack graph of \(q^{\prime }\) contains no strong cycle either. By the second item in Definition 8, we can assume an atom \(G\in q\) such that \(Z\subseteq {\mathsf {vars}}({G})\). Note that the atom \(N^{\mathsf {c}}(\underline {\vec {z}},w)\) has no outgoing attacks because its mode is c. It is sufficient to show that for every \(F,H\in q\), if there exists a witness for \(F\stackrel {q^{\prime }}{\rightsquigarrow }H\), then there exists a witness for \(F\stackrel {q^{\prime }}{\rightsquigarrow }H\) that does not contain \(N^{\mathsf {c}}(\underline {\vec {z}},w)\). To this end, assume that a witness for \(F\stackrel {q^{\prime }}{\rightsquigarrow }H\) contains

$$ \dotsm F^{\prime}\stackrel{u^{\prime}}{\smallfrown}N^{\mathsf{c}}(\underline{\vec{z}},w)\stackrel{u^{\prime\prime}}{\smallfrown}F^{\prime\prime}\dotsm, \tag{5} $$

where \(u^{\prime }\) and \(u^{\prime \prime }\) are distinct variables. We can assume without loss of generality that this is the only occurrence of \(N^{\mathsf {c}}(\underline {\vec {z}},w)\) in the witness. In this case, we have \(F\overset {q}{\rightsquigarrow }u^{\prime }\). If \(u^{\prime },u^{\prime \prime }\in Z\), then we can replace \(N^{\mathsf {c}}(\underline {\vec {z}},w)\) with G. So the only nontrivial case is where either \(u^{\prime }=w\) or \(u^{\prime \prime }=w\) (but not both). Then, it must be the case that \({\mathcal {K}}({q^{\prime }\setminus \{F\}})\not \models {{\mathsf {key}}({F})}\rightarrow {w}\), and therefore also

$$ {\mathcal{K}}({q\setminus\{F\}})\not\models{{\mathsf{key}}({F})}\rightarrow{w}. \tag{6} $$

Since \({Z}\rightarrow {w}\) is internal to q, there exists a sequential proof for \({\mathcal {K}}({q})\models {Z}\rightarrow {w}\) such that no atom in the proof attacks a variable in \(Z\cup \{w\}\). Let \(J_{1},J_{2},\dots ,J_{\ell }\) be a shortest such proof. Because \(F\overset {q}{\rightsquigarrow }u^{\prime }\) and \(u^{\prime } \in Z \cup \{w\}\), it must be that \(F\not \in \{J_{1},\dots ,J_{\ell }\}\). We can assume that w occurs at a non-primary-key position in \(J_{\ell }\). Because of (6), we can assume the existence of a variable \(v\in \mathsf {key}({J_{\ell }})\) such that \({\mathcal {K}}({q\setminus \{F\}})\not \models {{\mathsf {key}}({F})}\rightarrow {v}\). If \(v\notin Z\), then there exists \(k<\ell \) such that v occurs at a non-primary-key position in \(J_{k}\). Again, we can assume a variable \(v^{\prime }\in {\mathsf {key}}({J_{k}})\) such that \({\mathcal {K}}({q\setminus \{F\}})\not \models {{\mathsf {key}}({F})}\rightarrow {v^{\prime }}\). By repeating the same reasoning, there exists a sequence

$$ \stackrel{z_{i_{0}}}{\smallfrown}J_{i_{0}} \stackrel{z_{i_{1}}}{\smallfrown}J_{i_{1}} \stackrel{z_{i_{2}}}{\smallfrown} {\dots} \stackrel{z_{i_{m}}}{\smallfrown}J_{i_{m}} \stackrel{w}{\smallfrown} $$

where \(1\leq i_{0}<i_{1}<\dotsm <i_{m}=\ell \) such that

  • \(z_{i_{0}}\in Z\);

  • for all \(j\in \{0,\dots ,m\}\), \({\mathcal {K}}({q\setminus \{F\}})\not \models {{\mathsf {key}}({F})}\rightarrow {z_{i_{j}}}\); and

  • for all \(j\in \{1,\dots ,m\}\), \(z_{i_{j}}\in {\mathsf {vars}}({J_{i_{j-1}}})\cap {\mathsf {vars}}({J_{i_{j}}})\). In particular, \(z_{i_{j}}\in {\mathsf {key}}({J_{i_{j}}})\).

We can assume \(G\in q\) such that \(Z\subseteq {\mathsf {vars}}({G})\). Let \(u\in \{u^{\prime },u^{\prime \prime }\}\) such that \(u\neq w\). Thus, \(\{u,w\}=\{u^{\prime },u^{\prime \prime }\}\). It can now be easily seen that a witness for \(F\stackrel {q^{\prime }}{\rightsquigarrow }H\) can be obtained by replacing \(N^{\mathsf {c}}(\underline {\vec {z}},w)\) in (5) with the following sequence or its reverse:

$$ \stackrel{u}{\smallfrown}G \stackrel{z_{i_{0}}}{\smallfrown}J_{i_{0}} \stackrel{z_{i_{1}}}{\smallfrown}J_{i_{1}} \stackrel{z_{i_{2}}}{\smallfrown} {\dots} \stackrel{z_{i_{m}}}{\smallfrown}J_{i_{m}} \stackrel{w}{\smallfrown} $$

This concludes the proof of Lemma 17. □

The proof of Lemma 11 is now straightforward.

Proof of Lemma 11

Repeated application of Lemma 17. □

D.2 Proof of Lemma 12

We will use the following helping lemma.

Lemma 18

Let q be a query in sjfBCQ such that q is saturated and the attack graph of q contains no strong cycle. Let \(\mathcal {S}\) be an initial strong component in the attack graph of q with \(\left |{\mathcal {S}}\right |\geq 2\). For every atom \(F \in \mathcal {S}\), there exists an atom \(H \in \mathcal {S}\) such that \(F \stackrel {\mathsf {{~}_{M}}}{\longrightarrow } H\).

Proof

Assume \(F \in \mathcal {S}\). Since F belongs to an initial strong component with at least two atoms, there exists \(G \in \mathcal {S}\) such that \(F\overset {q}{\rightsquigarrow }G\) and the attack is weak. Therefore, \({\mathcal {K}}({q})\models {{\mathsf {key}}({F})}\rightarrow {{\mathsf {key}}({G})}\). It follows that \({\mathcal {K}}({q\setminus \{F\}})\models {{\mathsf {vars}}({F})}\rightarrow {{\mathsf {key}}({G})}\). Let \(\sigma = H_{1}, H_{2}, \dots , H_{\ell }\) be a sequential proof for \({\mathcal {K}}({q\setminus \{F\}})\models {{\mathsf {vars}}({F})}\rightarrow {{\mathsf {key}}({G})}\), where \(F \notin \{H_{1}, \dots , H_{\ell }\}\). We can assume without loss of generality that \(H_{\ell }=G\).

Let j be the smallest index in \(\{1, \dots , \ell \}\) such that \(H_{j} \in \mathcal {S}\). Since \(H_{\ell } \in \mathcal {S}\), such an index always exists. Then, \(\sigma = H_{1}, H_{2}, \dots , H_{j-1}\) is a sequential proof for \({\mathcal {K}}({q\setminus \{F\}})\models {{\mathsf {vars}}({F})}\rightarrow {{\mathsf {key}}({H_{j}})}\) (observe that this proof may be empty). By our choice of j, for every \(i\in \{1,\dots ,j-1\}\), we have \(H_{i} \notin \mathcal {S}\), and hence \(H_{i}\) cannot attack F or \(H_{j}\) (since \(\mathcal {S}\) is an initial strong component). It follows that no atom in \(\sigma \) attacks a variable in \(\mathsf {vars}({F})\cup \mathsf {key}({H_{j}})\). Since q is saturated, this implies that \({\mathcal {K}}({{q}^{\mathsf {cons}}})\models {{\mathsf {vars}}({F})}\rightarrow {{\mathsf {key}}({H_{j}})}\), and so \(F \stackrel {\mathsf {{~}_{M}}}{\longrightarrow } H_{j}\). □

The proof of Lemma 12 can now be given.

Proof of Lemma 12

Starting from some atom \(F_{0} \in \mathcal {S}\), by applying repeatedly Lemma 18, we can create an infinite sequence \(F_{0} \stackrel {\mathsf {{~}_{M}}}{\longrightarrow } F_{1} \stackrel {\mathsf {{~}_{M}}}{\longrightarrow } F_{2} \stackrel {\mathsf {{~}_{M}}}{\longrightarrow } \dotsm \) such that for every \(i\geq 1\), \(F_{i} \in \mathcal {S}\) and \(F_{i}\neq F_{i+1}\). Since the atoms in \(\mathcal {S}\) are finitely many, there will exist some i, j such that \(i<j\) and \(F_{i}=F_{j+1}\). It follows that the M-graph of q contains a cycle all of whose atoms belong to \(\mathcal {S}\). □
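The pigeonhole argument in this proof is constructive: following one outgoing M-edge per atom must revisit an atom after finitely many steps. The following minimal Python sketch illustrates this, assuming a hypothetical `successor` function that returns, for each atom of \(\mathcal {S}\), one atom guaranteed by Lemma 18; the atom names and the edge table are invented for the example.

```python
def find_cycle(start, successor):
    """Follow successor edges from `start` until a node repeats; return the cycle found."""
    position = {}  # node -> index of its first visit on the path
    path = []
    node = start
    while node not in position:
        position[node] = len(path)
        path.append(node)
        node = successor(node)
    # `node` was visited before: the path from that visit onward is a cycle.
    return path[position[node]:]


# Hypothetical M-edges inside an initial strong component with three atoms.
m_edge = {"F0": "F1", "F1": "F2", "F2": "F1"}
cycle = find_cycle("F0", m_edge.__getitem__)
```

Here the walk F0, F1, F2 returns to F1, so the cycle consists of F1 and F2; the starting atom itself need not lie on the cycle.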


About this article


Cite this article

Koutris, P., Wijsen, J. Consistent Query Answering for Primary Keys in Datalog. Theory Comput Syst 65, 122–178 (2021). https://doi.org/10.1007/s00224-020-09985-6


  • DOI: https://doi.org/10.1007/s00224-020-09985-6
