Approximation and inapproximability results on computing optimal repairs

  • Regular Paper
The VLDB Journal

Abstract

Computing optimal subset repairs and optimal update repairs of an inconsistent database has a wide range of applications, and both are becoming standalone research problems. However, these problems have not been well studied in terms of either inapproximability or approximation algorithms. In this paper, we prove a new, tighter inapproximability bound for computing optimal subset repairs. We show that it is frequently NP-hard to approximate an optimal subset repair within a factor better than 143/136. We develop an algorithm for computing optimal subset repairs with an approximation ratio \((2-1/2^{\sigma -1})\), where \(\sigma \) is the number of functional dependencies. We improve it when the database contains a large number of quasi-Turán clusters. We then extend our work to computing optimal update repairs. We show that it is NP-hard to approximate an optimal update repair within a factor better than 143/136 for representative cases. We further develop an approximation algorithm for computing optimal update repairs with an approximation ratio mlc(\({\Sigma }\))\((2-1/2^{\sigma -1})\), where mlc(\({\Sigma }\)) depends on the given functional dependencies. We conduct experiments on real data to examine the performance and the effectiveness of our proposed approximation algorithms.

Notes

  1. Fact-wise reduction is a kind of strict reduction that is formally defined in [49]. For any \(\epsilon >0\), if problem B has a \((1+\epsilon )\)-approximation, then problem A has a \((1+\epsilon )\)-approximation whenever there is a fact-wise reduction from A to B.

  2. https://data.world/datafiniti/consumer-reviews-of-amazon-products

  3. https://data.world/datafiniti/grammar-and-online-product-reviews

  4. http://www.geonames.org/

  5. http://results.openaddresses.io/

  6. https://dblp.org/xml/

  7. https://www.gnu.org/software/glpk/

  8. An implicant of an attribute A is a set X of attributes such that \(X\rightarrow A\) can be derived from \({\Sigma }\). A core implicant of A is a minimal set C of attributes that hits every implicant of A (i.e., \(X\cap C\ne \varnothing \) for each implicant X of A). A minimum core implicant of A is a core implicant of A with the smallest cardinality.

References

  1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases: The Logical Level. Addison-Wesley, Boston (1995)

  2. Afrati, F.N., Kolaitis, P.G.: Repair checking in inconsistent databases: algorithms and complexity. In: ICDT, pp. 31–41 (2009)

  3. Amini, O., Pérennes, S., Sau, I.: Hardness and approximation of traffic grooming. Theor. Comput. Sci. 410(38–40), 3751–3760 (2009)

  4. Arenas, M., Bertossi, L., Chomicki, J.: Consistent query answers in inconsistent databases. In: PODS, pp. 68–79 (1999)

  5. Arenas, M., Bertossi, L., Chomicki, J.: Answer sets for consistent query answering in inconsistent databases. Theor. Pract. Log. Prog. 3(4), 393–424 (2003)

  6. Arenas, M., Bertossi, L., Chomicki, J., He, X., Raghavan, V., Spinrad, J.: Scalar aggregation in inconsistent databases. Theor. Comput. Sci. 296(3), 405–434 (2003)

  7. Assadi, A., Milo, T., Novgorodov, S.: \(\text{DANCE}\): data cleaning with constraints and experts. In: ICDE, pp. 1409–1410 (2017)

  8. Bar-Yehuda, R., Even, S.: A linear-time approximation algorithm for the weighted vertex cover problem. J. Algorithms 2(2), 198–203 (1981)

  9. Bellare, M., Goldwasser, S., Lund, C., Russell, A.: Efficient probabilistically checkable proofs and applications to approximations. In: STOC, pp. 294–304 (1993)

  10. Bergman, M., Milo, T., Novgorodov, S., Tan, W.C.: \(\text{ QOCO }\): a query oriented data cleaning system with oracles. PVLDB 8(12), 1900–1903 (2015)

  11. Bertossi, L.: Database repairs and consistent query answering: origins and further developments. In: PODS, pp. 48–58 (2019)

  12. Bertossi, L.: Repair-based degrees of database inconsistency. In: LPNMR, pp. 195–209 (2019)

  13. Bertossi, L., Bravo, L., Franconi, E., Lopatenko, A.: Fixing numerical attributes under integrity constraints. In: Proceedings of International Symposium on Database Programming Languages (DBPL 05). Springer LNCS, vol. 3774, pp. 262–278 (2005)

  14. Bertossi, L., Bravo, L., Franconi, E., Lopatenko, A.: The complexity and approximation of fixing numerical attributes in databases under integrity constraints. Inf. Syst. 33(4), 407–434 (2008)

  15. Bohannon, P., Fan, W., Flaster, M., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD, pp. 143–154 (2005)

  16. Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for data cleaning. In: ICDE, pp. 746–755 (2007)

  17. Boria, N., Croce, F.D., Paschos, V.T.: On the max min vertex cover problem. Discrete Appl. Math. 196, 62–71 (2015)

  18. Caniupán, M., Bertossi, L.: The consistency extractor system: answer set programs for consistent query answering in databases. Data Knowl. Eng. 69(6), 545–572 (2010)

  19. Cardinal, J., Karpinski, M., Schmied, R., Viehmann, C.: Approximating vertex cover in dense hypergraphs. J. Discrete Algorithms 13, 67–77 (2012). https://doi.org/10.1016/j.jda.2012.01.003

  20. Caruccio, L., Vincenzo, D., Polese, G.: Mining relaxed functional dependencies from data. Data Min. Knowl. Discov. (2019)

  21. Chen, J., Kanj, I.A., Xia, G.: Improved upper bounds for vertex cover. Theor. Comput. Sci. 411(40), 3736–3756 (2010)

  22. Chiang, F., Miller, R.J.: A unified model for data and constraint repair. In: ICDE, pp. 446–457 (2011)

  23. Chomicki, J., Marcinkowski, J.: Minimal-change integrity maintenance using tuple deletions. Inf. Comput. 197(1–2) (2005)

  24. Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: ICDE, pp. 458–469 (2013)

  25. Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: overview and emerging challenges. In: SIGMOD, pp. 2201–2206 (2016)

  26. Chvatal, V.: A greedy heuristic for the set-covering problem. Math. Oper. Res. 4(3), 233–235 (1979). https://doi.org/10.1287/moor.4.3.233

  27. Cohen, M.B., Lee, Y.T., Song, Z.: Solving linear programs in the current matrix multiplication time. J. ACM 68(1), 1–39 (2021)

  28. Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. PVLDB 7(6), 315–325 (2007)

  29. Crescenzi, P.: A short guide to approximation preserving reductions. In: CCC, pp. 262–273 (1997)

  30. Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I.F., Ouzzani, M., Tang, N.: \(\text{ NADEEF }\): a commodity data cleaning system. In: SIGMOD, pp. 541–552 (2013)

  31. De Sa, C., Ilyas, I.F., Kimelfeld, B., Ré, C., Rekatsinas, T.: A formal framework for probabilistic unclean databases. In: ICDT, pp. 26–28 (2019)

  32. Dixit, A.A.: \(\text{ CAvSAT }\): a system for query answering over inconsistent databases. In: SIGMOD, pp. 1823–1825 (2019)

  33. Dixit, A.A., Kolaitis, P.G.: A \(\text{ SAT }\)-based system for consistent query answering. In: SAT, pp. 117–135 (2019)

  34. Flesca, S., Furfaro, F., Parisi, F.: Consistent query answers on numerical databases under aggregate constraints. In: DBPL, pp. 279–294 (2005)

  35. Flesca, S., Furfaro, F., Parisi, F.: Querying and repairing inconsistent numerical databases. ACM Trans. Database Syst. (2010). https://doi.org/10.1145/1735886.1735893

  36. Franconi, E., Palma, A.L., Leone, N., Perri, S., Scarcello, F.: Census data repair: a challenging application of disjunctive logic programming. In: Logic for Programming, Artificial Intelligence, and Reasoning, pp. 561–578 (2001)

  37. Gartner.: Vendor Rating Service. https://www.gartner.com/en/research/methodologies/vendor-rating. Accessed 15 May 2020

  38. Geerts, F., Mecca, G., Papotti, P., Santoro, D.: The llunatic data-cleaning framework. PVLDB 6(9), 625–636 (2013)

  39. Golab, L., Ilyas, I.F., Beskales, G., Galiullin, A.: On the relative trust between inconsistent data and inaccurate constraints. In: ICDE, pp. 541–552 (2013)

  40. Guruswami, V., Khot, S.: Hardness of \(\text{ M }\)ax \(3\text{ SAT }\) with no mixed clauses. In: CCC, pp. 154–162 (2005)

  41. Kann, V.: Maximum bounded 3-dimensional matching is \(\text{ MAX } \text{ SNP }\)-complete. Inf. Process. Lett. 37(1), 27–35 (1991)

  42. Karakostas, G.: A better approximation ratio for the vertex cover problem. ACM Trans. Algorithms 5(4), 41:1-41:8 (2009)

  43. Khot, S.: On the unique games conjecture. In: FOCS, p. 3 (2005)

  44. Khot, S., Regev, O.: Vertex cover might be hard to approximate to within 2-\(\epsilon \). J. Comput. Syst. Sci. 74(3), 335–349 (2008)

  45. Kivinen, J., Mannila, H.: Approximate inference of functional dependencies from relations. Theor. Comput. Sci. 149(1), 129–149 (1995)

  46. Kolahi, S., Lakshmanan, L.V.S.: On approximating optimum repairs for functional dependency violations. In: ICDT, pp. 53–62 (2009)

  47. Kolaitis, P.G., Pema, E., Tan, W.C.: Efficient querying of inconsistent databases with binary integer programming. PVLDB 6(6), 397–408 (2013)

  48. Koutris, P., Wijsen, J.: Consistent query answering for self-join-free conjunctive queries under primary key constraints. ACM Trans. Database Syst. 42(2), 1–45 (2017)

  49. Livshits, E., Kimelfeld, B., Roy, S.: Computing optimal repairs for functional dependencies. ACM Trans. Database Syst. 45(1), 1–46 (2020)

  50. Lopatenko, A., Bertossi, L.: Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics. In: ICDT, pp. 179–193 (2007)

  51. Miao, D., Cai, Z., Li, J., Gao, X., Liu, X.: The computation of optimal subset repairs. Proc. VLDB Endow. 13(11), 2061–2074 (2020)

  52. Nemhauser, G.L., Trotter, L.E.: Vertex packings: structural properties and algorithms. Math. Program. 8(4), 232–248 (1975)

  53. Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holo\(\text{ C }\)lean: holistic data repairs with probabilistic inference. PVLDB 10(11), 1190–1201 (2017)

  54. Salimi, B., Rodriguez, L., Howe, B., Suciu, D.: Interventional fairness: causal database repair for algorithmic fairness. In: SIGMOD, pp. 793–810 (2019)

  55. Wijsen, J.: Condensed representation of database repairs for consistent query answering. In: ICDT, pp. 378–393 (2003)

  56. Wijsen, J.: Database repairing using updates. In: SIGMOD, vol. 30 (2005)

  57. Wijsen, J.: On the consistent rewriting of conjunctive queries under primary key constraints. Inf. Syst. 34(7), 578–601 (2009)

  58. Wijsen, J.: Certain conjunctive query answering in first-order logic. ACM Trans. Database Syst. 37(2), 1–35 (2012)

  59. Wijsen, J.: User-guided repairing of inconsistent knowledge bases. In: Proceedings of the 21th International Conference on Extending Database Technology (2018). https://doi.org/10.5441/002/EDBT.2018.13

  60. Wijsen, J.: Foundations of query answering on inconsistent databases. SIGMOD Rec. 48(3), 6–16 (2019)

  61. Zehavi, M.: Maximum minimal vertex cover parameterized by vertex cover. SIAM J. Discrete Math. 31(4), 2440–2456 (2017)

Acknowledgements

This work is partly supported by the National Natural Science Foundation of China (NSFC) Grant Nos. 61972110, 61832003, U1811461, and U19A2059.

Author information

Corresponding author

Correspondence to Dongjing Miao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proofs for OPTSR

Lemma 1

Let \(\phi \) be the input expression of problem MAX-NM-E3SAT and \(N_v\), \(N_c\) be the number of variables and clauses in \(\phi \). Let \(\mathcal {N}_{\max }(\phi )\) denote the maximum number of clauses that can be satisfied in \(\phi \). We have \(\mathcal {N}_{\max }(\phi )\ge \frac{7}{8}N_c\).

Proof

There are \(N_v\) variables in \(\phi \) and hence \(2^{N_v}\) possible variable assignments. For each assignment \(\tau \), the number of clauses in \(\phi \) satisfied by \(\tau \) is fixed, denoted \(\mathcal {N}(\phi )\). Each clause contains exactly 3 variables, so 7/8 of the \(2^{N_v}\) assignments satisfy the clause while 1/8 of them do not. Summing over all assignments \(\tau \), we therefore obtain

$$\begin{aligned} \sum \mathcal {N}(\phi )=\frac{7}{8}\cdot N_c \cdot 2^{N_v}. \end{aligned}$$

Then, we can see that the expectation of \(\mathcal {N}(\phi )\) is

$$\begin{aligned}{\mathbf {E}}(\mathcal {N}(\phi ))=\frac{\sum \mathcal {N}(\phi )}{2^{N_v}}=\frac{7}{8}{N_c}.\end{aligned}$$

Therefore, there exists an assignment \(\tau _{0}\) that satisfies at least \(\mathcal {N}_{avg}(\phi )=\frac{7}{8}N_c\) clauses, since some assignment must reach the average. As \(\mathcal {N}_{\max }(\phi )\) is at least the number of clauses satisfied by \(\tau _{0}\), we conclude that \(\mathcal {N}_{\max }(\phi )\ge \frac{7}{8}N_c\). \(\square \)
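The averaging argument of Lemma 1 can be checked by brute force on a small instance. The sketch below is only an illustration: the clause list is a hypothetical toy input (not taken from the paper), with literal `+j` standing for \(x_j\) and `-j` for \({\bar{x}}_j\).

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy instance: each clause has three literals over distinct
# variables; literal +j means x_j and literal -j means NOT x_j.
clauses = [(1, 2, 3), (-1, -2, -4), (2, 3, 4), (-1, -3, -4)]
variables = sorted({abs(lit) for clause in clauses for lit in clause})


def satisfied(assignment, clause):
    """Return True if at least one literal of the clause is true."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


# Count the satisfied clauses under each of the 2^{N_v} assignments.
counts = []
for bits in product([False, True], repeat=len(variables)):
    assignment = dict(zip(variables, bits))
    counts.append(sum(satisfied(assignment, c) for c in clauses))

n_c = len(clauses)
average = Fraction(sum(counts), len(counts))  # expectation of N(phi)
best = max(counts)                            # N_max(phi)

print("average:", average, "expected:", Fraction(7, 8) * n_c, "best:", best)
```

The average equals \(\frac{7}{8}N_c\) exactly because every clause has three distinct variables, and the maximum can never fall below the average.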

Lemma 2

Let FD set be one of \({{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}\), \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\) and \({{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}\). Given an \(\epsilon >0\), it is NP-hard to compute a \((143/136-\epsilon )\)-optimal S-repair for OPTSR even if every tuple in the instance has weight 1.

Proof

Here we reduce the problem MAX-NM-E3SAT to Case 1 and Case 2. We use \(\phi \) to denote the input expression of problem MAX-NM-E3SAT, and \(N_v\), \(N_c\) to denote the number of variables and clauses in \(\phi \).

Each reduction builds \(I_\phi \) of the relation schema R(ABC), in which every tuple has weight 1, for each expression \(\phi \).

Reduction for \({{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}\). A tuple is inserted into \(I_\phi \) for each clause \(c_i\) and each variable \(x_j\) in it as follows:

  1. If \(c_i\) contains a positive literal of variable \(x_j\), insert \((c_i,x_j,x_j)\) into \(I_\phi \),

  2. If \(c_i\) contains a negative literal of variable \(x_j\), insert \((c_i,x_j,{\bar{x}}_j)\) into \(I_\phi \).

In total, \(3N_c\) tuples are created. Intuitively, \(A\rightarrow {B}\) guarantees that exactly one of its three corresponding tuples survives in any S-repair once a clause is satisfied. \(B\rightarrow {C}\) guarantees that the assignment of each variable is valid, i.e., either true or false, but not both.

From the perspective of the problem OPTSR, \(A \rightarrow {B}\) guarantees that any S-repair J of \(I_\phi \) contains at most one of the three tuples with the same value \(c_i\) on attribute A, where \(1\le i\le N_c\). \(B\rightarrow {C}\) guarantees that any S-repair J of \(I_\phi \) contains either \((c_i,x_j,x_j)\) or \((c_i,x_j,{\bar{x}}_j)\) for \(1\le i\le N_c\) and \(1\le j\le N_v\).

From every S-repair J, we derive a variable assignment \(\tau \), s.t. \(\tau (x_j)=1\) if there exists a tuple \(t\in {J}\) in the form \((c_i,x_j,x_j)\), \(1 \le i \le N_c\), and \(\tau (x_j)=0\) otherwise.

\(\tau (\cdot )\) is valid, i.e., \(\tau (x_j)\) is either 0 or 1, but not both, for each \(1\le j\le N_v\), because J satisfies the functional dependencies. Let \(\tau _{\max }\) be an optimal variable assignment, let \(\mathcal {N}(\phi )\) denote the number of clauses in \(\phi \) satisfied by \(\tau \), and let \(\mathcal {N}_{\max }(\phi )\) denote the number of clauses in \(\phi \) satisfied by \(\tau _{\max }\). Then, for an optimal S-repair \(J^*\), we have

$$\begin{aligned} \mathcal {N}_{\max }(\phi ) = |I_\phi | - C_s(J^*,I_{\phi }) \end{aligned}$$

and for any solution J of \(I_\phi \),

$$\begin{aligned} \mathcal {N}(\phi ) \ge |I_\phi | - C_s(J,I_{\phi }). \end{aligned}$$
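The correspondence between surviving tuples and satisfied clauses can be illustrated on a tiny input. The following is a minimal brute-force sketch, assuming unit weights and a hypothetical three-clause instance without mixed clauses; the helper names are illustrative and do not come from the paper.

```python
from itertools import combinations, product

# Hypothetical MAX-NM-E3SAT instance: every clause is all-positive or
# all-negative; literal +j is x_j and literal -j is NOT x_j.
clauses = {"c1": (1, 2, 3), "c2": (-1, -2, -3), "c3": (-1, -2, -4)}


def build_instance(clauses):
    """Create one tuple (A, B, C) per clause/literal pair, as in the reduction."""
    instance = []
    for name, literals in clauses.items():
        for lit in literals:
            var = f"x{abs(lit)}"
            value = var if lit > 0 else f"not_{var}"
            instance.append((name, var, value))
    return instance


def consistent(tuples):
    """Check the FDs A -> B and B -> C on a collection of (A, B, C) tuples."""
    for s, t in combinations(tuples, 2):
        if s[0] == t[0] and s[1] != t[1]:  # violates A -> B
            return False
        if s[1] == t[1] and s[2] != t[2]:  # violates B -> C
            return False
    return True


def largest_consistent_subset(instance):
    """Brute-force the number of tuples kept by an optimal S-repair."""
    for size in range(len(instance), -1, -1):
        if any(consistent(sub) for sub in combinations(instance, size)):
            return size
    return 0


def max_satisfied(clauses):
    """Brute-force the maximum number of simultaneously satisfied clauses."""
    variables = sorted({abs(l) for lits in clauses.values() for l in lits})
    best = 0
    for bits in product([False, True], repeat=len(variables)):
        tau = dict(zip(variables, bits))
        hits = sum(any(tau[abs(l)] == (l > 0) for l in lits)
                   for lits in clauses.values())
        best = max(best, hits)
    return best


instance = build_instance(clauses)
kept = largest_consistent_subset(instance)
deleted = len(instance) - kept  # C_s(J*, I_phi) under unit weights
print(len(instance), kept, deleted, max_satisfied(clauses))
```

On this input the number of kept tuples coincides with the maximum number of satisfied clauses, matching \(\mathcal {N}_{\max }(\phi ) = |I_\phi | - C_s(J^*,I_{\phi })\). The exhaustive search is exponential and only meant for toy sizes.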

Reduction for \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\). A tuple is inserted into \(I_\phi \) for each clause \(c_i\) and each variable \(x_j\) in it as follows:

  1. If \(c_i\) contains a positive literal of variable \(x_j\), insert \((c_i,x_j,x_j)\) into \(I_\phi \),

  2. If \(c_i\) contains a negative literal of variable \(x_j\), insert \((c_i,{\bar{x}}_j,x_j)\) into \(I_\phi \).

In total, \(3N_c\) tuples are created. Similarly, \(A\rightarrow {B}\) guarantees that exactly one of its three corresponding tuples survives in any S-repair once a clause is satisfied, and \(C\rightarrow {B}\) guarantees the consistency of variable assignment.

Similarly, from the perspective of the problem OPTSR, \(A \rightarrow {B}\) guarantees that any S-repair J of \(I_\phi \) contains at most one of the three tuples with the same value \(c_i\) on attribute A, and \(C\rightarrow {B}\) guarantees that any S-repair J of \(I_\phi \) contains either \((c_i,x_j,x_j)\) or \((c_i,{\bar{x}}_j,x_j)\) for \(1\le i\le N_c\) and \(1\le j\le N_v\).

From every S-repair J, we derive a variable assignment \(\tau \), s.t. \(\tau (x_j)=1\) if there exists a tuple \(t\in {J}\) in the form \((c_i,x_j,x_j)\), \(1 \le i \le N_c\), and \(\tau (x_j)=0\) otherwise.

\(\tau (\cdot )\) is valid, i.e., \(\tau (x_j)\) is either 0 or 1, but not both, for each \(1\le j\le N_v\), because J satisfies the functional dependencies. We adopt the same definitions of \(\tau _{\max }\), \(\mathcal {N}(\phi )\) and \(\mathcal {N}_{\max }(\phi )\) as in the reduction for \({{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}\). Then, we have the same formula

$$\begin{aligned} \mathcal {N}_{\max }(\phi ) = |I_\phi | - C_s(J^*,I_{\phi }) \end{aligned}$$

and for any solution J of \(I_\phi \),

$$\begin{aligned} \mathcal {N}(\phi ) \ge |I_\phi | - C_s(J,I_{\phi }). \end{aligned}$$

Reduction for \({{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}\). A tuple is inserted into \(I_\phi \) for each clause \(c_i\) and each variable \(x_j\) in it as follows:

  1. If \(c_i\) contains a positive literal of variable \(x_j\), insert \((c_i,1,x_j)\) into \(I_\phi \),

  2. If \(c_i\) contains a negative literal of variable \(x_j\), insert \((c_i,0,x_j)\) into \(I_\phi \).

In total, \(3N_c\) tuples are created. \(AB\rightarrow {C}\) guarantees that exactly one of the three tuples survives in any S-repair once the corresponding clause is satisfied, and \(C\rightarrow {B}\) guarantees the consistency of variable assignment.

Similarly, from the perspective of the problem OPTSR, \(AB \rightarrow {C}\) guarantees that any S-repair J of \(I_\phi \) contains at most one of the three tuples with the same value \(c_i\) on attribute A, and \(C\rightarrow {B}\) guarantees that any S-repair J of \(I_\phi \) contains either \((c_i,1,x_j)\) or \((c_i,0,x_j)\) for \(1\le i\le N_c\) and \(1\le j\le N_v\).

From every S-repair J, we derive a variable assignment \(\tau \), s.t. \(\tau (x_j)=1\) if there exists a tuple \(t\in {J}\) in the form \((c_i,1,x_j)\), \(1 \le i \le N_c\), and \(\tau (x_j)=0\) otherwise.

\(\tau (\cdot )\) is valid, i.e., \(\tau (x_j)\) is either 0 or 1, but not both, for each \(1\le j\le N_v\), because J satisfies the functional dependencies. We adopt the same definitions of \(\tau _{\max }\), \(\mathcal {N}(\phi )\) and \(\mathcal {N}_{\max }(\phi )\). Then, we obtain the same formulas for \(\mathcal {N}(\phi )\) and \(\mathcal {N}_{\max }(\phi )\).

Deriving lower bound. We only show the derivation of the lower bound for \({{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}\). The derivations for \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\) and \({{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}\) are omitted here since the same bound can be obtained in the same way.

Let \(k>1\) and let J be a k-optimal S-repair of \(I_\phi \) such that

$$\begin{aligned} C_s(J^*,I_{\phi }) \le C_s(J,I_{\phi }) \le k\cdot C_s(J^*,I_{\phi }), \end{aligned}$$

thus we have

$$\begin{aligned} \frac{\mathcal {N}(\phi )}{\mathcal {N}_{\max }(\phi )}\ge & {} \frac{|I_\phi |-C_s(J,I_{\phi })}{|I_\phi |-C_s(J^*,I_{\phi })}\nonumber \\\ge & {} \frac{|I_\phi |-k\cdot C_s(J^*,I_{\phi })}{|I_\phi |-C_s(J^*,I_{\phi })}\nonumber \\= & {} 1+(1-k)\cdot \frac{C_s(J^*,I_{\phi })}{|I_\phi |-C_s(J^*,I_{\phi })}. \end{aligned}$$
(6)

Note that \(|I_\phi |=3N_c\). By Lemma 1, \(\mathcal {N}_{\max }(\phi )\ge \frac{7}{8}N_c\). Hence,

$$\begin{aligned}C_s(J^*,I_{\phi })=|I_\phi |-\mathcal {N}_{\max }(\phi ) \le 3N_c-\frac{7}{8}N_c=\frac{17}{8}N_c. \end{aligned}$$

Applying this fact to the right-hand side of inequality (6), the following inequality holds

$$\begin{aligned} \frac{C_s(J^*,I_{\phi })}{|I_\phi |-C_s(J^*,I_{\phi })}\le \frac{\frac{17}{8}N_c}{3N_c-\frac{17}{8}N_c}=\frac{17}{7}. \end{aligned}$$
(7)

Since \(k>1\), the factor \(1-k\) in inequality (6) is negative, so applying inequality (7) to inequality (6) yields

$$\begin{aligned} \frac{\mathcal {N}(\phi )}{\mathcal {N}_{\max }(\phi )}\ge 1+(1-k)\cdot \frac{17}{7}=\frac{24}{7}-\frac{17}{7}k. \end{aligned}$$
(8)

Inequality (8) implies that MAX-NM-E3SAT can be approximated with ratio \(\frac{24}{7}-\frac{17}{7}k\) if a k-optimal S-repair can be computed in polynomial time. Suppose \(k<143/136\); then \(\frac{24}{7}-\frac{17}{7}k>7/8\), which implies that MAX-NM-E3SAT admits an approximation with a ratio better than 7/8. This contradicts the hardness result obtained in [40]. Therefore, we have \(k\ge 143/136\), which is approximately 1.05. \(\square \)
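The threshold 143/136 is exactly the value of k at which the bound of inequality (8) drops to 7/8. A short exact-arithmetic check, offered as an illustration only (the function name is hypothetical):

```python
from fractions import Fraction


def sat_ratio(k):
    """Right-hand side of inequality (8): 24/7 - (17/7) * k."""
    return Fraction(24, 7) - Fraction(17, 7) * k


threshold = Fraction(143, 136)
at_threshold = sat_ratio(threshold)                   # bound at k = 143/136
just_below = sat_ratio(threshold - Fraction(1, 1000)) # bound for a smaller k

print(at_threshold, just_below > Fraction(7, 8), float(threshold))
```

Because the bound decreases linearly in k, any k strictly below 143/136 gives a ratio strictly above 7/8, which is the case excluded by the hardness result cited in the proof.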

Lemma 3

Let the FD set be \({{\Sigma }_{AB\leftrightarrow {AC}\leftrightarrow {BC}}}\). It is NP-hard to compute a \((69246103/69246100 - \epsilon )\)-optimal S-repair for any \(\epsilon >0\), even if every tuple in the instance has weight 1.

Proof

By merging the following \(\mathcal {L}_{\alpha , \beta }\)-reductions given in previous literature [3, 41, 49],

$$\begin{aligned} \text{MAX B29-3SAT}&\prec _{529,1} \text{3DM} \\ \text{3DM}&\prec _{1,1} \text{MAX 3SC} \\ \text{MAX 3SC}&\prec _{55,1} \text{MECT-B} \\ \text{MECT-B}&\prec _{\frac{7}{6},1} \text{OPTSR}({\mathsf {R}}, {{\Sigma }_{AB\leftrightarrow {AC}\leftrightarrow {BC}}}) \end{aligned}$$

we conclude that MAX B29-3SAT can be approximated within 680/679 if a \((69246103/69246100-\epsilon )\)-optimal S-repair can be computed in polynomial time for any \(\epsilon >0\) when \({\Sigma }\) is \({{\Sigma }_{AB\leftrightarrow {AC}\leftrightarrow {BC}}}\), which is contrary to the hardness result shown in [29]. \(\square \)
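The constant in Lemma 3 can be reproduced from the parameters of the chained reductions. The sketch below rests on assumptions that the text does not spell out: that composing \(\mathcal {L}_{\alpha ,\beta }\)-reductions multiplies the \(\alpha \) and \(\beta \) parameters, and that a ratio of 680/679 for the maximization problem corresponds to a relative error of 1/680.

```python
from fractions import Fraction
from math import prod

# (alpha, beta) parameters of the four reductions listed in the proof.
reductions = [
    (Fraction(529), Fraction(1)),   # MAX B29-3SAT -> 3DM
    (Fraction(1), Fraction(1)),     # 3DM -> MAX 3SC
    (Fraction(55), Fraction(1)),    # MAX 3SC -> MECT-B
    (Fraction(7, 6), Fraction(1)),  # MECT-B -> OPTSR
]

# Assumption: parameters of composed L-reductions multiply.
alpha = prod(a for a, _ in reductions)
beta = prod(b for _, b in reductions)

# Assumption: ratio 680/679 on the source side means relative error 1/680.
source_error = Fraction(1, 680)

# Error tolerated on the OPTSR side so that alpha * beta * eps = 1/680.
eps = source_error / (alpha * beta)
bound = 1 + eps

print(alpha, beta, bound)
```

Under these assumptions the computed bound is 69246103/69246100, in agreement with the statement of Lemma 3.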

Appendix B: Proofs for OPTUR

Theorem 5

Let \({\mathsf {R}}\) be a fixed relation schema. For any finite fixed FD set \({\Sigma }\) over \({\mathsf {R}}\), the construction from instance I of \({\mathsf {R}}\) to the conflict hypergraph is an L-reduction with \(\alpha =1\), \(\beta =\frac{\sigma +2}{2}\).

Proof

From Definition 3, for any instance I of \({\mathsf {R}}\), each hyperedge in the conflict hypergraph of I represents a conflicting set of positions. Therefore, in any U-repair L of I (even optimal U-repair), at least one position in every hyperedge of \(G_{I,{\Sigma }}\) should get a new value to eliminate conflict. By denoting \({\textit{VC}}(G_{I,{\Sigma }})\) as the vertex cover cost of graph \(G_{I,{\Sigma }}\) and \({\textit{VC}}_{min}(G_{I,{\Sigma }})\) as the minimum vertex cover cost, we have

$$\begin{aligned} \textit{VC}_{min}(G_{I,{\Sigma }})\le C_{upd}(I,L^{*}). \end{aligned}$$
(9)

On the other side, according to [46], we have

$$\begin{aligned} C_{upd}(I,L)\le \left( \frac{MCI+2}{2}\right) {\textit{VC}}(G_{I,{\Sigma }}), \end{aligned}$$

where MCI denotes the size of the largest minimum core implicant (see Note 8) over all attributes in \(attr({\Sigma })\) [46].

For each attribute C in \(attr({\Sigma })\), the cardinality of the minimum core implicant of C is at most the number of functional dependencies in \({\Sigma }\), i.e., \(MCI\le \sigma \). Therefore,

$$\begin{aligned} C_{upd}(I,L)\le (\frac{\sigma +2}{2}){\textit{VC}}(G_{I,{\Sigma }}). \end{aligned}$$
(10)

From Eqs. (9) and (10), if \(\sigma \) is finite, the construction from I to the conflict hypergraph \(G_{I,{\Sigma }}\) is an L-reduction with \(\alpha =1\), \(\beta =\frac{\sigma +2}{2}\). \(\square \)

Theorem 5 shows that we can transform an update on the instance into a vertex cover of the conflict hypergraph with no extra cost, and, under a finite FD set \({\Sigma }\), a vertex cover of the conflict hypergraph into an update on the instance with extra cost. In addition, as L-reduction preserves membership in the NP class, Definition 3 assists the complexity analysis for OPTUR. We establish the following consequences of Theorem 5 and show the hardness of the OPTUR problem in certain representative cases.

Corollary 1

Let \({\mathsf {R}}\) be a fixed relation schema and I the instance of \({\mathsf {R}}\). For a finite FD set \({\Sigma }\), if there are two FDs in \({\Sigma }\) such that \(X\rightarrow A,Y\rightarrow A,X\ne Y\), it is NP-hard to compute an optimal U-repair for I with respect to \({\Sigma }\).

Proof

We investigate the problem of vertex cover on the conflict hypergraph \(G_{I,{\Sigma }}\) constructed from I and \({\Sigma }\). From Definition 3, we can deduce that all hyperedges in \(G_{I,{\Sigma }}\) cover more than 4 vertices and fewer than \(2\cdot |attr({\Sigma })|^2\) vertices (which is also finite). Therefore, as there are hyperedges covering different numbers of vertices in \(G_{I,{\Sigma }}\), the problem of vertex cover on \(G_{I,{\Sigma }}\) is NP-hard [9, 19]. Then, we can deduce that OPTUR under \({\Sigma }\) is NP-hard. \(\square \)

Corollary 2

For an FD set \({\Sigma }\), if there are two FDs in \({\Sigma }\) such that \(X\rightarrow A,Y\rightarrow B,A\in Y,(Y-A)\cap X=\varnothing ,B\ne X\), it is NP-hard to compute an optimal U-repair for I with respect to \({\Sigma }\).

Proof

The proof is the same as that of Corollary 1, except that the hyperedges in \(G_{I,{\Sigma }}\) cover more than 4 vertices and fewer than \(3\cdot |attr({\Sigma })|^3\) vertices (due to \(B\ne X\) and \((Y-A)\cap X=\varnothing \)). \(\square \)

As a result of Corollaries 1 and 2, we can easily deduce that OPTUR is NP-hard under each of \({{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}\), \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\) and \({{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}\). The FDs satisfying the condition of Corollary 1 for \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\) are {\(A\rightarrow B,C\rightarrow B\)}, which share the same right-hand side. The FDs satisfying the condition of Corollary 2 for \({{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}\) and \({{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}\) are {\(A\rightarrow B,B\rightarrow C\)} and {\(AB\rightarrow C,C\rightarrow B\)}, respectively. Our other contribution, though not general, is that we derive previously unknown lower bounds for these example cases.
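Which corollary applies to which FD set can be checked mechanically from the syntactic conditions. The sketch below is an illustration under one stated assumption: the condition \(B\ne X\) of Corollary 2 is read as the set inequality \(\{B\}\ne X\). The function names and the encoding of FDs as (left-hand side, right-hand side) pairs are hypothetical.

```python
from itertools import combinations, permutations

# Each FD is (lhs, rhs): lhs is a frozenset of attributes, rhs one attribute.
FD_SETS = {
    "A->B->C": [(frozenset("A"), "B"), (frozenset("B"), "C")],
    "A->B<-C": [(frozenset("A"), "B"), (frozenset("C"), "B")],
    "AB->C->B": [(frozenset("AB"), "C"), (frozenset("C"), "B")],
}


def corollary1_pairs(fds):
    """Pairs X -> A, Y -> A with the same right-hand side and X != Y."""
    return [(f, g) for f, g in combinations(fds, 2)
            if f[1] == g[1] and f[0] != g[0]]


def corollary2_pairs(fds):
    """Ordered pairs X -> A, Y -> B with A in Y and (Y - {A}) disjoint from X.

    Assumption: the condition B != X is interpreted as {B} != X.
    """
    found = []
    for (x, a), (y, b) in permutations(fds, 2):
        if a in y and not ((y - {a}) & x) and frozenset({b}) != x:
            found.append(((x, a), (y, b)))
    return found


for name, fds in FD_SETS.items():
    print(name,
          "Corollary 1:", len(corollary1_pairs(fds)),
          "Corollary 2:", len(corollary2_pairs(fds)))
```

Under this reading, each of the three FD sets meets the condition of exactly one of the two corollaries.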

Lemma 6

Let FD set be one of \({{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}\), \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\) and \({{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}\). Given an \(\epsilon >0\), it is NP-hard to compute a \((143/136-\epsilon )\)-optimal U-repair for OPTUR even if every tuple in the instance has weight 1.

Proof

Here we also reduce the problem MAX-NM-E3SAT to the problem OPTUR when \({\Sigma }\) is \({{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}\), \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\) or \({{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}\). We use the same notation \(\phi \), \(N_v\) and \(N_c\) as in the proof of Lemma 2.

Each reduction builds \(I_\phi \) of the relation schema R(ABC), in which every tuple has weight 1, for each expression \(\phi \).

Reduction for \({{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}\). A tuple is inserted into \(I_\phi \) for each clause \(c_i\) and each variable \(x_j\) in it as follows:

  1. If \(c_i\) contains a positive literal of variable \(x_j\), insert \((c_i,x_j,1)\) into \(I_\phi \),

  2. If \(c_i\) contains a negative literal of variable \(x_j\), insert \((c_i,x_j,0)\) into \(I_\phi \).

In total, \(3N_c\) tuples are created.

Intuitively, \(A\rightarrow {B}\) guarantees that for each clause, at least two of its three corresponding tuples should be updated to resolve conflict. \(B\rightarrow {C}\) guarantees the assignment of each variable is valid, i.e., either true or false, but not both. \(A\rightarrow {B}\rightarrow {C}\) guarantees:

  1. Once a clause is satisfied, at least two of its three corresponding tuples are updated in any U-repair, and the update with minimal cost changes only the value on B in each updated tuple.

  2. Once a clause is not satisfied, all three of its corresponding tuples are updated in any U-repair, and the update with minimal cost changes only the value on B in each tuple.

Hence, from every U-repair L, we derive a variable assignment \(\tau \), s.t. \(\tau (x_j)=1\) if there exists a tuple \(t\in {L}\) in the form \((c_i,x_j,1)\), \(1 \le i \le N_c\), with \(\sum \limits _{A\in attr(I_{\phi })} \chi _{I_{\phi }.t[A],L.t[A]}\) \(=0\) (i.e., t is not changed during the update from \(I_{\phi }\) to L), and \(\tau (x_j)=0\) otherwise. \(\tau (\cdot )\) is valid, i.e., \(\tau (x_j)\) is either 0 or 1, but not both, for each \(1\le j\le N_v\), because L satisfies the functional dependencies. Then, we can construct a U-repair \(L'\) in the following way, which derives the same assignment \(\tau (\cdot )\) as L, but changes only the value on attribute B in the corresponding tuples:

  1. Derive the assignment \(\tau (\cdot )\) from L.

  2. If a clause \(c_i=(x_{j1}\vee x_{j2}\vee x_{j3})\) is satisfied by the assignment \(\tau (x_{j1})\), insert \((c_i,x_{j1},1)\), \((c_i,x_{j1},1)\), \((c_i,x_{j1},1)\) into \(L'\). (At least one variable in \(\{x_{j1},x_{j2},x_{j3}\}\) has an assignment that satisfies the clause \(c_i\); for convenience of discussion, we use \(x_{j1}\) to refer to a representative variable that satisfies \(c_i\).) On the other side, if a clause \(c_i=({\bar{x}}_{j1}\vee {\bar{x}}_{j2}\vee {\bar{x}}_{j3})\) is satisfied by the assignment \(\tau (x_{j1})\), insert \((c_i,x_{j1},0)\), \((c_i,x_{j1},0)\), \((c_i,x_{j1},0)\) into \(L'\).

  3. If a clause \(c_i=(x_{j1}\vee x_{j2}\vee x_{j3})\) is not satisfied by any of the assignments \(\tau (x_{j1}),\tau (x_{j2}),\tau (x_{j3})\), select a variable \(x_k\) such that \(\tau (x_k)=1\) and insert \((c_i,x_{k},1)\), \((c_i,x_{k},1)\), \((c_i,x_{k},1)\) into \(L'\), or insert \((c_i,1,1)\), \((c_i,1,1)\), \((c_i,1,1)\) instead if no variable is assigned 1. On the other side, if a clause \(c_i=({\bar{x}}_{j1}\vee {\bar{x}}_{j2}\vee {\bar{x}}_{j3})\) is not satisfied by any of the assignments \(\tau (x_{j1}),\tau (x_{j2}),\tau (x_{j3})\), select a variable \(x_k\) such that \(\tau (x_k)=0\) and insert \((c_i,x_{k},0)\), \((c_i,x_{k},0)\), \((c_i,x_{k},0)\) into \(L'\), or insert \((c_i,0,0)\), \((c_i,0,0)\), \((c_i,0,0)\) instead if no variable is assigned 0.

Intuitively, among all update repairs that derive the assignment \(\tau (\cdot )\), the cost \(C_u(L',I_{\phi })\) is minimal. Let \(\tau _{\max }\) be an optimal variable assignment. \(\mathcal {N}(\phi )\) denotes the number of clauses in \(\phi \) satisfied by \(\tau \), and \(\mathcal {N}_{\max }(\phi )\) denotes the number of clauses in \(\phi \) satisfied by \(\tau _{\max }\). All notation is the same as in the proof of Lemma 2. Then, we have

$$\begin{aligned}C_u(L,I_{\phi })\ge C_u(L',I_{\phi })=2\mathcal {N}(\phi )+3(N_c-\mathcal {N}(\phi ))=3N_c-\mathcal {N}(\phi ),\end{aligned}$$

which is equivalent to

$$\begin{aligned} \mathcal {N}(\phi )\ge 3N_c-C_u(L,I_{\phi }).\end{aligned}$$

We can conclude that \(L'\) is an optimal U-repair of \(I_{\phi }\) if and only if the number of clauses in \(\phi \) satisfied by \(\tau \) is maximal, i.e., \(\mathcal {N}(\phi )=\mathcal {N}_{\max }(\phi )\). Therefore,

$$\begin{aligned}\mathcal {N}_{\max }(\phi )= 3N_c-C_u(L^*,I_{\phi }).\end{aligned}$$
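As a small worked instance of the cost identity (the numbers are hypothetical and chosen only for illustration), take \(N_c=4\) clauses of which the derived assignment satisfies \(\mathcal {N}(\phi )=3\). Each satisfied clause costs 2 and the single unsatisfied clause costs 3, so

$$\begin{aligned}C_u(L',I_{\phi })=2\cdot 3+3\cdot (4-3)=9=3\cdot 4-3=3N_c-\mathcal {N}(\phi ).\end{aligned}$$

Satisfying one more clause lowers the cost by exactly one, which is why minimizing the update cost and maximizing the number of satisfied clauses coincide here.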

Reduction for \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\) and \({{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}\). The reduction for \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\) and \({{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}\) is abbreviated, since the same result can be derived in the same way. We use the same notation as above.

A tuple is inserted into \(I_\phi \) for each clause \(c_i\) and each variable \(x_j\) in it as follows:

  1. If \(c_i\) contains a positive literal of variable \(x_j\), insert \((c_i,1,x_j)\) into \(I_\phi \),

  2. If \(c_i\) contains a negative literal of variable \(x_j\), insert \((c_i,0,x_j)\) into \(I_\phi \).

In total, \(3N_c\) tuples are created.
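A minimal sketch of this construction follows, together with a generic functional-dependency check. It assumes that tuple positions 0, 1, 2 correspond to attributes A, B, C and reads \({{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}\) as the two dependencies \(A\rightarrow B\) and \(C\rightarrow B\); the function names and the clause encoding are hypothetical.

```python
from collections import defaultdict


def build_instance(clauses):
    """One tuple (c_i, polarity, x_j) per literal; polarity 1 = positive, 0 = negative."""
    return [(i, pol, v) for i, (pol, vs) in enumerate(clauses) for v in vs]


def fd_violations(rows, lhs, rhs):
    """Return the lhs value combinations that map to more than one rhs combination."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[p] for p in lhs)
        seen[key].add(tuple(row[p] for p in rhs))
    return {key for key, values in seen.items() if len(values) > 1}


# Two non-mixed clauses: (x1 v x2 v x3) and (not x1 v not x2 v not x4).
clauses = [(1, ("x1", "x2", "x3")), (0, ("x1", "x2", "x4"))]
rows = build_instance(clauses)
a_to_b = fd_violations(rows, lhs=(0,), rhs=(1,))  # A -> B
c_to_b = fd_violations(rows, lhs=(2,), rhs=(1,))  # C -> B
```

In this sketch, `a_to_b` is empty because every clause is non-mixed, while `c_to_b` contains exactly the variables that occur with both polarities, which are the conflicts a repair has to resolve.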

From every U-repair L, we derive a variable assignment \(\tau \) such that \(\tau (x_j)=1\) if there exists a tuple \(t\in {L}\) of the form \((c_i,1,x_j)\) with \(\sum \limits _{A\in attr(I_{\phi })} \chi _{I_{\phi }.t[A],L.t[A]}=0\) (i.e., t is not changed during the update from \(I_{\phi }\) to L), \(1 \le i \le N_c\), and \(\tau (x_j)=0\) otherwise.
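The rule for reading an assignment off a repair can be sketched as follows. The sketch assumes that the tuples of \(I_{\phi }\) and L are matched by position (standing in for tuple identifiers) and that "not changed" means all three attribute values are equal; the function name is hypothetical.

```python
def derive_assignment(original, repaired, variables):
    """tau(x_j) = 1 iff some tuple (c_i, 1, x_j) of the original survives unchanged."""
    tau = {v: 0 for v in variables}
    for before, after in zip(original, repaired):
        _, polarity, var = before
        if before == after and polarity == 1:
            tau[var] = 1
    return tau


# Instance for (x1 v x2 v x3) and (not x1 v not x2 v not x4).
original = [(0, 1, "x1"), (0, 1, "x2"), (0, 1, "x3"),
            (1, 0, "x1"), (1, 0, "x2"), (1, 0, "x4")]
# A candidate repair that keeps (0, 1, x1) and (1, 0, x2) and rewrites the rest.
repaired = [(0, 1, "x1"), (0, 1, "x1"), (0, 1, "x1"),
            (1, 0, "x2"), (1, 0, "x2"), (1, 0, "x2")]
tau = derive_assignment(original, repaired, ["x1", "x2", "x3", "x4"])
```

Only the unchanged positive tuple for `x1` sets a variable to 1 here; the unchanged negative tuple for `x2` leaves `x2` at its default value 0.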

Then, we can construct a U-repair \(L'\) in the following way, which derives the same assignment \(\tau (\cdot )\) as L but changes only the value of attribute C (the one holding the variable) in the corresponding tuples:

  1. Derive the assignment \(\tau (\cdot )\) from L.

  2. If a clause \(c_i=(x_{j1}\vee x_{j2}\vee x_{j3})\) is satisfied by the assignment \(\tau (x_{j1})\), insert \((c_i,1,x_{j1})\), \((c_i,1,x_{j1})\), \((c_i,1,x_{j1})\) into \(L'\). On the other hand, if a clause \(c_i=({\bar{x}}_{j1}\vee {\bar{x}}_{j2}\vee {\bar{x}}_{j3})\) is satisfied by the assignment \(\tau (x_{j1})\), insert \((c_i,0,x_{j1})\), \((c_i,0,x_{j1})\), \((c_i,0,x_{j1})\) into \(L'\).

  3. If a clause \(c_i=(x_{j1}\vee x_{j2}\vee x_{j3})\) is satisfied by none of the assignments \(\tau (x_{j1}),\tau (x_{j2}),\tau (x_{j3})\), select a variable \(x_k\) such that \(\tau (x_k)=1\) and insert \((c_i,1,x_{k})\), \((c_i,1,x_{k})\), \((c_i,1,x_{k})\) into \(L'\), or insert \((c_i,1,1)\), \((c_i,1,1)\), \((c_i,1,1)\) instead if no variable is assigned 1. On the other hand, if a clause \(c_i=({\bar{x}}_{j1}\vee {\bar{x}}_{j2}\vee {\bar{x}}_{j3})\) is satisfied by none of the assignments \(\tau (x_{j1}),\tau (x_{j2}),\tau (x_{j3})\), select a variable \(x_k\) such that \(\tau (x_k)=0\) and insert \((c_i,0,x_{k})\), \((c_i,0,x_{k})\), \((c_i,0,x_{k})\) into \(L'\), or insert \((c_i,0,0)\), \((c_i,0,0)\), \((c_i,0,0)\) instead if no variable is assigned 0.

Intuitively, among all update repairs that derive the assignment \(\tau (\cdot )\), the cost \(C_u(L',I_{\phi })\) is minimal. Therefore, we have \(\mathcal {N}(\phi )\ge 3N_c-C_u(L,I_{\phi })\) and \(\mathcal {N}_{\max }(\phi )= 3N_c-C_u(L^*,I_{\phi })\).

Deriving the lower bound. As we can see, the formulas for \(\mathcal {N}(\phi )\) and \(\mathcal {N}_{\max }(\phi )\) are the same as those in the proof of LEMMA 2. Similarly,

$$\begin{aligned} \frac{\mathcal {N}(\phi )}{\mathcal {N}_{\max }(\phi )}>\frac{24}{7}-\frac{17}{7}k \end{aligned}$$
(11)

and we can conclude that the problem MAX-NM-E3SAT can be approximated with ratio \(\frac{24}{7}-\frac{17}{7}k\) if a k-optimal U-repair can be computed in polynomial time. Suppose \(k<143/136\); then \(\frac{24}{7}-\frac{17}{7}k>7/8\), which implies that the MAX-NM-E3SAT problem admits an approximation with ratio better than 7/8. This contradicts the hardness result obtained in [40]. Therefore, we have \(k\ge 143/136\). \(\square \)
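The arithmetic behind the threshold \(143/136\) can be checked exactly with rational numbers. The snippet below is only a sanity check of the final step; the function and constant names are hypothetical.

```python
from fractions import Fraction


def sat_ratio_bound(k):
    """Right-hand side of inequality (11): 24/7 - (17/7) * k."""
    return Fraction(24, 7) - Fraction(17, 7) * k


THRESHOLD = Fraction(143, 136)

# At the threshold the bound equals 7/8 exactly.
at_threshold = sat_ratio_bound(THRESHOLD)

# Any strictly smaller k gives a ratio strictly better than 7/8.
just_below = sat_ratio_bound(THRESHOLD - Fraction(1, 1000))
```

Because the bound decreases linearly in k, the value \(k=143/136\) is exactly the point where it crosses 7/8, and an exact repair (\(k=1\)) gives ratio 1.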

Cite this article

Miao, D., Zhang, P., Li, J. et al. Approximation and inapproximability results on computing optimal repairs. The VLDB Journal 32, 173–197 (2023). https://doi.org/10.1007/s00778-022-00738-0

  • DOI: https://doi.org/10.1007/s00778-022-00738-0
