Approximation and inapproximability results on computing optimal repairs

Miao, Dongjing; Zhang, Pengfei; Li, Jianzhong; Wang, Ye; Cai, Zhipeng

doi:10.1007/s00778-022-00738-0


Regular Paper
Published: 12 April 2022

Volume 32, pages 173–197, (2023)

The VLDB Journal

Dongjing Miao ORCID: orcid.org/0000-0001-9370-7088¹,
Pengfei Zhang²,
Jianzhong Li³,
Ye Wang⁴ &
Zhipeng Cai⁵


Abstract

Computing optimal subset repairs and optimal update repairs of an inconsistent database has a wide range of applications, and both are becoming standalone research problems. However, these problems have not been well studied in terms of either inapproximability or approximation algorithms. In this paper, we prove a new, tighter inapproximability bound for computing optimal subset repairs. We show that it is frequently NP-hard to approximate an optimal subset repair within a factor better than 143/136. We develop an algorithm for computing optimal subset repairs with an approximation ratio $(2-1/2^{\sigma -1})$, where $\sigma $ is the number of functional dependencies. We improve it when the database contains a large number of quasi-Turán clusters. We then extend our work to computing optimal update repairs. We show that it is NP-hard to approximate an optimal update repair within a factor better than 143/136 for representative cases. We further develop an approximation algorithm for computing optimal update repairs with an approximation ratio mlc(${\Sigma }$)$(2-1/2^{\sigma -1})$, where mlc(${\Sigma }$) depends on the given functional dependencies. We conduct experiments on real data to examine the performance and effectiveness of the proposed approximation algorithms.


Notes

Fact-wise reduction is a kind of strict reduction that is formally defined in [49]. For any $\epsilon >0$, if problem B has a $(1+\epsilon )$-approximation, then problem A has a $(1+\epsilon )$-approximation whenever there is a fact-wise reduction from A to B.
https://data.world/datafiniti/consumer-reviews-of-amazon-products
https://data.world/datafiniti/grammar-and-online-product-reviews
http://www.geonames.org/
http://results.openaddresses.io/
https://dblp.org/xml/
https://www.gnu.org/software/glpk/
An implicant of an attribute A is a set X of attributes such that $X\rightarrow A$ can be derived from ${\Sigma }$. A core implicant of A is a minimal set C of attributes that hits every implicant of A (i.e., $X\cap C\ne \varnothing $ for each implicant X of A). A minimum core implicant of A is a core implicant of A with the smallest cardinality.

References

Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases: The Logical Level. Addison-Wesley, Boston (1995)
Afrati, F.N., Kolaitis, P.G.: Repair checking in inconsistent databases: algorithms and complexity. In: ICDT, pp. 31–41 (2009)
Amini, O., Pérennes, S., Sau, I.: Hardness and approximation of traffic grooming. Theor. Comput. Sci. 410(38–40), 3751–3760 (2009)
Arenas, M., Bertossi, L., Chomicki, J.: Consistent query answers in inconsistent databases. In: PODS, pp. 68–79 (1999)
Arenas, M., Bertossi, L., Chomicki, J.: Answer sets for consistent query answering in inconsistent databases. Theor. Pract. Log. Prog. 3(4), 393–424 (2003)
Arenas, M., Bertossi, L., Chomicki, J., He, X., Raghavan, V., Spinrad, J.: Scalar aggregation in inconsistent databases. Theor. Comput. Sci. 296(3), 405–434 (2003)
Assadi, A., Milo, T., Novgorodov, S.: $\text{DANCE}$: data cleaning with constraints and experts. In: ICDE, pp. 1409–1410 (2017)
Bar-Yehuda, R., Even, S.: A linear-time approximation algorithm for the weighted vertex cover problem. J. Algorithms 2(2), 198–203 (1981)
Bellare, M., Goldwasser, S., Lund, C., Russell, A.: Efficient probabilistically checkable proofs and applications to approximations. In: STOC, pp. 294–304 (1993)
Bergman, M., Milo, T., Novgorodov, S., Tan, W.C.: $\text{ QOCO }$: a query oriented data cleaning system with oracles. PVLDB 8(12), 1900–1903 (2015)
Bertossi, L.: Database repairs and consistent query answering: origins and further developments. In: PODS, pp. 48–58 (2019)
Bertossi, L.: Repair-based degrees of database inconsistency. In: LPNMR, pp. 195–209 (2019)
Bertossi, L., Bravo, L., Franconi, E., Lopatenko, A.: Fixing numerical attributes under integrity constraints. In: Proceedings of International Symposium on Database Programming Languages (DBPL 05). Springer LNCS, vol. 3774, pp. 262–278 (2005)
Bertossi, L., Bravo, L., Franconi, E., Lopatenko, A.: The complexity and approximation of fixing numerical attributes in databases under integrity constraints. Inf. Syst. 33(4), 407–434 (2008)
Bohannon, P., Fan, W., Flaster, M., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD, pp. 143–154 (2005)
Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for data cleaning. In: ICDE, pp. 746–755 (2007)
Boria, N., Croce, F.D., Paschos, V.T.: On the max min vertex cover problem. Discrete Appl. Math. 196, 62–71 (2015)
Caniupán, M., Bertossi, L.: The consistency extractor system: answer set programs for consistent query answering in databases. Data Knowl. Eng. 69(6), 545–572 (2010)
Cardinal, J., Karpinski, M., Schmied, R., Viehmann, C.: Approximating vertex cover in dense hypergraphs. J. Discrete Algorithms 13, 67–77 (2012). https://doi.org/10.1016/j.jda.2012.01.003
Caruccio, L., Vincenzo, D., Polese, G.: Mining relaxed functional dependencies from data. Data Min. Knowl. Discov. (2019)
Chen, J., Kanj, I.A., Xia, G.: Improved upper bounds for vertex cover. Theor. Comput. Sci. 411(40), 3736–3756 (2010)
Chiang, F., Miller, R.J.: A unified model for data and constraint repair. In: ICDE, pp. 446–457 (2011)
Chomicki, J., Marcinkowski, J.: Minimal-change integrity maintenance using tuple deletions. Inf. Comput. 197(1–2) (2005)
Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: ICDE, pp. 458–469 (2013)
Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: overview and emerging challenges. In: SIGMOD, pp. 2201–2206 (2016)
Chvatal, V.: A greedy heuristic for the set-covering problem. Math. Oper. Res. 4(3), 233–235 (1979). https://doi.org/10.1287/moor.4.3.233
Cohen, M.B., Lee, Y.T., Song, Z.: Solving linear programs in the current matrix multiplication time. J ACM 68(1), 1–39 (2021)
Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. PVLDB 7(6), 315–325 (2007)
Crescenzi, P.: A short guide to approximation preserving reductions. In: CCC, pp. 262–273 (1997)
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I.F., Ouzzani, M., Tang, N.: $\text{ NADEEF }$: a commodity data cleaning system. In: SIGMOD, pp. 541–552 (2013)
De Sa, C., Ilyas, I.F., Kimelfeld, B., Ré, C., Rekatsinas, T.: A formal framework for probabilistic unclean databases. In: ICDT, pp. 26–28 (2019)
Dixit, A.A.: $\text{ CAvSAT }$: a system for query answering over inconsistent databases. In: SIGMOD, pp. 1823–1825 (2019)
Dixit, A.A., Kolaitis, P.G.: A $\text{ SAT }$-based system for consistent query answering. In: SAT, pp. 117–135 (2019)
Flesca, S., Furfaro, F., Parisi, F.: Consistent query answers on numerical databases under aggregate constraints. In: DBPL, pp. 279–294 (2005)
Flesca, S., Furfaro, F., Parisi, F.: Querying and repairing inconsistent numerical databases. ACM Trans. Database Syst. (2010). https://doi.org/10.1145/1735886.1735893
Franconi, E., Palma, A.L., Leone, N., Perri, S., Scarcello, F.: Census data repair: a challenging application of disjunctive logic programming. In: Logic for Programming, Artificial Intelligence, and Reasoning, pp. 561–578 (2001)
Gartner: Vendor Rating Service. https://www.gartner.com/en/research/methodologies/vendor-rating. Accessed 15 May 2020
Geerts, F., Mecca, G., Papotti, P., Santoro, D.: The llunatic data-cleaning framework. PVLDB 6(9), 625–636 (2013)
Golab, L., Ilyas, I.F., Beskales, G., Galiullin, A.: On the relative trust between inconsistent data and inaccurate constraints. In: ICDE, pp. 541–552 (2013)
Guruswami, V., Khot, S.: Hardness of $\text{ M }$ax $3\text{ SAT }$ with no mixed clauses. In: CCC, pp. 154–162 (2005)
Kann, V.: Maximum bounded 3-dimensional matching is $\text{ MAX } \text{ SNP }$-complete. Inf. Process. Lett. 37(1), 27–35 (1991)
Karakostas, G.: A better approximation ratio for the vertex cover problem. ACM Trans. Algorithms 5(4), 41:1-41:8 (2009)
Khot, S.: On the unique games conjecture. In: FOCS, p. 3 (2005)
Khot, S., Regev, O.: Vertex cover might be hard to approximate to within 2-$\epsilon $. J. Comput. Syst. Sci. 74(3), 335–349 (2008)
Kivinen, J., Mannila, H.: Approximate inference of functional dependencies from relations. Theor. Comput. Sci. 149(1), 129–149 (1995)
Kolahi, S., Lakshmanan, L.V.S.: On approximating optimum repairs for functional dependency violations. In: ICDT, pp. 53–62 (2009)
Kolaitis, P.G., Pema, E., Tan, W.C.: Efficient querying of inconsistent databases with binary integer programming. PVLDB 6(6), 397–408 (2013)
Koutris, P., Wijsen, J.: Consistent query answering for self-join-free conjunctive queries under primary key constraints. ACM Trans. Database Syst. 42(2), 1–45 (2017)
Livshits, E., Kimelfeld, B., Roy, S.: Computing optimal repairs for functional dependencies. ACM Trans. Database Syst. 45(1), 1–46 (2020)
Lopatenko, A., Bertossi, L.: Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics. In: ICDT, pp. 179–193 (2007)
Miao, D., Cai, Z., Li, J., Gao, X., Liu, X.: The computation of optimal subset repairs. Proc. VLDB Endow. 13(11), 2061–2074 (2020)
Nemhauser, G.L., Trotter, L.E.: Vertex packings: structural properties and algorithms. Math. Program. 8(4), 232–248 (1975)
Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holo$\text{ C }$lean: holistic data repairs with probabilistic inference. PVLDB 10(11), 1190–1201 (2017)
Salimi, B., Rodriguez, L., Howe, B., Suciu, D.: Interventional fairness: causal database repair for algorithmic fairness. In: SIGMOD, pp. 793–810 (2019)
Wijsen, J.: Condensed representation of database repairs for consistent query answering. In: ICDT, pp. 378–393 (2003)
Wijsen, J.: Database repairing using updates. In: SIGMOD, vol. 30 (2005)
Wijsen, J.: On the consistent rewriting of conjunctive queries under primary key constraints. Inf. Syst. 34(7), 578–601 (2009)
Wijsen, J.: Certain conjunctive query answering in first-order logic. ACM Trans. Database Syst. 37(2), 1–35 (2012)
Wijsen, J.: User-guided repairing of inconsistent knowledge bases. In: Proceedings of the 21st International Conference on Extending Database Technology (2018). https://doi.org/10.5441/002/EDBT.2018.13
Wijsen, J.: Foundations of query answering on inconsistent databases. SIGMOD Rec. 48(3), 6–16 (2019)
Zehavi, M.: Maximum minimal vertex cover parameterized by vertex cover. SIAM J. Discrete Math. 31(4), 2440–2456 (2017)


Acknowledgements

This work is partly supported by the National Natural Science Foundation of China (NSFC) Grant Nos. 61972110, 61832003, U1811461, and U19A2059.

Author information

Authors and Affiliations

Harbin Institute of Technology, Harbin, China
Dongjing Miao
University of Science and Technology of China, Hefei, China
Pengfei Zhang
Shenzhen Institute of Advanced Technology Chinese Academy of Sciences, Shenzhen, China
Jianzhong Li
Australian National University, Canberra, Australia
Ye Wang
Georgia State University, Atlanta, USA
Zhipeng Cai


Corresponding author

Correspondence to Dongjing Miao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proofs for OPTSR

Lemma 1

Let $\phi $ be the input expression of problem MAX-NM-E3SAT and $N_v$, $N_c$ be the number of variables and clauses in $\phi $. Let $\mathcal {N}_{\max }(\phi )$ denote the maximum number of clauses that can be satisfied in $\phi $. We have $\mathcal {N}_{\max }(\phi )\ge \frac{7}{8}N_c$.

Proof

There are $N_v$ variables in $\phi $ and hence $2^{N_v}$ possible variable assignments. For each assignment $\tau $, the number of clauses in $\phi $ satisfied by $\tau $ is fixed, denoted $\mathcal {N}(\phi )$. Each clause contains exactly 3 variables, so 7/8 of the $2^{N_v}$ assignments satisfy it while 1/8 of them do not. Summing over all assignments $\tau $, we therefore obtain

$$\begin{aligned} \sum \mathcal {N}(\phi )=\frac{7}{8}\cdot N_c \cdot 2^{N_v}. \end{aligned}$$

Then, we can see that the expectation of $\mathcal {N}(\phi )$ is

$$\begin{aligned}{\mathbf {E}}(\mathcal {N}(\phi ))=\frac{\sum \mathcal {N}(\phi )}{2^{N_v}}=\frac{7}{8}{N_c}.\end{aligned}$$

Therefore, there exists an assignment $\tau _{0}$ whose number of satisfied clauses, denoted $\mathcal {N}(\phi _{0})$, is at least the average, i.e., $\mathcal {N}(\phi _{0})\ge \mathcal {N}_{avg}(\phi )=\frac{7}{8}N_c$. Combined with $\mathcal {N}_{\max }(\phi ) \ge \mathcal {N}(\phi _{0})$, we can conclude that $\mathcal {N}_{\max }(\phi )\ge \frac{7}{8}N_c$. $\square $
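The averaging argument can be checked by brute force on a small formula. The following is a minimal sketch, not part of the original proof: the clause encoding (signed integers, with `-2` meaning the negation of $x_2$) and the function names are assumptions chosen for illustration, and each clause is assumed to contain three distinct variables.

```python
from fractions import Fraction
from itertools import product

def count_satisfied(clauses, bits):
    """Number of clauses satisfied when variable j has value bits[j - 1]."""
    return sum(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses)

def average_satisfied(clauses, n_vars):
    """Exact average of satisfied clauses over all 2^n_vars assignments."""
    total = sum(count_satisfied(clauses, bits)
                for bits in product([False, True], repeat=n_vars))
    return Fraction(total, 2 ** n_vars)

def max_satisfied(clauses, n_vars):
    """Maximum number of clauses satisfied by any assignment."""
    return max(count_satisfied(clauses, bits)
               for bits in product([False, True], repeat=n_vars))
```

On any such formula the average equals $\frac{7}{8}N_c$ exactly, and the maximum is at least the average, which is the statement of Lemma 1.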

Lemma 2

Let the FD set be one of ${{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}$, ${{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}$ and ${{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}$. For any $\epsilon >0$, it is NP-hard to compute a $(143/136-\epsilon )$-optimal S-repair for OPTSR, even if every tuple in the instance has weight 1.

Proof

Here we reduce the problem MAX-NM-E3SAT to Case 1 and Case 2. We use $\phi $ to denote the input expression of problem MAX-NM-E3SAT, and $N_v$, $N_c$ to denote the number of variables and clauses in $\phi $.

Each reduction builds $I_\phi $ of the relation schema R(A, B, C), in which every tuple has weight 1, for each expression $\phi $.

Reduction for ${{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}$. A tuple is inserted into $I_\phi $ for each clause $c_i$ and each variable $x_j$ in it as follows:

1.
If $c_i$ contains a positive literal of variable $x_j$, insert $(c_i,x_j,x_j)$ into $I_\phi $,
2.
If $c_i$ contains a negative literal of variable $x_j$, insert $(c_i,x_j,{\bar{x}}_j)$ into $I_\phi $.

In total, $3N_c$ tuples are created. Intuitively, $A\rightarrow {B}$ guarantees that exactly one of its three corresponding tuples survives in any S-repair once a clause is satisfied. $B\rightarrow {C}$ guarantees that the assignment of each variable is valid, i.e., either true or false, but not both.

From the perspective of the problem OPTSR, $A \rightarrow {B}$ guarantees that any S-repair J of $I_\phi $ contains at most one of the three tuples with the same value $c_i$ on attribute A, where $1\le i\le N_c$. $B\rightarrow {C}$ guarantees that, for each variable $x_j$ with $1\le j\le N_v$, any S-repair J of $I_\phi $ contains tuples of the form $(c_i,x_j,x_j)$ or tuples of the form $(c_i,x_j,{\bar{x}}_j)$, but not both, where $1\le i\le N_c$.

From every S-repair J, we derive a variable assignment $\tau $, s.t. $\tau (x_j)=1$ if there exists a tuple $t\in {J}$ in the form $(c_i,x_j,x_j)$, $1 \le i \le N_c$, and $\tau (x_j)=0$ otherwise.

$\tau (\cdot )$ is valid in that $\tau (x_j)$ is either 0 or 1, but not both, for each $1\le j\le N_v$, because J satisfies the functional dependencies. Let $\tau _{\max }$ be an optimal variable assignment, let $\mathcal {N}(\phi )$ denote the number of clauses in $\phi $ satisfied by $\tau $, and let $\mathcal {N}_{\max }(\phi )$ denote the number of clauses in $\phi $ satisfied by $\tau _{\max }$. Then, for an optimal S-repair $J^*$, we have

$$\begin{aligned} \mathcal {N}_{\max }(\phi ) = |I_\phi | - C_s(J^*,I_{\phi }) \end{aligned}$$

and for any solution J of $I_\phi $,

$$\begin{aligned} \mathcal {N}(\phi ) \ge |I_\phi | - C_s(J,I_{\phi }). \end{aligned}$$
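The identity $\mathcal {N}_{\max }(\phi ) = |I_\phi | - C_s(J^*,I_{\phi })$ can be illustrated on a toy formula. This is a hedged sketch under stated assumptions: clauses are encoded as triples of signed integers, the string `"~x1"` stands for ${\bar{x}}_1$, every tuple has weight 1, and the optimal S-repair is found by exhaustive search, which is only feasible for very small instances. All function names are hypothetical.

```python
from itertools import combinations, product

def build_instance(clauses):
    """Build I_phi over R(A, B, C): one tuple per (clause, literal) pair."""
    instance = []
    for i, clause in enumerate(clauses):
        for lit in clause:
            var = f"x{abs(lit)}"
            instance.append((f"c{i}", var, var if lit > 0 else "~" + var))
    return instance

def is_consistent(tuples):
    """Check the FDs A -> B and B -> C on every pair of tuples."""
    for s, t in combinations(tuples, 2):
        if s[0] == t[0] and s[1] != t[1]:   # violates A -> B
            return False
        if s[1] == t[1] and s[2] != t[2]:   # violates B -> C
            return False
    return True

def optimal_subset_repair_cost(instance):
    """Fewest deletions leaving a consistent subset (exhaustive search)."""
    for kept in range(len(instance), -1, -1):
        if any(is_consistent(sub) for sub in combinations(instance, kept)):
            return len(instance) - kept

def max_satisfied(clauses, n_vars):
    """Maximum number of clauses satisfied by any assignment."""
    return max(
        sum(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
        for bits in product([False, True], repeat=n_vars)
    )
```

For a formula with four non-mixed clauses over four variables, the instance has 12 tuples, and the number of tuples kept by an optimal S-repair coincides with the maximum number of satisfiable clauses.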

Reduction for ${{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}$. A tuple is inserted into $I_\phi $ for each clause $c_i$ and each variable $x_j$ in it as follows:

1.
If $c_i$ contains a positive literal of variable $x_j$, insert $(c_i,x_j,x_j)$ into $I_\phi $,
2.
If $c_i$ contains a negative literal of variable $x_j$, insert $(c_i,{\bar{x}}_j,x_j)$ into $I_\phi $.

In total, $3N_c$ tuples are created. Similarly, $A\rightarrow {B}$ guarantees that exactly one of its three corresponding tuples survives in any S-repair once a clause is satisfied, and $C\rightarrow {B}$ guarantees the consistency of variable assignment.

Similarly, from the perspective of the problem OPTSR, $A \rightarrow {B}$ guarantees that any S-repair J of $I_\phi $ contains at most one of the three tuples with the same value $c_i$ on attribute A, and $C\rightarrow {B}$ guarantees that, for each variable $x_j$ with $1\le j\le N_v$, any S-repair J of $I_\phi $ contains tuples of the form $(c_i,x_j,x_j)$ or tuples of the form $(c_i,{\bar{x}}_j,x_j)$, but not both, where $1\le i\le N_c$.

From every S-repair J, we derive a variable assignment $\tau $, s.t. $\tau (x_j)=1$ if there exists a tuple $t\in {J}$ in the form $(c_i,x_j,x_j)$, $1 \le i \le N_c$, and $\tau (x_j)=0$ otherwise.

$\tau (\cdot )$ is valid in that $\tau (x_j)$ is either 0 or 1, but not both, for each $1\le j\le N_v$, because J satisfies the functional dependencies. We adopt the same definitions of $\tau _{\max }$, $\mathcal {N}(\phi )$ and $\mathcal {N}_{\max }(\phi )$ as in the reduction for ${{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}$. Then, we have the same formula

$$\begin{aligned} \mathcal {N}_{\max }(\phi ) = |I_\phi | - C_s(J^*,I_{\phi }) \end{aligned}$$

and for any solution J of $I_\phi $,

$$\begin{aligned} \mathcal {N}(\phi ) \ge |I_\phi | - C_s(J,I_{\phi }). \end{aligned}$$

Reduction for ${{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}$. A tuple is inserted into $I_\phi $ for each clause $c_i$ and each variable $x_j$ in it as follows:

1.
If $c_i$ contains a positive literal of variable $x_j$, insert $(c_i,1,x_j)$ into $I_\phi $,
2.
If $c_i$ contains a negative literal of variable $x_j$, insert $(c_i,0,x_j)$ into $I_\phi $.

In total, $3N_c$ tuples are created. $AB\rightarrow {C}$ guarantees that exactly one of the three tuples survives in any S-repair once the corresponding clause is satisfied, and $C\rightarrow {B}$ guarantees the consistency of variable assignment.

Similarly, from the perspective of the problem OPTSR, $AB \rightarrow {C}$ guarantees that any S-repair J of $I_\phi $ contains at most one of the three tuples with the same value $c_i$ on attribute A, and $C\rightarrow {B}$ guarantees that, for each variable $x_j$ with $1\le j\le N_v$, any S-repair J of $I_\phi $ contains tuples of the form $(c_i,1,x_j)$ or tuples of the form $(c_i,0,x_j)$, but not both, where $1\le i\le N_c$.

From every S-repair J, we derive a variable assignment $\tau $, s.t. $\tau (x_j)=1$ if there exists a tuple $t\in {J}$ in the form $(c_i,1,x_j)$, $1 \le i \le N_c$, and $\tau (x_j)=0$ otherwise.

$\tau (\cdot )$ is valid in that $\tau (x_j)$ is either 0 or 1, but not both, for each $1\le j\le N_v$, because J satisfies the functional dependencies. We adopt the same definitions of $\tau _{\max }$, $\mathcal {N}(\phi )$ and $\mathcal {N}_{\max }(\phi )$. Then, we obtain the same formulas for $\mathcal {N}(\phi )$ and $\mathcal {N}_{\max }(\phi )$.

Deriving the lower bound. We only show the derivation of the lower bound for ${{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}$. The derivations for ${{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}$ and ${{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}$ are omitted since the same bound can be derived in the same way.

Let $k>1$ and let J be a k-optimal S-repair of $I_\phi $ such that

$$\begin{aligned} C_s(J^*,I_{\phi }) \le C_s(J,I_{\phi }) \le k\cdot C_s(J^*,I_{\phi }), \end{aligned}$$

thus we have

$$\begin{aligned} \frac{\mathcal {N}(\phi )}{\mathcal {N}_{\max }(\phi )}\ge & {} \frac{|I_\phi |-C_s(J,I_{\phi })}{|I_\phi |-C_s(J^*,I_{\phi })}\nonumber \\\ge & {} \frac{|I_\phi |-k\cdot C_s(J^*,I_{\phi })}{|I_\phi |-C_s(J^*,I_{\phi })}\nonumber \\= & {} 1+(1-k)\cdot \frac{C_s(J^*,I_{\phi })}{|I_\phi |-C_s(J^*,I_{\phi })}. \end{aligned}$$

(6)

Note that $|I_\phi |=3N_c$. By Lemma 1, $\mathcal {N}_{\max }(\phi )\ge \frac{7}{8}N_c$. Hence,

$$\begin{aligned}C_s(J^*,I_{\phi })=|I_\phi |-\mathcal {N}_{\max }(\phi ) \le 3N_c-\frac{7}{8}N_c=\frac{17}{8}N_c. \end{aligned}$$

Applying this fact to the right-hand side of inequality (6), the following inequality holds

$$\begin{aligned} \frac{C_s(J^*,I_{\phi })}{|I_\phi |-C_s(J^*,I_{\phi })}\le \frac{\frac{17}{8}N_c}{3N_c-\frac{17}{8}N_c}=\frac{17}{7}. \end{aligned}$$

(7)

Applying inequality (7) to inequality (6), and noting that $1-k<0$, we obtain

$$\begin{aligned} \frac{\mathcal {N}(\phi )}{\mathcal {N}_{\max }(\phi )}\ge \frac{24}{7}-\frac{17}{7}k \end{aligned}$$

(8)

Inequality (8) implies that the problem MAX-NM-E3SAT can be approximated with ratio $\frac{24}{7}-\frac{17}{7}k$ if a k-optimal S-repair can be computed in polynomial time. Suppose $k<143/136$; then $\frac{24}{7}-\frac{17}{7}k>7/8$, which implies that the MAX-NM-E3SAT problem admits an approximation with a ratio better than 7/8. This contradicts the hardness result obtained in [40]. Therefore, we have $k\ge 143/136$, which is at least 1.05. $\square $
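The arithmetic behind the threshold 143/136 can be verified with exact rational numbers. The snippet below is a small sanity check, not part of the proof; the function name `sat_ratio_bound` is a hypothetical helper introduced for illustration.

```python
from fractions import Fraction

def sat_ratio_bound(k):
    """Right-hand side of inequality (8): 24/7 - (17/7) * k."""
    return Fraction(24, 7) - Fraction(17, 7) * k

# Bound used in inequality (7): (17/8) / (3 - 17/8) = 17/7.
ratio_7 = Fraction(17, 8) / (3 - Fraction(17, 8))

# Value of k at which the bound of inequality (8) equals 7/8.
threshold = (Fraction(24, 7) - Fraction(7, 8)) / Fraction(17, 7)
```

Here `ratio_7` evaluates to 17/7 and `threshold` to 143/136 (about 1.0515), so any $k$ strictly below 143/136 pushes the bound strictly above 7/8.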

Lemma 3

Let the FD set be ${{\Sigma }_{AB\leftrightarrow {AC}\leftrightarrow {BC}}}$. Then it is NP-hard to compute a $(69246103/69246100 - \epsilon )$-optimal S-repair for any $\epsilon >0$, even if every tuple in the instance has weight 1.

Proof

By merging the following $\mathcal {L}_{\alpha , \beta }$-reductions given in previous literature [3, 41, 49],

$$\begin{aligned} \text{MAX B29-3SAT}&\prec _{529,1} \text{3DM} \\ \text{3DM}&\prec _{1,1} \text{MAX 3SC} \\ \text{MAX 3SC}&\prec _{55,1} \text{MECT-B}\\ \text{MECT-B}&\prec _{\frac{7}{6},1} \text{OPTSR}({\mathsf {R}}, {{\Sigma }_{AB\leftrightarrow {AC}\leftrightarrow {BC}}}) \end{aligned}$$

we conclude that MAX B29-3SAT can be approximated within 680/679 if a $(69246103/69246100-\epsilon )$-optimal S-repair can be computed in polynomial time for any $\epsilon >0$ when ${\Sigma }$ is ${{\Sigma }_{AB\leftrightarrow {AC}\leftrightarrow {BC}}}$, which is contrary to the hardness result shown in [29]. $\square $

Appendix B: Proofs for OPTUR

Theorem 5

Let ${\mathsf {R}}$ be a fixed relation schema. For any finite fixed FD set ${\Sigma }$ over ${\mathsf {R}}$, the construction from instance I of ${\mathsf {R}}$ to the conflict hypergraph is an L-reduction with $\alpha =1$, $\beta =\frac{\sigma +2}{2}$.

Proof

By Definition 3, for any instance I of ${\mathsf {R}}$, each hyperedge in the conflict hypergraph of I represents a conflicting set of positions. Therefore, in any U-repair L of I (including an optimal U-repair $L^*$), at least one position in every hyperedge of $G_{I,{\Sigma }}$ must receive a new value to eliminate the conflict. Denoting by ${\textit{VC}}(G_{I,{\Sigma }})$ the vertex cover cost of graph $G_{I,{\Sigma }}$ and by ${\textit{VC}}_{min}(G_{I,{\Sigma }})$ the minimum vertex cover cost, we have

$$\begin{aligned} \textit{VC}_{min}(G_{I,{\Sigma }})\le C_{upd}(I,L^*). \end{aligned}$$

(9)

On the other side, according to [46], we have

$$\begin{aligned} C_{upd}(I,L)\le \left( \frac{MCI+2}{2}\right) {\textit{VC}}(G_{I,{\Sigma }}), \end{aligned}$$

where MCI denotes the size of the largest minimum core implicant (defined in the Notes) over all attributes in $attr({\Sigma })$ [46].

For each attribute C in $attr({\Sigma })$, the cardinality of the minimum core implicant of C is at most the number of functional dependencies in ${\Sigma }$, i.e., $MCI\le \sigma $. Therefore,

$$\begin{aligned} C_{upd}(I,L)\le \left( \frac{\sigma +2}{2}\right) {\textit{VC}}(G_{I,{\Sigma }}). \end{aligned}$$

(10)

From Eqs. (9) and (10), if $\sigma $ is finite, the construction from I to the conflict hypergraph $G_{I,{\Sigma }}$ is an L-reduction with $\alpha =1$, $\beta =\frac{\sigma +2}{2}$. $\square $

Theorem 5 shows that an update on the instance can be transformed into a vertex cover of the conflict hypergraph at no extra cost and, for a finite FD set ${\Sigma }$, a vertex cover of the conflict hypergraph can be transformed into an update on the instance at an extra cost. In addition, since an L-reduction preserves membership in the NP class, Definition 3 assists the complexity analysis of OPTUR. We establish the following consequences of Theorem 5 and show the hardness of the OPTUR problem in certain representative cases.

Corollary 1

Let ${\mathsf {R}}$ be a fixed relation schema and I the instance of ${\mathsf {R}}$. For a finite FD set ${\Sigma }$, if there are two FDs in ${\Sigma }$ such that $X\rightarrow A,Y\rightarrow A,X\ne Y$, it is NP-hard to compute an optimal U-repair for I with respect to ${\Sigma }$.

Proof

We investigate the problem of vertex cover on the conflict hypergraph $G_{I,{\Sigma }}$ constructed from I and ${\Sigma }$. From Definition 3, we can deduce that every hyperedge in $G_{I,{\Sigma }}$ covers more than 4 vertices and fewer than $2\cdot |attr({\Sigma })|^2$ vertices (which is also finite). Therefore, as there are hyperedges covering different numbers of vertices in $G_{I,{\Sigma }}$, the problem of vertex cover on $G_{I,{\Sigma }}$ is NP-hard [9, 19]. It follows that OPTUR under ${\Sigma }$ is NP-hard. $\square $

Corollary 2

For an FD set ${\Sigma }$, if there are two FDs in ${\Sigma }$ such that $X\rightarrow A,Y\rightarrow B,A\in Y,(Y-A)\cap X=\varnothing ,B\ne X$, it is NP-hard to compute an optimal U-repair for I with respect to ${\Sigma }$.

Proof

The proof is the same as that of Corollary 1, except that the hyperedges in $G_{I,{\Sigma }}$ cover more than 4 vertices and fewer than $3\cdot |attr({\Sigma })|^3$ vertices (due to $B\ne X$ and $(Y-A)\cap X=\varnothing $). $\square $

As a result of Corollaries 1 and 2, we can easily deduce that OPTUR is NP-hard under each of ${{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}$, ${{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}$ and ${{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}$. The FDs satisfying the condition of Corollary 1 for ${{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}$ are {$A\rightarrow B,C\rightarrow B$}. The FDs satisfying the condition of Corollary 2 for ${{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}$ and ${{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}$ are {$A\rightarrow B,B\rightarrow C$} and {$AB\rightarrow C,C\rightarrow B$}, respectively. Our other contribution, though not general, is that we derive previously unknown lower bounds for these example cases.

Lemma 6

Let the FD set be one of ${{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}$, ${{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}$ and ${{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}$. For any $\epsilon >0$, it is NP-hard to compute a $(143/136-\epsilon )$-optimal U-repair for OPTUR, even if every tuple in the instance has weight 1.

Proof

Here we also reduce the problem MAX-NM-E3SAT to the problem OPTUR when ${\Sigma }$ is ${{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}$, ${{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}$ or ${{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}$. We use the same notation $\phi $, $N_v$ and $N_c$ as in the proof of Lemma 2.

Each reduction builds $I_\phi $ of the relation schema R(A, B, C), in which every tuple has weight 1, for each expression $\phi $.

Reduction for ${{\Sigma }_{A\rightarrow {B}\rightarrow {C}}}$. A tuple is inserted into $I_\phi $ for each clause $c_i$ and each variable $x_j$ in it as follows:

1.
If $c_i$ contains a positive literal of variable $x_j$, insert $(c_i,x_j,1)$ into $I_\phi $,
2.
If $c_i$ contains a negative literal of variable $x_j$, insert $(c_i,x_j,0)$ into $I_\phi $.

In total, $3N_c$ tuples are created.

Intuitively, $A\rightarrow {B}$ guarantees that for each clause, at least two of its three corresponding tuples should be updated to resolve conflict. $B\rightarrow {C}$ guarantees the assignment of each variable is valid, i.e., either true or false, but not both. $A\rightarrow {B}\rightarrow {C}$ guarantees:

1.
Once a clause is satisfied, then at least two of its three corresponding tuples are updated in any U-repair, and the update with minimal cost is to change only the value on B in each tuple.
2.
Once a clause is not satisfied, all three of its corresponding tuples are updated in any U-repair, and the update with minimal cost is to change only the value on B in each tuple.

Hence, from every U-repair L, we derive a variable assignment $\tau $ such that $\tau (x_j)=1$ if there exists a tuple $t\in {L}$ of the form $(c_i,x_j,1)$, $1 \le i \le N_c$, with $\sum \limits _{A\in attr(I_{\phi })} \chi _{I_{\phi }.t[A],L.t[A]}=0$ (i.e., t is not changed during the update from $I_{\phi }$ to L), and $\tau (x_j)=0$ otherwise. $\tau (\cdot )$ is valid in that $\tau (x_j)$ is either 0 or 1, but not both, for each $1\le j\le N_v$, because L satisfies the functional dependencies. We can then construct a U-repair $L'$ in the following way, which derives the same assignment $\tau (\cdot )$ as L but changes only the value on attribute B in the corresponding tuples:

1.
Derive the assignment $\tau (\cdot )$ from L.
2.
If a clause $c_i=(x_{j1}\vee x_{j2}\vee x_{j3})$ is satisfied by the assignment $\tau (x_{j1})$ (in fact, the assignment of at least one variable in $\{x_{j1},x_{j2},x_{j3}\}$ satisfies the clause $c_i$; for convenience, we use $x_{j1}$ to refer to a representative variable that satisfies $c_i$), insert $(c_i,x_{j1},1)$, $(c_i,x_{j1},1)$, $(c_i,x_{j1},1)$ into $L'$. On the other hand, if a clause $c_i=({\bar{x}}_{j1}\vee {\bar{x}}_{j2}\vee {\bar{x}}_{j3})$ is satisfied by the assignment $\tau (x_{j1})$, insert $(c_i,x_{j1},0)$, $(c_i,x_{j1},0)$, $(c_i,x_{j1},0)$ into $L'$.
3.
If a clause $c_i=(x_{j1}\vee x_{j2}\vee x_{j3})$ is satisfied by none of the assignments $\tau (x_{j1}),\tau (x_{j2}),\tau (x_{j3})$, select a variable $x_k$ such that $\tau (x_k)=1$ and insert $(c_i,x_{k},1)$, $(c_i,x_{k},1)$, $(c_i,x_{k},1)$ into $L'$, or insert $(c_i,1,1)$, $(c_i,1,1)$, $(c_i,1,1)$ instead if no variable is assigned 1. On the other hand, if a clause $c_i=({\bar{x}}_{j1}\vee {\bar{x}}_{j2}\vee {\bar{x}}_{j3})$ is satisfied by none of the assignments $\tau (x_{j1}),\tau (x_{j2}),\tau (x_{j3})$, select a variable $x_k$ such that $\tau (x_k)=0$ and insert $(c_i,x_{k},0)$, $(c_i,x_{k},0)$, $(c_i,x_{k},0)$ into $L'$, or insert $(c_i,0,0)$, $(c_i,0,0)$, $(c_i,0,0)$ instead if no variable is assigned 0.
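The construction of $L'$ in the three steps above can be sketched in code. This is an illustrative sketch under explicit assumptions, not the authors' implementation: clauses are non-mixed triples of signed integers, `tau` is a dictionary from variable index to 0 or 1, tuples of $I_\phi $ and $L'$ are aligned by position, and the update cost is taken to be the number of changed cells with unit weights. All function names are hypothetical.

```python
from itertools import combinations

def build_instance(clauses):
    """I_phi for A -> B -> C: tuple (c_i, x_j, 1) or (c_i, x_j, 0) per literal."""
    return [(i, f"x{abs(lit)}", 1 if lit > 0 else 0)
            for i, clause in enumerate(clauses) for lit in clause]

def build_repair(clauses, tau):
    """Construct L' from tau, changing only attribute B."""
    repair = []
    for i, clause in enumerate(clauses):
        sign = 1 if clause[0] > 0 else 0       # polarity of a non-mixed clause
        witnesses = [abs(l) for l in clause if tau[abs(l)] == sign]
        if witnesses:                           # step 2: clause is satisfied
            b = f"x{witnesses[0]}"
        else:                                   # step 3: clause is not satisfied
            others = [v for v in sorted(tau) if tau[v] == sign]
            b = f"x{others[0]}" if others else str(sign)
        repair.extend([(i, b, sign)] * 3)
    return repair

def satisfies_fds(tuples):
    """Check A -> B and B -> C on every pair of tuples."""
    return all((s[0] != t[0] or s[1] == t[1]) and (s[1] != t[1] or s[2] == t[2])
               for s, t in combinations(tuples, 2))

def update_cost(instance, repair):
    """Number of cells that differ between position-aligned tuples."""
    return sum(a != b for s, t in zip(instance, repair) for a, b in zip(s, t))

def num_satisfied(clauses, tau):
    """Number of clauses satisfied by tau."""
    return sum(any(tau[abs(l)] == (1 if l > 0 else 0) for l in c) for c in clauses)
```

In this sketch, a satisfied clause contributes two changed cells and an unsatisfied clause contributes three, so the total cost of $L'$ is $3N_c$ minus the number of satisfied clauses.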

Intuitively, among all update repairs that derive the assignment $\tau (\cdot )$, the cost $C_u(L',I_{\phi })$ is minimal. Let $\tau _{\max }$ be an optimal variable assignment, let $\mathcal {N}(\phi )$ denote the number of clauses in $\phi $ satisfied by $\tau $, and let $\mathcal {N}_{\max }(\phi )$ denote the number of clauses in $\phi $ satisfied by $\tau _{\max }$. All notation is the same as in the proof of Lemma 2. Then, we have

$$\begin{aligned}C_u(L,I_{\phi })\ge C_u(L',I_{\phi })=2\mathcal {N}(\phi )+3(N_c-\mathcal {N}(\phi )),\end{aligned}$$

which is equivalent to

$$\begin{aligned} \mathcal {N}(\phi )\ge 3N_c-C_u(L,I_{\phi }).\end{aligned}$$

We can conclude that $L'$ is an optimal U-repair of $I_{\phi }$ if and only if the number of clauses in $\phi $ satisfied by $\tau $ is maximal, i.e., $\mathcal {N}(\phi )=\mathcal {N}_{\max }(\phi )$. Therefore, for an optimal U-repair $L^*$,

$$\begin{aligned}\mathcal {N}_{\max }(\phi )= 3N_c-C_u(L^*,I_{\phi }).\end{aligned}$$
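To make the step between the displayed relations explicit, the following worked derivation uses only the symbols already defined above:

$$\begin{aligned} C_u(L',I_{\phi }) &= 2\mathcal {N}(\phi )+3(N_c-\mathcal {N}(\phi )) = 3N_c-\mathcal {N}(\phi ),\\ C_u(L,I_{\phi })\ge C_u(L',I_{\phi }) &\;\Longrightarrow \; \mathcal {N}(\phi )\ge 3N_c-C_u(L,I_{\phi }). \end{aligned}$$

Under the construction of $L'$, each satisfied clause contributes cost 2 (two of its three tuples are changed) and each unsatisfied clause contributes cost 3, so the cost decreases by one for every additional satisfied clause.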

Reduction for ${{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}$ and ${{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}$. The reduction for ${{\Sigma }_{A\rightarrow {B}\leftarrow {C}}}$ and ${{\Sigma }_{AB\rightarrow {C}\rightarrow {B}}}$ is abbreviated, since the same result can be derived in the same way. We use the same notation as before.

A tuple is inserted into $I_\phi $ for each clause $c_i$ and each variable $x_j$ in it as follows:

1. If $c_i$ contains a positive literal of variable $x_j$, insert $(c_i,1,x_j)$ into $I_\phi $.
2. If $c_i$ contains a negative literal of variable $x_j$, insert $(c_i,0,x_j)$ into $I_\phi $.

In total, $3N_c$ tuples are created.

From every U-repair L, we derive a variable assignment $\tau $ such that $\tau (x_j)=1$ if there exists a tuple $t\in {L}$ of the form $(c_i,1,x_j)$ with $\sum \limits _{A\in attr(I_{\phi })} \chi _{I_{\phi }.t[A],L.t[A]}=0$ (i.e., t is not changed during the update from $I_{\phi }$ to L), $1 \le i \le N_v$, and $\tau (x_j)=0$ otherwise.
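The instance construction and the derivation of $\tau $ for this reduction can be sketched as follows. This is an illustrative sketch only: the clause encoding `(positive, variables)` is a hypothetical choice, and the code assumes that the tuples of $I_{\phi }$ and of the repair are aligned by position so that "unchanged" can be tested by equality.

```python
from typing import Dict, List, Tuple

Clause = Tuple[bool, Tuple[str, str, str]]  # hypothetical: (positive?, variables)
Row = Tuple[int, int, str]                  # (clause index, value, variable)


def build_instance(clauses: List[Clause]) -> List[Row]:
    """Insert (c_i, 1, x_j) for positive and (c_i, 0, x_j) for negative literals."""
    return [(i, 1 if positive else 0, x)
            for i, (positive, variables) in enumerate(clauses)
            for x in variables]


def derive_assignment(instance: List[Row], repair: List[Row]) -> Dict[str, int]:
    """tau(x_j) = 1 iff some tuple (c_i, 1, x_j) is left unchanged by the repair."""
    tau = {x: 0 for (_, _, x) in instance}
    for before, after in zip(instance, repair):  # assumption: aligned by position
        if before == after and after[1] == 1:
            tau[after[2]] = 1
    return tau


clauses = [(True, ("x1", "x2", "x3")), (False, ("x1", "x2", "x4"))]
instance = build_instance(clauses)
# A hand-written candidate repair, used only to exercise derive_assignment.
repair = [(0, 1, "x1")] * 3 + [(1, 0, "x4")] * 3
tau = derive_assignment(instance, repair)
```

Here only the tuple `(0, 1, "x1")` survives unchanged with value 1, so `tau` maps `x1` to 1 and every other variable to 0.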

Then, we can construct a U-repair $L'$ in the following way, which derives the same assignment $\tau (\cdot )$ as L but changes only the value of attribute B in the corresponding tuples:

1. Derive the assignment $\tau (\cdot )$ from L.
2. If a clause $c_i=(x_{j1}\vee x_{j2}\vee x_{j3})$ is satisfied by the assignment $\tau (x_{j1})$, insert $(c_i,1,x_{j1})$, $(c_i,1,x_{j1})$, $(c_i,1,x_{j1})$ into $L'$. On the other hand, if a clause $c_i=({\bar{x}}_{j1}\vee {\bar{x}}_{j2}\vee {\bar{x}}_{j3})$ is satisfied by the assignment $\tau (x_{j1})$, insert $(c_i,0,x_{j1})$, $(c_i,0,x_{j1})$, $(c_i,0,x_{j1})$ into $L'$.
3. If a clause $c_i=(x_{j1}\vee x_{j2}\vee x_{j3})$ is not satisfied by any of the assignments $\tau (x_{j1}),\tau (x_{j2}),\tau (x_{j3})$, select a variable $x_k$ such that $\tau (x_k)=1$ and insert $(c_i,1,x_{k})$, $(c_i,1,x_{k})$, $(c_i,1,x_{k})$ into $L'$, or insert $(c_i,1,1)$, $(c_i,1,1)$, $(c_i,1,1)$ instead if no variable is assigned 1. On the other hand, if a clause $c_i=({\bar{x}}_{j1}\vee {\bar{x}}_{j2}\vee {\bar{x}}_{j3})$ is not satisfied by any of the assignments $\tau (x_{j1}),\tau (x_{j2}),\tau (x_{j3})$, select a variable $x_k$ such that $\tau (x_k)=0$ and insert $(c_i,0,x_{k})$, $(c_i,0,x_{k})$, $(c_i,0,x_{k})$ into $L'$, or insert $(c_i,0,0)$, $(c_i,0,0)$, $(c_i,0,0)$ instead if no variable is assigned 0.

Intuitively, among all update repairs that derive the assignment $\tau (\cdot )$, the cost $C_u(L',I_{\phi })$ is minimal. Therefore, we have $\mathcal {N}(\phi )\ge 3N_c-C_u(L,I_{\phi })$ and $\mathcal {N}_{\max }(\phi )= 3N_c-C_u(L^*,I_{\phi })$.

Deriving lower bound. As we can see, the formulas of $\mathcal {N}(\phi )$ and $\mathcal {N}_{\max }(\phi )$ are the same as those in the proof of LEMMA 2. Similarly,

$$\begin{aligned} \frac{\mathcal {N}(\phi )}{\mathcal {N}_{\max }(\phi )}>\frac{24}{7}-\frac{17}{7}k \end{aligned} \tag{11}$$

and we can conclude that MAX-NM-E3SAT can be approximated with ratio $\frac{24}{7}-\frac{17}{7}k$ if a k-optimal U-repair can be computed in polynomial time. Suppose $k<143/136$; then $\frac{24}{7}-\frac{17}{7}k>7/8$, which implies that MAX-NM-E3SAT admits an approximation with ratio better than 7/8. This contradicts the hardness result obtained in [40]. Therefore, we have $k\ge 143/136$. $\square $
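The arithmetic behind the threshold 143/136 can be checked with exact rational numbers. The sketch below is only a numerical sanity check of the bound in (11); the function names are hypothetical and not part of the paper.

```python
from fractions import Fraction


def maxsat_ratio_bound(k: Fraction) -> Fraction:
    """Right-hand side of (11): 24/7 - (17/7) k."""
    return Fraction(24, 7) - Fraction(17, 7) * k


def repair_ratio_threshold(r: Fraction) -> Fraction:
    """Solve 24/7 - (17/7) k = r for k."""
    return (24 - 7 * r) / 17


# The value of k at which the bound equals the 7/8 barrier.
threshold = repair_ratio_threshold(Fraction(7, 8))
```

Since the bound is strictly decreasing in k, any k below `threshold` gives a ratio above 7/8, which is the case the proof rules out.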

About this article

Cite this article

Miao, D., Zhang, P., Li, J. et al. Approximation and inapproximability results on computing optimal repairs. The VLDB Journal 32, 173–197 (2023). https://doi.org/10.1007/s00778-022-00738-0

Received: 03 July 2021
Revised: 23 February 2022
Accepted: 25 February 2022
Published: 12 April 2022
Issue Date: January 2023
DOI: https://doi.org/10.1007/s00778-022-00738-0
