Computing Maximal Likelihood Subset Repair for Inconsistent Data

  • Conference paper
Web and Big Data (APWeb-WAIM 2023)

Abstract

In this paper, we study the problem of subset repair under integrity constraints. For an inconsistent data set, a subset repair removes a minimal set of tuples such that the remaining tuples no longer violate the integrity constraints. Multiple subset repairs usually exist, and it is difficult to determine which one is optimal. Most previous work prefers the repair that deletes the fewest tuples, to avoid excessive removal and information loss. However, when the majority of tuples in a local scope are dirty, this criterion deletes clean tuples and retains dirty ones. We observe that, under a proper model, the correctness probabilities of clean tuples are often larger than those of dirty tuples. We therefore propose to select the subset repair with maximum likelihood, which retains as many tuples with large correctness probability as possible. We first formalize the maximum likelihood subset repair problem and analyze its hardness. We then propose a correctness probability model, together with a scalable inference approach. Finally, we propose an efficient approximate algorithm to compute the maximum likelihood subset repair. Extensive experiments on real-world datasets show that our proposal achieves higher precision and recall than state-of-the-art methods.
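
To make the contrast between the two selection criteria concrete, the following is a minimal sketch on a toy relation, not the algorithm proposed in the paper. The relation, the functional dependency zip_code -> city, the probability values, and the scoring function (product of p over retained tuples and 1 - p over deleted tuples) are all assumptions introduced for illustration; the paper's formal definition of likelihood may differ.

```python
from itertools import combinations
from math import prod

# Hypothetical toy relation: (id, zip_code, city, assumed correctness probability).
# t1 is the clean tuple; t2 and t3 are dirty but form the local majority.
TUPLES = [
    ("t1", "10001", "New York", 0.9),
    ("t2", "10001", "Newark", 0.2),
    ("t3", "10001", "Newark", 0.3),
]

def violates(a, b):
    """Assumed constraint: FD zip_code -> city (same zip, different city)."""
    return a[1] == b[1] and a[2] != b[2]

def is_consistent(kept):
    """A set of tuples is consistent if no pair violates the FD."""
    return not any(violates(a, b) for a, b in combinations(kept, 2))

def subset_repairs(tuples):
    """Enumerate all maximal consistent subsets by brute force (toy sizes only)."""
    consistent = [
        frozenset(c)
        for r in range(len(tuples) + 1)
        for c in combinations(tuples, r)
        if is_consistent(c)
    ]
    # Keep only subsets that are not strictly contained in another consistent subset.
    return [s for s in consistent if not any(s < other for other in consistent)]

def likelihood(kept, tuples):
    """Assumed score: p for each retained tuple, (1 - p) for each deleted tuple."""
    return prod(t[3] if t in kept else 1 - t[3] for t in tuples)

repairs = subset_repairs(TUPLES)
# Criterion of most previous work: delete as few tuples as possible.
min_deletion = max(repairs, key=len)
# Criterion studied here: maximize the (assumed) likelihood of the repair.
max_likelihood = max(repairs, key=lambda s: likelihood(s, TUPLES))

print(sorted(t[0] for t in min_deletion))
print(sorted(t[0] for t in max_likelihood))
```

On this toy input the minimum-deletion criterion keeps the two dirty tuples, while the likelihood-based criterion keeps the single clean tuple, which is the failure mode the abstract describes. The brute-force enumeration is exponential in the number of tuples and is only meant to illustrate the definitions.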

Notes

  1. https://github.com/HoloClean/holo-clean/tree/master/testdata/flight.csv

  2. https://github.com/HoloClean/holo-clean/tree/master/testdata/hospital.csv

  3. https://github.com/minhptx/spade

  4. https://db.unibas.it/projects/bart/

Author information

Corresponding author

Correspondence to Anzhen Zhang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhang, A., Hu, S., Zong, C., Li, J., Xia, X. (2024). Computing Maximal Likelihood Subset Repair for Inconsistent Data. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14332. Springer, Singapore. https://doi.org/10.1007/978-981-97-2390-4_1

  • DOI: https://doi.org/10.1007/978-981-97-2390-4_1

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2389-8

  • Online ISBN: 978-981-97-2390-4

  • eBook Packages: Computer Science, Computer Science (R0)
