skip to main content
10.1145/2882903.2882955acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Constraint-Variance Tolerant Data Repairing

Published:14 June 2016Publication History

ABSTRACT

Integrity constraints, guiding the cleaning of dirty data, are often found to be imprecise as well. Existing studies consider the inaccurate constraints that are oversimplified, and thus refine the constraints via inserting more predicates (attributes). We note that imprecise constraints may not only be oversimplified so that correct data are erroneously identified as violations, but also could be overrefined that the constraints overfit the data and fail to identify true violations. In the latter case, deleting excessive predicates applies.

To address the oversimplified and overrefined constraint inaccuracies, in this paper, we propose to repair data by allowing a small variation (with both predicate insertion and deletion) on the constraints. A novel θ-tolerant repair model is introduced, which returns a (minimum) data repair that satisfies at least one variant of the constraints (with constraint variation no greater than θ compared to the given constraints). To efficiently repair data among various constraint variants, we propose a single round, sharing enabled approach. Results on real data sets demonstrate that our proposal can capture more accurate data repairs compared to the existing methods with/without constraint repairs.

References

  1. L. E. Bertossi, L. Bravo, E. Franconi, and A. Lopatenko. The complexity and approximation of fixing numerical attributes in databases under integrity constraints. Inf. Syst., 33(4--5):407--434, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In ICDE, pages 541--552, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746--755, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  4. P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD Conference, pages 143--154, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. F. Chiang and R. J. Miller. A unified model for data and constraint repair. In ICDE, pages 446--457, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput., 197(1--2):90--121, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498--1509, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. W. Fan. Dependencies revisited for improving data quality. In PODS, pages 159--170, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. Golab, H. J. Karloff, F. Korn, A. Saha, and D. Srivastava. Sequential dependencies. PVLDB, 2(1):574--585, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. L. Golab, H. J. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. PVLDB, 1(1):376--390, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. Karmarkar. A new polynomial-time algorithm for linear programming. In STOC, pages 302--311, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theor. Comput. Sci., 149(1):129--149, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, and J. Han. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In SIGMOD Conference, pages 1187--1198, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Lopatenko and L. Bravo. Efficient approximation algorithms for repairing inconsistent databases. In ICDE, pages 216--225, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  17. G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31--88, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. Song, A. Zhang, J. Wang, and P. S. Yu. SCREEN: stream data cleaning under speed constraints. In SIGMOD, pages 827--841, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. V. V. Vazirani. Approximation algorithms. Springer, 2001. Google ScholarGoogle ScholarCross RefCross Ref
  21. J. Wijsen. Database repairing using updates. ACM Trans. Database Syst., 30(3):722--768, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Constraint-Variance Tolerant Data Repairing

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
      June 2016
      2300 pages
      ISBN:9781450335317
      DOI:10.1145/2882903

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 14 June 2016

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader