ABSTRACT
Integrity constraints, which guide the cleaning of dirty data, are often imprecise themselves. Existing studies consider inaccurate constraints that are oversimplified, and thus refine them by inserting more predicates (attributes). We note that imprecise constraints may be not only oversimplified, so that correct data are erroneously identified as violations, but also overrefined, so that the constraints overfit the data and fail to identify true violations. In the latter case, the excessive predicates should be deleted.
To address both oversimplified and overrefined constraints, we propose to repair data by allowing a small variation (with both predicate insertion and deletion) on the constraints. We introduce a novel θ-tolerant repair model, which returns a (minimum) data repair that satisfies at least one variant of the constraints, where the variation of that variant from the given constraints is no greater than θ. To efficiently repair data among the various constraint variants, we propose a single-round, sharing-enabled approach. Results on real data sets demonstrate that our proposal captures more accurate data repairs than existing methods with or without constraint repairs.
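The idea of constraint variants within a tolerance θ can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a constraint is a denial-style set of (attribute, operator) predicates over pairs of tuples, and that variation is simply the number of inserted or deleted predicates, which may differ from the paper's cost function. The names `constraint_variants`, `violations`, and the toy table are hypothetical, and the sketch only enumerates variants and counts violations; it does not compute a minimum repair.

```python
from itertools import combinations

# Assumed predicate form: (attribute, operator), comparing two rows.
OPS = {"=": lambda a, b: a == b, "!=": lambda a, b: a != b}

def constraint_variants(base, pool, theta):
    """Yield predicate sets at most `theta` insertions/deletions from `base`.

    Assumption: variation = number of predicates inserted (taken from
    `pool`) plus number deleted (taken from `base`).
    """
    base = frozenset(base)
    insertable = frozenset(pool) - base
    editable = sorted(base) + sorted(insertable)
    for k in range(theta + 1):
        for changes in combinations(editable, k):
            # Toggling a base predicate deletes it; toggling a pool one inserts it.
            variant = base.symmetric_difference(changes)
            if variant:  # skip the empty constraint
                yield variant

def violations(rows, constraint):
    """Row pairs on which every predicate holds (denial-constraint semantics)."""
    return [
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if all(OPS[op](rows[i][attr], rows[j][attr]) for attr, op in constraint)
    ]

# Toy data: the given constraint forbids "same name, different zip".
rows = [
    {"name": "Ann", "city": "Paris", "zip": "75001"},
    {"name": "Ann", "city": "Lyon", "zip": "69001"},
    {"name": "Bob", "city": "Paris", "zip": "75001"},
]
base = {("name", "="), ("zip", "!=")}
pool = {("city", "=")}  # candidate predicate that may be inserted

if __name__ == "__main__":
    for variant in constraint_variants(base, pool, theta=1):
        print(sorted(variant), violations(rows, variant))
```

In this toy table, inserting the `city` predicate removes the violation between the two "Ann" rows, which corresponds to the oversimplified case, while deleting the `name` predicate flags an additional pair, which corresponds to the overrefined case.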
Index Terms
- Constraint-Variance Tolerant Data Repairing