Dependable Data Repairing with Fixing Rules

Published: 30 June 2017

Abstract

One of the main challenges that data-cleaning systems face is to automatically identify and repair data errors in a dependable manner. Though data dependencies (also known as integrity constraints) have been widely studied to capture errors in data, automated and dependable repair of these errors has remained a notoriously difficult problem. In this work, we introduce an automated approach for dependably repairing data errors, based on a novel class of fixing rules. A fixing rule contains an evidence pattern, a set of negative patterns, and a fact value. The heart of fixing rules is deterministic: given a tuple, the evidence pattern and the negative patterns of a fixing rule are combined to precisely capture which attribute is wrong, and the fact indicates how to correct this error. We study several fundamental problems associated with fixing rules and establish their complexity. We develop efficient algorithms to check whether a set of fixing rules is consistent and discuss approaches to resolving inconsistent fixing rules. We also devise efficient algorithms for repairing data errors using fixing rules. Moreover, we discuss approaches to generating a large number of fixing rules from examples or available knowledge bases. Using both real-life and synthetic data, we experimentally demonstrate that our techniques outperform other automated algorithms in terms of the accuracy of repairing data errors.
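The three parts of a fixing rule named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class `FixingRule`, its field names, and the country/capital values are hypothetical, chosen only to show how an evidence pattern and a set of negative patterns together decide whether the fact value is written.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass
class FixingRule:
    """Hypothetical representation of one fixing rule (names are assumptions)."""
    evidence: Dict[str, str]   # evidence pattern: attribute -> required value
    target: str                # attribute the rule is allowed to correct
    negatives: FrozenSet[str]  # negative patterns: known-wrong target values
    fact: str                  # fact value written when the rule fires

    def matches(self, row: Dict[str, str]) -> bool:
        # The rule fires only if every evidence attribute agrees AND the
        # target currently holds one of the known-wrong values.
        evidence_ok = all(row.get(a) == v for a, v in self.evidence.items())
        return evidence_ok and row.get(self.target) in self.negatives

    def apply(self, row: Dict[str, str]) -> Dict[str, str]:
        # Return a repaired copy; rows the rule does not match are unchanged.
        if not self.matches(row):
            return row
        repaired = dict(row)
        repaired[self.target] = self.fact
        return repaired


# Illustrative rule and data (made up for this sketch).
rule = FixingRule(
    evidence={"country": "China"},
    target="capital",
    negatives=frozenset({"Shanghai", "Hongkong"}),
    fact="Beijing",
)

dirty = {"name": "Ann", "country": "China", "capital": "Shanghai"}
unknown = {"name": "Bob", "country": "China", "capital": "Tokyo"}
other = {"name": "Cy", "country": "Canada", "capital": "Shanghai"}

print(rule.apply(dirty))
print(rule.apply(unknown))
print(rule.apply(other))
```

In this sketch the rule stays conservative: a target value outside the negative patterns is left alone, as is any row that fails the evidence pattern, so a repair is made only when both conditions hold.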

    Published In

    Journal of Data and Information Quality  Volume 8, Issue 3-4
    Challenge Papers, Experience Paper and Research Papers
    July 2017
    114 pages
    ISSN:1936-1955
    EISSN:1936-1963
    DOI:10.1145/3120924
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 June 2017
    Accepted: 01 January 2017
    Revised: 01 September 2016
    Received: 01 November 2015
    Published in JDIQ Volume 8, Issue 3-4

    Author Tags

    1. Data repairing
    2. dependable
    3. fixing rules

    Qualifiers

    • Research-article
    • Research
    • Refereed
