DOI: 10.1145/3464509.3464888
research-article

Leveraging Approximate Constraints for Localized Data Error Detection

Published: 20 June 2021

ABSTRACT

Error detection is key for data quality management. AI techniques can leverage user domain knowledge to identify sets of erroneous records that conflict with that knowledge. To represent a wide range of user domain knowledge, several recent papers have developed and used soft approximate constraints (ACs), which a data relation is expected to satisfy only to a certain degree rather than completely. We introduce error localization, a new AI-based technique for enhancing error detection with ACs.

Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient optimization algorithm for error localization, which identifies the distinct error regions that violate a given AC the most, using a recursive tree partitioning scheme. The tree representation describes different error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003). This helps to explain to the user why some records were identified as likely errors.
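The idea of an error region can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the AC types or the optimization objective, so the sketch assumes the AC is an approximate functional dependency (zip determines city), measures its violation degree as the fraction of records that would have to be removed for the dependency to hold exactly, and uses an exhaustive threshold search as a stand-in for the efficient optimization. The names `violation_degree`, `localize`, `depth`, and `min_size`, as well as the toy records, are hypothetical.

```python
from collections import Counter, defaultdict


def violation_degree(records, lhs, rhs):
    """Fraction of records to remove so that lhs -> rhs holds exactly (assumed AC measure)."""
    if not records:
        return 0.0
    groups = defaultdict(Counter)
    for r in records:
        groups[r[lhs]][r[rhs]] += 1
    # In each lhs group, keep only the records carrying the most frequent rhs value.
    kept = sum(max(c.values()) for c in groups.values())
    return 1.0 - kept / len(records)


def localize(records, lhs, rhs, split_attrs, depth=1, min_size=3):
    """Recursively partition the records; return (description, records, degree)
    leaves, sorted so that the strongest violators come first."""

    def recurse(recs, conds, d):
        base = violation_degree(recs, lhs, rhs)
        best = None  # (score, attr, threshold, left, right)
        if d > 0:
            for attr in split_attrs:
                for t in sorted({r[attr] for r in recs})[1:]:
                    left = [r for r in recs if r[attr] < t]
                    right = [r for r in recs if r[attr] >= t]
                    if min(len(left), len(right)) < min_size:
                        continue
                    score = max(violation_degree(left, lhs, rhs),
                                violation_degree(right, lhs, rhs))
                    # Split only if a child violates the AC more than its parent.
                    if score > base and (best is None or score > best[0]):
                        best = (score, attr, t, left, right)
        if best is None:
            return [(" and ".join(conds) or "all records", recs, base)]
        _, attr, t, left, right = best
        return (recurse(left, conds + [f"{attr} < {t}"], d - 1)
                + recurse(right, conds + [f"{attr} >= {t}"], d - 1))

    return sorted(recurse(records, [], depth), key=lambda leaf: -leaf[2])


# Toy data (made up): the conflicting zip/city pairs all occur before 2003.
rows = [
    {"year": 2001, "zip": "A", "city": "X"},
    {"year": 2001, "zip": "A", "city": "Y"},
    {"year": 2002, "zip": "B", "city": "P"},
    {"year": 2002, "zip": "B", "city": "Q"},
    {"year": 2003, "zip": "A", "city": "X"},
    {"year": 2004, "zip": "A", "city": "X"},
    {"year": 2005, "zip": "B", "city": "P"},
    {"year": 2006, "zip": "B", "city": "P"},
]

regions = localize(rows, "zip", "city", ["year"])
for desc, recs, degree in regions:
    print(desc, len(recs), degree)
```

On this toy data the whole relation violates the dependency to degree 0.25, while the leaf described as `year < 2003` violates it to degree 0.5 and the leaf `year >= 2003` not at all. The leaf descriptions are conjunctions of attribute tests, which is what makes a tree-based region readable to a user.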

After identifying error regions, we apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, based on four datasets containing both real-world and synthetic errors, shows that error localization increases both the accuracy and the speed of AC-based error detection.
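Running a detector region by region might look like the following Python sketch. The detector shown, which flags records that disagree with the majority value in their group, is a simple placeholder chosen for illustration and is not one of the detection methods evaluated in the paper; `flag_minority`, `detect_per_region`, and the example regions are hypothetical.

```python
from collections import Counter, defaultdict


def flag_minority(records, lhs, rhs):
    """Placeholder detector: flag records whose rhs value differs from the
    most frequent rhs value among records sharing the same lhs value."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[lhs]][r[rhs]] += 1
    majority = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    return [r for r in records if r[rhs] != majority[r[lhs]]]


def detect_per_region(regions, lhs, rhs):
    """Apply the detector to each region on its own instead of to the full dataset."""
    return {desc: flag_minority(recs, lhs, rhs) for desc, recs in regions.items()}


# Hypothetical regions, keyed by their user-readable description.
regions = {
    "year < 2003": [
        {"zip": "A", "city": "X"},
        {"zip": "A", "city": "X"},
        {"zip": "A", "city": "Y"},
        {"zip": "B", "city": "P"},
    ],
    "year >= 2003": [
        {"zip": "A", "city": "X"},
        {"zip": "B", "city": "P"},
        {"zip": "B", "city": "P"},
    ],
}

flagged = detect_per_region(regions, "zip", "city")
for desc, suspects in flagged.items():
    print(desc, suspects)
```

Because each call sees only one region, the detector handles fewer records per run, and every flagged record comes with the description of the region it was found in.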


  • Published in

    cover image ACM Conferences
    aiDM '21: Fourth Workshop in Exploiting AI Techniques for Data Management
    June 2021
    44 pages
    ISBN:9781450385350
    DOI:10.1145/3464509

    Copyright © 2021 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 19 of 26 submissions, 73%
