research-article

Leveraging Approximate Constraints for Localized Data Error Detection

Authors:
Mohan Zhang, Simon Fraser University, Canada
Oliver Schulte, Simon Fraser University, Canada
Yudong Luo, Simon Fraser University, Canada

aiDM '21: Fourth Workshop in Exploiting AI Techniques for Data Management, June 2021, Pages 36–44, https://doi.org/10.1145/3464509.3464888

Published: 20 June 2021

ABSTRACT

Error detection is key for data quality management. AI techniques can leverage user domain knowledge to identify sets of erroneous records that conflict with that knowledge. To represent a wide range of user domain knowledge, several recent papers have developed and utilized soft approximate constraints (ACs) that a data relation is expected to satisfy only to a certain degree, rather than completely. We introduce error localization, a new AI-based technique for enhancing error detection with ACs.
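To make "satisfied only to a certain degree" concrete, here is a minimal sketch for one kind of soft constraint, an approximate functional dependency. It scores a relation by the fraction of records that would have to be removed for the dependency to hold exactly (often called the g3 measure). The choice of constraint type, the function name `fd_violation_degree`, and the `zip`/`city` columns are assumptions made for illustration; the paper's own AC definitions may differ.

```python
from collections import Counter, defaultdict

def fd_violation_degree(rows, lhs, rhs):
    """Fraction of rows that must be dropped so that lhs -> rhs holds exactly."""
    if not rows:
        return 0.0
    # For each lhs value, count how often each rhs value occurs.
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Keeping only the most frequent rhs value per group satisfies the dependency.
    kept = sum(max(counter.values()) for counter in groups.values())
    return 1.0 - kept / len(rows)

# Hypothetical toy relation: one of four records disagrees with its zip group.
records = [
    {"zip": "A", "city": "X"},
    {"zip": "A", "city": "X"},
    {"zip": "A", "city": "Y"},
    {"zip": "B", "city": "Z"},
]
print(fd_violation_degree(records, "zip", "city"))  # 0.25
```

A score of 0 means the constraint holds exactly; larger values mean the relation satisfies it to a lesser degree.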

Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient optimization algorithm for error localization: identifying distinct error regions that violate a given AC the most, based on a recursive tree partitioning scheme. The tree representation describes different error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003). This helps to explain to the user why some records were identified as likely errors.
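The recursive partitioning idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the violation score (an approximate functional dependency `zip -> city`), the split criterion (pick the split whose worse child violates the constraint most), the stopping rules (`min_size`, `max_depth`), and all function names are hypothetical choices.

```python
from collections import Counter, defaultdict

def violation(rows, lhs="zip", rhs="city"):
    """Assumed AC score: share of rows that must go for lhs -> rhs to hold."""
    if not rows:
        return 0.0
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[lhs]][r[rhs]] += 1
    kept = sum(max(c.values()) for c in groups.values())
    return 1.0 - kept / len(rows)

def best_split(rows, attrs, score, min_size):
    """Try each 'attr < t' split; keep the one whose worse child scores highest."""
    best = None
    for attr in attrs:
        for t in sorted({r[attr] for r in rows})[1:]:
            left = [r for r in rows if r[attr] < t]
            right = [r for r in rows if r[attr] >= t]
            if len(left) < min_size or len(right) < min_size:
                continue
            worst = max(score(left), score(right))
            if best is None or worst > best[0]:
                best = (worst, attr, t, left, right)
    return best

def localize(rows, attrs, score, min_size=3, max_depth=3, path=()):
    """Return tree leaves as (path, score, rows); path is a tuple of readable tests."""
    here = score(rows)
    split = best_split(rows, attrs, score, min_size) if max_depth > 0 else None
    # Stop when no admissible split isolates a region worse than this node.
    if split is None or split[0] <= here:
        return [(path, here, rows)]
    _, attr, t, left, right = split
    return (localize(left, attrs, score, min_size, max_depth - 1, path + (f"{attr} < {t}",))
            + localize(right, attrs, score, min_size, max_depth - 1, path + (f"{attr} >= {t}",)))

def error_regions(rows, attrs, score, **options):
    """Leaves that violate the constraint more than the data as a whole, worst first."""
    overall = score(rows)
    leaves = localize(rows, attrs, score, **options)
    return sorted((leaf for leaf in leaves if leaf[1] > overall), key=lambda leaf: -leaf[1])

# Hypothetical toy data: conflicting zip/city pairs are concentrated before 2003.
data = [
    {"year": 2000, "zip": "A", "city": "X"}, {"year": 2001, "zip": "A", "city": "Y"},
    {"year": 2001, "zip": "B", "city": "Z"}, {"year": 2002, "zip": "B", "city": "W"},
    {"year": 2003, "zip": "A", "city": "X"}, {"year": 2005, "zip": "A", "city": "X"},
    {"year": 2006, "zip": "B", "city": "Z"}, {"year": 2007, "zip": "B", "city": "Z"},
]
for path, degree, members in error_regions(data, ["year"], violation):
    print(" and ".join(path), degree, len(members))  # year < 2003 0.5 4
```

On this toy data the whole relation has a violation degree of 0.25, while the leaf described by `year < 2003` has 0.5, so it is reported as an error region. The path of attribute tests is what gives the user a readable description of the region.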

After identifying error regions, we apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, based on four datasets containing both real-world and synthetic errors, shows that error localization increases both the accuracy and the speed of error detection based on ACs.
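A small sketch of region-wise detection, again under assumptions: the detector below is a toy majority-vote rule for a dependency `city -> code`, not one of the detection methods evaluated in the paper, and the two regions are given by hand rather than produced by error localization. The toy data assumes that the prevailing code for city X changed around 2003.

```python
from collections import Counter, defaultdict

def flag_minority(rows, lhs, rhs):
    """Toy detector: flag rows whose rhs value is not the most common for their lhs value."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[lhs]][r[rhs]] += 1
    majority = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    return {r["id"] for r in rows if r[rhs] != majority[r[lhs]]}

def detect_by_region(regions, detector):
    """Run the detector on each region separately and merge the flagged record ids."""
    flagged = set()
    for region in regions:
        flagged |= detector(region)
    return flagged

# Hypothetical records: ids 3 and 7 are the intended errors.
before_2003 = [
    {"id": 1, "city": "X", "code": 604}, {"id": 2, "city": "X", "code": 604},
    {"id": 3, "city": "X", "code": 778},
]
from_2003 = [
    {"id": 4, "city": "X", "code": 778}, {"id": 5, "city": "X", "code": 778},
    {"id": 6, "city": "X", "code": 778}, {"id": 7, "city": "X", "code": 604},
]
detector = lambda rows: flag_minority(rows, "city", "code")

print(sorted(detector(before_2003 + from_2003)))                     # [1, 2, 7]
print(sorted(detect_by_region([before_2003, from_2003], detector)))  # [3, 7]
```

Run on the whole dataset, the toy detector flags the two correct older records and misses record 3; run per region, it flags exactly the two intended errors. Each per-region call also processes fewer records than a single global call.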

Published in
aiDM '21: Fourth Workshop in Exploiting AI Techniques for Data Management
June 2021
44 pages
ISBN: 9781450385350
DOI: 10.1145/3464509
Editors: Rajesh Bordawekar (IBM T. J. Watson Research Center), Yael Amsterdamer (Department of Computer Science, Bar-Ilan University), Oded Shmueli (Technion), Nesime Tatbul (MIT and Intel Labs)
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 19 of 26 submissions, 73%