ABSTRACT
Error detection is key for data quality management. AI techniques can leverage user domain knowledge to identify sets of erroneous records that conflict with that knowledge. To represent a wide range of user domain knowledge, several recent papers have developed and utilized soft approximate constraints (ACs) that a data relation is expected to satisfy only to a certain degree, rather than completely. We introduce error localization, a new AI-based technique for enhancing error detection with ACs.
Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient optimization algorithm for error localization: identifying distinct error regions that violate a given AC the most, based on a recursive tree partitioning scheme. The tree representation describes different error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003). This helps to explain to the user why some records were identified as likely errors.
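The idea of localizing error regions by recursive partitioning can be illustrated with a minimal sketch. It is not the algorithm from the paper: the violation measure (fraction of records that must be removed for a functional dependency to hold), the split score (the larger of the two child violation degrees), the stopping rule, and all function names and sample records are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def fd_violation(records, lhs, rhs):
    """Assumed violation degree for the approximate constraint lhs -> rhs:
    the fraction of records that must be removed for it to hold exactly."""
    if not records:
        return 0.0
    groups = defaultdict(Counter)
    for r in records:
        groups[r[lhs]][r[rhs]] += 1
    kept = sum(max(c.values()) for c in groups.values())
    return 1.0 - kept / len(records)

def best_split(records, split_attrs, violation, min_size):
    """Try every threshold on every split attribute; keep the split whose
    more-violating child has the highest violation degree."""
    best = None
    for attr in split_attrs:
        values = sorted({r[attr] for r in records})
        for t in values[1:]:
            left = [r for r in records if r[attr] < t]
            right = [r for r in records if r[attr] >= t]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = max(violation(left), violation(right))
            if best is None or score > best[0]:
                best = (score, attr, t, left, right)
    return best

def localize(records, split_attrs, violation, max_depth=2, min_size=2, path=()):
    """Recursively partition the data; return leaves as
    (description, records, violation degree)."""
    here = violation(records)
    split = best_split(records, split_attrs, violation, min_size) if max_depth > 0 else None
    # Stop when no child violates the constraint more than its parent does.
    if split is None or split[0] <= here:
        return [(" AND ".join(path) or "all records", records, here)]
    _, attr, t, left, right = split
    return (localize(left, split_attrs, violation, max_depth - 1, min_size,
                     path + (f"{attr} < {t}",))
            + localize(right, split_attrs, violation, max_depth - 1, min_size,
                       path + (f"{attr} >= {t}",)))

def error_regions(records, split_attrs, violation, **kwargs):
    """Leaves that violate the constraint more than the data as a whole,
    most violating first."""
    base = violation(records)
    leaves = localize(records, split_attrs, violation, **kwargs)
    return sorted((leaf for leaf in leaves if leaf[2] > base), key=lambda leaf: -leaf[2])

# Hypothetical sample data: zip -> city holds after 2004 but not before.
records = [
    {"year": 2000, "zip": "A", "city": "X"},
    {"year": 2001, "zip": "B", "city": "Z"},
    {"year": 2002, "zip": "A", "city": "Y"},
    {"year": 2002, "zip": "B", "city": "W"},
    {"year": 2005, "zip": "A", "city": "X"},
    {"year": 2006, "zip": "A", "city": "X"},
    {"year": 2007, "zip": "B", "city": "Z"},
    {"year": 2008, "zip": "B", "city": "Z"},
]

def zip_city_violation(rs):
    return fd_violation(rs, "zip", "city")

regions = error_regions(records, ["year"], zip_city_violation)
for description, members, degree in regions:
    print(description, len(members), degree)
```

Each region is described by the attribute tests on the path from the root to its leaf, which is what makes the result readable as a statement such as "all records before a given year".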
After identifying error regions, we apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, based on four datasets containing both real-world and synthetic errors, shows that error localization increases both the accuracy and the speed of AC-based error detection.
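Running a detector per region rather than over the whole dataset can be sketched as follows. The detector here is a stand-in chosen for illustration (it flags records whose value disagrees with the majority for their group inside the region); the paper's actual detection methods are not specified in this abstract, and the function names, record fields, and sample regions are all hypothetical.

```python
from collections import Counter, defaultdict

def minority_detector(records, lhs, rhs):
    """Assumed detector: flag the ids of records whose rhs value is not the
    most frequent rhs value among records sharing the same lhs value."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[lhs]][r[rhs]] += 1
    majority = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    return {r["id"] for r in records if r[rhs] != majority[r[lhs]]}

def detect_per_region(regions, detector):
    """Apply the detector to each error region on its own and merge the results."""
    flagged = set()
    for region in regions:
        flagged |= detector(region)
    return flagged

# Hypothetical error regions, e.g. produced by an earlier localization step.
region_before_2003 = [
    {"id": 0, "zip": "A", "city": "X"},
    {"id": 1, "zip": "A", "city": "X"},
    {"id": 2, "zip": "A", "city": "Y"},
]
region_one_location = [
    {"id": 3, "zip": "B", "city": "Z"},
    {"id": 4, "zip": "B", "city": "W"},
    {"id": 5, "zip": "B", "city": "Z"},
]

flagged = detect_per_region(
    [region_before_2003, region_one_location],
    lambda rs: minority_detector(rs, "zip", "city"),
)
print(sorted(flagged))
```

Because the detector only sees the records inside each region, it works on less data per call and its notion of the majority value is local to that region.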