research-article

SCODED: Statistical Constraint Oriented Data Error Detection

Authors: Jing Nathan Yan, Oliver Schulte, Reynold Cheng

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 845 - 860

https://doi.org/10.1145/3318464.3380568

Published: 31 May 2020

Abstract

Statistical Constraints (SCs) play an important role in statistical modeling and analysis. This paper brings the concept to data cleaning and studies how to leverage SCs for error detection. SCs provide a novel approach that has various application scenarios and works harmoniously with downstream statistical modeling. Entailment relationships between SCs and integrity constraints provide analytical insight into SCs. We develop SCODED, an SC-Oriented Data Error Detection system, comprising two key components: (1) SC Violation Detection: checks whether an SC is violated on a given dataset, and (2) Error Drill Down: identifies the top-k records that contribute most to the violation of an SC. Experiments on synthetic and real-world data show that SCs are effective in detecting data errors that violate them, compared to state-of-the-art approaches.
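
The two components named in the abstract can be illustrated with a small sketch. It is not SCODED's algorithm, which the abstract does not spell out. It assumes the SC is an independence constraint between two categorical columns, uses a Pearson chi-squared statistic as a stand-in violation check, and ranks records by the standardized residual of their cell as a stand-in drill-down score. The function names, the `shift`/`defect` toy data, and the critical value 3.841 (chi-squared, 1 degree of freedom, significance level 0.05) are all choices made for this illustration.

```python
from collections import Counter
from math import sqrt


def cell_stats(records):
    """Map each (x, y) cell to (observed, expected) counts,
    where expected assumes the two columns are independent."""
    n = len(records)
    joint = Counter(records)
    x_counts = Counter(x for x, _ in records)
    y_counts = Counter(y for _, y in records)
    return {
        (x, y): (joint.get((x, y), 0), nx * ny / n)
        for x, nx in x_counts.items()
        for y, ny in y_counts.items()
    }


def chi2_statistic(records):
    """Pearson chi-squared statistic over all cells."""
    return sum((o - e) ** 2 / e for o, e in cell_stats(records).values())


def violates_independence(records, critical_value):
    """Assumed violation check: statistic exceeds a caller-supplied threshold."""
    return chi2_statistic(records) > critical_value


def drill_down(records, k):
    """Assumed drill-down heuristic: return indices of the k records whose
    cell is most over-represented (largest standardized residual)."""
    stats = cell_stats(records)

    def score(rec):
        observed, expected = stats[rec]
        return (observed - expected) / sqrt(expected)

    ranked = sorted(range(len(records)), key=lambda i: (-score(records[i]), i))
    return ranked[:k]


# Hypothetical data: (shift, defect) pairs that are independent when clean.
clean = (
    [("day", "no")] * 40 + [("day", "yes")] * 10
    + [("night", "no")] * 40 + [("night", "yes")] * 10
)
# Hypothetical errors: a batch wrongly recorded as ("night", "yes").
dirty = clean + [("night", "yes")] * 15

CRITICAL_VALUE = 3.841  # chi-squared, 1 degree of freedom, alpha = 0.05

print(violates_independence(clean, CRITICAL_VALUE))
print(violates_independence(dirty, CRITICAL_VALUE))
print([dirty[i] for i in drill_down(dirty, 5)])
```

Records with identical values receive identical scores, so this sketch points to a suspicious cell rather than to individual tuples; telling the injected records apart from clean records in the same cell would require additional attributes.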

Supplementary Material

MP4 File (3318464.3380568.mp4)

Presentation Video

Index Terms

SCODED: Statistical Constraint Oriented Data Error Detection
1. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN:9781450367356

DOI:10.1145/3318464

General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)

Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)

Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '20: International Conference on Management of Data

Sponsor: SIGMOD

June 14 - 19, 2020

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Article Metrics

Total Citations: 15
Total Downloads: 543
Downloads (Last 12 months): 62
Downloads (Last 6 weeks): 4

Reflects downloads up to 19 Feb 2025

Cited By

Bian S, Ouyang X, Fan Z, Koutris P, Salakhutdinov R, Kolter Z, Heller K, Weller A, Oliver N, Scarlett J, Berkenkamp F. (2024). Naive Bayes classifiers over missing data. Proceedings of the 41st International Conference on Machine Learning, 3913-3934. Online publication date: 21-Jul-2024.
https://dl.acm.org/doi/10.5555/3692070.3692227

Ding X, Li G, Wang H, Wang C, Song Y. (2024). Time Series Data Cleaning Under Expressive Constraints on Both Rows and Columns. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3682-3695. Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00283

Zhu J, Zhao X, Sun Y, Song S, Yuan X. (2024). Relational Data Cleaning Meets Artificial Intelligence: A Survey. Data Science and Engineering. Online publication date: 20-Dec-2024.
https://doi.org/10.1007/s41019-024-00266-7

Zhang Y, Qin J, Mao R, Ji Y, Wang Y, Ali M. (2024). Adaptive Label Cleaning for Error Detection on Tabular Data. Web and Big Data, 63-78. Online publication date: 12-May-2024.
https://doi.org/10.1007/978-981-97-2421-5_5

Liu X, Wang S, Sun M, Pan S, Li G, Jha S, Yan C, Yang J, Lu S, Cheung A. (2023). Leveraging Application Data Constraints to Optimize Database-Backed Web Applications. Proceedings of the VLDB Endowment 16:6, 1208-1221. Online publication date: 1-Feb-2023.
https://dl.acm.org/doi/10.14778/3583140.3583141

Tu D, He Y, Cui W, Ge S, Zhang H, Han S, Zhang D, Chaudhuri S, Singh A, Sun Y, Akoglu L, Gunopulos D, Yan X, Kumar R, Ozcan F, Ye J. (2023). Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4991-5003. Online publication date: 6-Aug-2023.
https://dl.acm.org/doi/10.1145/3580305.3599776

Guan S, Ma H, Wang M, Wu Y. (2023). GALE: Active Adversarial Learning for Erroneous Node Detection in Graphs. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 1705-1718. Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00134

Zhang Y, Qin J, Wang Y, Ali M, Ji Y, Mao R. (2023). TabMentor: Detect Errors on Tabular Data with Noisy Labels. Advanced Data Mining and Applications, 167-182. Online publication date: 5-Nov-2023.
https://doi.org/10.1007/978-3-031-46671-7_12

Pena E, de Almeida E, Naumann F. (2022). Fast detection of denial constraint violations. Proceedings of the VLDB Endowment 15:4, 859-871. Online publication date: 14-Apr-2022.
https://dl.acm.org/doi/10.14778/3503585.3503595

Liu Y, Wu W, Flokas L, Wang J, Wu E. (2022). Enabling SQL-based training data debugging for federated learning. Proceedings of the VLDB Endowment 15:3, 388-400. Online publication date: 4-Feb-2022.
https://dl.acm.org/doi/10.14778/3494124.3494125
