DOI: 10.1145/3318464.3380568
research-article

SCODED: Statistical Constraint Oriented Data Error Detection

Published: 31 May 2020

Abstract

Statistical Constraints (SCs) play an important role in statistical modeling and analysis. This paper brings the concept to data cleaning and studies how to leverage SCs for error detection. SCs provide a novel approach that has various application scenarios and works harmoniously with downstream statistical modeling. Entailment relationships between SCs and integrity constraints provide analytical insight into SCs. We develop SCODED, an SC-Oriented Data Error Detection system, comprising two key components: (1) SC Violation Detection, which checks whether an SC is violated on a given dataset, and (2) Error Drill Down, which identifies the top-k records that contribute most to the violation of an SC. Experiments on synthetic and real-world data show that SCs are effective in detecting data errors that violate them, compared to state-of-the-art approaches.
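The two components named in the abstract can be illustrated with a small sketch. It is not the SCODED algorithm, which the abstract does not specify. It assumes the SC is "column X is independent of column Y" over two binary columns, uses a Pearson chi-squared test for violation detection, and ranks records by a simple leave-one-out drop in the test statistic for the drill-down. The function names `chi2_statistic`, `independence_violated` and `top_k_contributors` and the sample data are hypothetical.

```python
from collections import Counter
import math


def chi2_statistic(pairs):
    """Pearson chi-squared statistic for a list of (x, y) categorical pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    stat = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # cell count expected under independence
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat


def independence_violated(pairs, alpha=0.05):
    """Violation check (assumed form): reject "X independent of Y" at level alpha.

    Only two binary columns are handled. The statistic then has 1 degree of
    freedom, so the p-value is P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    """
    if len({x for x, _ in pairs}) != 2 or len({y for _, y in pairs}) != 2:
        raise ValueError("this sketch only handles two binary columns")
    p_value = math.erfc(math.sqrt(chi2_statistic(pairs) / 2))
    return p_value < alpha


def top_k_contributors(pairs, k):
    """Drill-down (assumed form): indices of the k records whose individual
    removal lowers the chi-squared statistic the most."""
    base = chi2_statistic(pairs)
    drop = {}
    for value in set(pairs):  # records with equal values contribute equally
        rest = list(pairs)
        rest.remove(value)  # remove a single record carrying this value
        drop[value] = base - chi2_statistic(rest)
    order = sorted(range(len(pairs)), key=lambda i: drop[pairs[i]], reverse=True)
    return order[:k]


# Hypothetical data: independent counts (40, 40, 10, 10) plus 15 extra (1, 1) rows.
records = [(0, 0)] * 40 + [(0, 1)] * 40 + [(1, 0)] * 10 + [(1, 1)] * 25
print(independence_violated(records))
print([records[i] for i in top_k_contributors(records, 3)])
```

Because records with identical values receive identical scores, this sketch cannot tell an injected row from a clean row with the same values. It only points at the value combination that pushes the data away from independence.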

Supplementary Material

MP4 File (3318464.3380568.mp4)
Presentation Video

    Published In

    SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    June 2020
    2925 pages
    ISBN:9781450367356
    DOI:10.1145/3318464
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. error detection
    2. machine learning
    3. statistical constraints

    Conference

    SIGMOD/PODS '20
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
