DOI: 10.1145/2843043.2843052
research-article

Efficient error localisation and imputation for real-world census data using SMT

Published: 01 February 2016

ABSTRACT

Databases typically contain many cross-field constraints (edits) which must be validated for each entry. Census databases can contain tens of millions of such entries. Error localisation is the task of finding records which fail the constraints, without suggesting a correction. Error localisation coupled with imputation is the task of finding a minimal correction to a failing record so that it satisfies all constraints. The error-localisation problem alone is already intractable in the worst case, since it is known to be NP-complete. The traditional method for solving both problems is due to Fellegi and Holt, but current tools based on Fellegi-Holt require cumbersome calibration and do not scale to millions of records with hundreds of edits within the times required by national statistical agencies. In 2001, Bruni and Sassano suggested that both problems could be recast as simple propositional satisfiability problems, without using the Fellegi-Holt method. We describe how we tailored Microsoft's Z3 to handle error localisation and imputation. Experiments show that our prototype can handle a realistic scenario from the Australian Bureau of Statistics (ABS) national census data in less than 24 hours. Thus efficient error localisation and imputation for census data is now feasible using state-of-the-art SAT/SMT solvers.
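
To make the problem statement concrete, the following is a minimal sketch of what "a minimal correction to a failing record" means. It is not the paper's method: it uses plain brute-force enumeration rather than Z3 or any SAT/SMT encoding, and the field names, domains, and edits are hypothetical toy values chosen only for illustration.

```python
from itertools import combinations, product

# Hypothetical toy schema: every field has a small finite domain.
DOMAINS = {
    "age": range(0, 121),
    "marital_status": ["single", "married", "widowed"],
    "employment": ["child", "employed", "retired"],
}

# Hypothetical edits (cross-field constraints). A record is valid
# only if every edit returns True.
EDITS = [
    lambda r: not (r["age"] < 16 and r["marital_status"] != "single"),
    lambda r: not (r["age"] < 16 and r["employment"] != "child"),
    lambda r: not (r["age"] >= 16 and r["employment"] == "child"),
]


def is_valid(record):
    """Check a record against all edits."""
    return all(edit(record) for edit in EDITS)


def localise_and_impute(record):
    """Return (fields_changed, corrected_record) changing as few fields as possible.

    Subsets of fields are tried in order of increasing size, so the first
    valid candidate found changes a minimum number of fields.
    """
    fields = list(DOMAINS)
    for k in range(len(fields) + 1):
        for subset in combinations(fields, k):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if is_valid(candidate):
                    return subset, candidate
    return None  # no valid record exists under these edits


record = {"age": 8, "marital_status": "married", "employment": "employed"}
changed, fixed = localise_and_impute(record)
print(changed, fixed)
```

For this toy record, changing the single field `age` is enough to satisfy all three edits, whereas keeping `age` would require changing both other fields. The enumeration grows exponentially with the number of fields, which is the practical reason a solver-based encoding is attractive at census scale.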

References

  1. J. A. Alonso-Jiménez, J. Borrego-Díaz, and A. M. Chávez-González. Logic Databases and Inconsistency Handling, 2005.
  2. J. A. Alonso-Jiménez, J. Borrego-Díaz, A. M. Chávez-González, M. A. Gutiérrez-Naranjo, and J. D. Navarro-Marín. Towards a practical argumentative reasoning with qualitative spatial databases. In Developments in Applied Artificial Intelligence, pages 789--798. Springer, 2003.
  3. J. A. Alonso-Jiménez, J. Borrego-Díaz, A. M. Chávez-González, and F. J. Martín-Mateos. Foundational challenges in automated semantic web data and ontology cleaning. Intelligent Systems, IEEE, 21(1):42--52, 2006.
  4. G. Barcaroli. A formal logic approach to the problem of verification and correction of statistical data (Un approccio logico formale al problema del controllo e della correzione dei dati statistici). ISTAT, 1993.
  5. G. Barcaroli and M. Venturi. The probabilistic approach to automatic edit and imputation: improvements of the Fellegi-Holt methodology. Quaderni di Ricerca, 4:1997, 1997.
  6. R. J. Bayardo Jr and R. Schrag. Using CSP lookback techniques to solve real-world SAT instances. In AAAI/IAAI, pages 203--208, 1997.
  7. A. Boskovitz. Data Editing and Logic: The covering set method from the perspective of logic. PhD thesis, Australian National University, 2008.
  8. A. Boskovitz, R. Goré, and M. Hegland. A logical formalisation of the Fellegi-Holt method of data cleaning. In Advances in Intelligent Data Analysis V, pages 554--565. Springer, 2003.
  9. R. Bruni and A. Sassano. Errors detection and correction in large scale data collecting. In Advances in Intelligent Data Analysis, pages 84--94. Springer, 2001.
  10. R. Bruni and A. Sassano. Logic and optimization techniques for an error free data collecting. Report, University of Rome "La Sapienza", 2001.
  11. J. H. Conway and R. K. Guy. Sets of natural numbers with distinct subset sums. Notices Amer. Math. Soc, 15:345, 1968.
  12. L. De Moura and N. Bjørner. Z3: An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, pages 337--340. Springer, 2008.
  13. T. de Waal. An overview of statistical data editing. CBS, Statistics Netherlands, 2008.
  14. T. de Waal and S. Scholtus. Methods for Automatic Statistical Data Editing. In 2011 KSS International Conference on Statistics and Probability, Busan, 2011.
  15. N. Eén and N. Sörensson. Translating pseudo-boolean constraints into SAT. Journal on Satisfiability, Boolean Modeling and Computation, 2:1--26, 2006.
  16. W. Fan. Dependencies revisited for improving data quality. In Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 159--170. ACM, 2008.
  17. I. P. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71(353):17--35, 1976.
  18. W. McCune. Release of Prover9. In Mile High Conference on Quasigroups, Loops and Nonassociative Systems, Denver, Colorado, 2005.
  19. W. W. McCune. Otter 3.0 reference manual and guide, volume 9700. Argonne National Laboratory Argonne, USA, 1994.
  20. M. R. Garey and D. S. Johnson. Computers and intractability: a guide to the theory of NP-completeness. WH Freeman & Co., San Francisco, 1979.
  21. J. Pannekoek, S. Scholtus, and M. van der Loo. Automated and manual data editing: a view on process design and methodology. Journal of Official Statistics, 29(4):511--537, 2013.
  22. V. Raman, B. Ravikumar, and S. S. Rao. A simplified NP-complete MAXSAT problem. Information Processing Letters, 65(1):1--6, 1998.
  23. A. Riazanov and A. Voronkov. Vampire 1.1. In Automated Reasoning, pages 376--380. Springer, 2001.
  24. J. A. Robinson. A machine-oriented logic based on the resolution principle. Journal of the ACM (JACM), 12(1):23--41, 1965.
  25. S. Scholtus. Automatic editing with hard and soft edits. Survey Methodology, 39:59--89, 2013.
  26. B. M. Smith and M. E. Dyer. Locating the phase transition in binary constraint satisfaction problems. Artificial Intelligence, 81(1):155--181, 1996.
  27. W. E. Winkler. Data Quality: Automated Edit/Imputation and Record Linkage. Statistics, page 7, 2006.

  • Published in

    ACSW '16: Proceedings of the Australasian Computer Science Week Multiconference
    February 2016
    654 pages
    ISBN: 9781450340427
    DOI: 10.1145/2843043

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Acceptance Rates

    ACSW '16 Paper Acceptance Rate: 77 of 172 submissions, 45%
    Overall Acceptance Rate: 204 of 424 submissions, 48%