ABSTRACT
Databases typically contain many cross-field constraints (edits) that must be validated for each entry, and census databases can contain tens of millions of such entries. Finding the records that fail the constraints, without suggesting a correction, is a set-inclusion test. Error localisation coupled with imputation is the task of finding a minimal correction to a failing record so that it satisfies all constraints. Error localisation alone is known to be NP-complete and is therefore considered intractable. The traditional method for solving both problems is due to Fellegi and Holt, but current tools based on Fellegi-Holt require cumbersome calibration and do not scale to millions of records with hundreds of edits in the times required by national statistical agencies. In 2001, Bruni and Sassano suggested that both problems could be recast as simple propositional satisfiability problems, without using the Fellegi-Holt method. We describe how we tailored Microsoft's Z3 to handle error localisation and imputation. Experiments show that our prototype can handle a realistic scenario from the Australian Bureau of Statistics (ABS) national census data in less than 24 hours. Thus efficient error localisation and imputation for census data is now feasible using state-of-the-art SAT/SMT solvers.
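To make the task definition concrete, the following is a minimal sketch of error localisation with imputation on a toy record. It is not the Z3-based encoding described in the abstract: it is a plain brute-force search, and the schema (`DOMAINS`), the two edits (`EDITS`), and the function names are all hypothetical, chosen only for illustration. The search tries ever larger sets of fields to change, so its cost grows exponentially with the number of fields, which is in keeping with the NP-completeness noted above.

```python
from itertools import combinations, product

# Hypothetical toy schema: every field has a small finite domain.
DOMAINS = {
    "age": range(0, 121),
    "marital_status": ["single", "married", "widowed"],
    "employed": [True, False],
}

# Hypothetical edits: each returns True when the record is consistent.
EDITS = [
    # A person under 15 must be single.
    lambda r: not (r["age"] < 15 and r["marital_status"] != "single"),
    # A person under 15 cannot be employed.
    lambda r: not (r["age"] < 15 and r["employed"]),
]


def passes(record):
    """Validation only: does the record satisfy every edit?"""
    return all(edit(record) for edit in EDITS)


def localise_and_impute(record):
    """Return (changed_fields, corrected_record) changing as few fields as possible.

    Field subsets are tried in order of increasing size, so the first
    consistent candidate found changes a minimum number of fields.
    Returns None if no assignment satisfies the edits.
    """
    fields = list(DOMAINS)
    for k in range(len(fields) + 1):
        for subset in combinations(fields, k):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if passes(candidate):
                    return subset, candidate
    return None


record = {"age": 12, "marital_status": "married", "employed": True}
changed, fixed = localise_and_impute(record)
print(changed, fixed)
```

Here the record violates both edits, yet changing the single field `age` repairs it, whereas repairing it through the other fields would need two changes. A solver-based approach would express the same minimality requirement as an optimisation or soft-constraint problem instead of enumerating subsets.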
- J. A. Alonso-Jiménez, J. Borrego-Díaz, and A. M. Chávez-González. Logic Databases and Inconsistency Handling, 2005.
- J. A. Alonso-Jiménez, J. Borrego-Díaz, A. M. Chávez-González, M. A. Gutiérrez-Naranjo, and J. D. Navarro-Marín. Towards a practical argumentative reasoning with qualitative spatial databases. In Developments in Applied Artificial Intelligence, pages 789--798. Springer, 2003.
- J. A. Alonso-Jiménez, J. Borrego-Díaz, A. M. Chávez-González, and F. J. Martín-Mateos. Foundational challenges in automated semantic web data and ontology cleaning. Intelligent Systems, IEEE, 21(1):42--52, 2006.
- G. Barcaroli. A formal logic approach to the problem of verification and correction of statistical data (Un approccio logico formale al problema del controllo e della correzione dei dati statistici). ISTAT, 1993.
- G. Barcaroli and M. Venturi. The probabilistic approach to automatic edit and imputation: improvements of the Fellegi-Holt methodology. Quaderni di Ricerca, 4:1997, 1997.
- R. J. Bayardo Jr and R. Schrag. Using CSP lookback techniques to solve real-world SAT instances. In AAAI/IAAI, pages 203--208, 1997.
- A. Boskovitz. Data Editing and Logic: The covering set method from the perspective of logic. PhD thesis, Australian National University, 2008.
- A. Boskovitz, R. Goré, and M. Hegland. A logical formalisation of the Fellegi-Holt method of data cleaning. In Advances in Intelligent Data Analysis V, pages 554--565. Springer, 2003.
- R. Bruni and A. Sassano. Errors detection and correction in large scale data collecting. In Advances in Intelligent Data Analysis, pages 84--94. Springer, 2001.
- R. Bruni and A. Sassano. Logic and optimization techniques for an error free data collecting. Report, University of Rome "La Sapienza", 2001.
- J. H. Conway and R. K. Guy. Sets of natural numbers with distinct subset sums. Notices Amer. Math. Soc, 15:345, 1968.
- L. De Moura and N. Bjørner. Z3: An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, pages 337--340. Springer, 2008.
- T. de Waal. An overview of statistical data editing. CBS, Statistics Netherlands, 2008.
- T. de Waal and S. Scholtus. Methods for Automatic Statistical Data Editing. In 2011 KSS International Conference on Statistics and Probability, Busan, 2011.
- N. Eén and N. Sörensson. Translating pseudo-boolean constraints into SAT. Journal on Satisfiability, Boolean Modeling and Computation, 2:1--26, 2006.
- W. Fan. Dependencies revisited for improving data quality. In Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 159--170. ACM, 2008.
- I. P. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71(353):17--35, 1976.
- W. McCune. Release of Prover9. In Mile High Conference on Quasigroups, Loops and Nonassociative Systems, Denver, Colorado, 2005.
- W. W. McCune. Otter 3.0 reference manual and guide, volume 9700. Argonne National Laboratory Argonne, USA, 1994.
- M. R. Garey and D. S. Johnson. Computers and intractability: a guide to the theory of NP-completeness. WH Freeman & Co., San Francisco, 1979.
- J. Pannekoek, S. Scholtus, and M. van der Loo. Automated and manual data editing: a view on process design and methodology. Journal of Official Statistics, 29(4):511--537, 2013.
- V. Raman, B. Ravikumar, and S. S. Rao. A simplified NP-complete MAXSAT problem. Information Processing Letters, 65(1):1--6, 1998.
- A. Riazanov and A. Voronkov. Vampire 1.1. In Automated Reasoning, pages 376--380. Springer, 2001.
- J. A. Robinson. A machine-oriented logic based on the resolution principle. Journal of the ACM (JACM), 12(1):23--41, 1965.
- S. Scholtus. Automatic editing with hard and soft edits. Survey Methodology, 39:59--89, 2013.
- B. M. Smith and M. E. Dyer. Locating the phase transition in binary constraint satisfaction problems. Artificial Intelligence, 81(1):155--181, 1996.
- W. E. Winkler. Data Quality: Automated Edit/Imputation and Record Linkage. Statistics, page 7, 2006.