Abstract
Assessing and improving the quality of data stored in information systems are both important and difficult tasks. For a growing number of companies that rely on information as one of their most important assets, enforcing high data quality levels is a strategic investment aimed at preserving the value of those assets. For a public administration or a government, good data quality translates into good service and good relationships with citizens. Achieving high quality standards, however, is a major task because of the many ways in which errors can be introduced into a system and the difficulty of correcting them systematically. Problems with data quality tend to fall into two categories. The first is inconsistency among systems, such as format, syntax, and semantic inconsistencies. The second is inconsistency with reality, as exemplified by missing, obsolete, and incorrect data values and by outliers.
In this paper, we describe a real-life case study on assessing and improving the quality of data in the Italian Public Administration. The study focuses on taxpayer data maintained by the Italian Ministry of Finances. In this context, we provide the Administration with a quantitative assessment of specific problems such as record duplication and address mismatch and obsolescence, we suggest a set of guidelines for setting precise quality improvement goals, and we illustrate analysis techniques for achieving those goals. Our guidelines emphasize the importance of data flow analysis and of the definition of measurable quality indicators. The quality indicators that we propose are generic and can be used to describe a variety of data quality problems, thus representing a possible reference framework for practitioners. Finally, we investigate ways to partially automate the analysis of the causes of poor data quality.
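As a rough illustration of what a measurable quality indicator can look like, the sketch below computes two simple ratios over a small set of records: the share of records that duplicate an earlier one, and the share with a missing address. It is not the set of indicators defined in the paper. The field names (`name`, `birth_date`, `address`), the normalization rule, and the sample records are all hypothetical and chosen only for the example.

```python
from collections import Counter


def normalize(value):
    """Lowercase and collapse whitespace; None or empty input becomes ''."""
    return " ".join(value.lower().split()) if value else ""


def duplicate_rate(records, key_fields):
    """Fraction of records whose normalized key repeats an earlier record.

    Assumption: two records are duplicates when all key_fields match
    exactly after normalization (no approximate matching).
    """
    if not records:
        return 0.0
    keys = Counter(
        tuple(normalize(r.get(f)) for f in key_fields) for r in records
    )
    extras = sum(count - 1 for count in keys.values())
    return extras / len(records)


def missing_rate(records, field):
    """Fraction of records where the field is absent, None, or blank."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if not normalize(r.get(field)))
    return missing / len(records)


# Hypothetical sample data, invented for illustration only.
records = [
    {"name": "Mario Rossi", "birth_date": "1960-01-15", "address": "Via Roma 1, Milano"},
    {"name": "mario  rossi", "birth_date": "1960-01-15", "address": ""},
    {"name": "Anna Bianchi", "birth_date": "1972-07-03", "address": "Corso Italia 8, Torino"},
    {"name": "Luca Verdi", "birth_date": "1985-11-30", "address": None},
]

dup = duplicate_rate(records, ["name", "birth_date"])
miss = missing_rate(records, "address")
print(f"duplicate rate: {dup:.2f}, missing address rate: {miss:.2f}")
```

Expressing each indicator as a ratio between 0 and 1 makes it possible to state an improvement goal as a target value and to track it over time. Exact key matching is the simplest possible choice here; real duplicate detection generally needs approximate matching.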
Cite this article
Missier, P., Lalk, G., Verykios, V. et al. Improving Data Quality in Practice: A Case Study in the Italian Public Administration. Distributed and Parallel Databases 13, 135–160 (2003). https://doi.org/10.1023/A:1021548024224
DOI: https://doi.org/10.1023/A:1021548024224