Improving Data Quality in Practice: A Case Study in the Italian Public Administration

Distributed and Parallel Databases

Abstract

Assessing and improving the quality of data stored in information systems are both important and difficult tasks. For an increasing number of companies that rely on information as one of their most important assets, enforcing high data quality levels represents a strategic investment aimed at preserving the value of those assets. For a public administration or a government, good data quality translates into good service and good relationships with citizens. Achieving high quality standards, however, is a major undertaking, because errors can enter a system in many ways and are difficult to correct systematically. Problems with data quality tend to fall into two categories. The first is inconsistency among systems, such as format, syntax, and semantic inconsistencies. The second is inconsistency with reality, as exemplified by missing, obsolete, and incorrect data values and by outliers.

In this paper, we describe a real-life case study on assessing and improving the quality of data in the Italian Public Administration. The study focuses on taxpayer data maintained by the Italian Ministry of Finances. In this context, we provide the Administration with a quantitative assessment of specific problems such as record duplication, address mismatch, and address obsolescence; we suggest a set of guidelines for setting precise quality improvement goals; and we illustrate analysis techniques for achieving those goals. Our guidelines emphasize the importance of data flow analysis and of defining measurable quality indicators. The quality indicators that we propose are generic and can be used to describe a variety of data quality problems, thus representing a possible reference framework for practitioners. Finally, we investigate ways to partially automate the analysis of the causes of poor data quality.
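To make the notion of a measurable quality indicator concrete, the following is a minimal sketch, not the indicators defined in the paper. It assumes a hypothetical record layout (`name`, `birth_date`, `address`) and detects duplicates by exact match on normalized key fields, which is a deliberate simplification of approximate record matching.

```python
from collections import Counter


def normalize(value):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(value.lower().split())


def duplicate_rate(records, key_fields):
    """Fraction of records that are redundant copies of another record.

    Assumption: two records are duplicates when their normalized key
    fields match exactly (no approximate matching is attempted).
    """
    if not records:
        return 0.0
    keys = Counter(
        tuple(normalize(r.get(f) or "") for f in key_fields) for r in records
    )
    redundant = sum(count - 1 for count in keys.values())
    return redundant / len(records)


def missing_rate(records, field):
    """Fraction of records whose value for `field` is absent or blank."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if not (r.get(field) or "").strip())
    return missing / len(records)


# Hypothetical sample data, invented purely for illustration.
records = [
    {"name": "Mario Rossi", "birth_date": "1960-01-01", "address": "Via Roma 1"},
    {"name": "mario  rossi", "birth_date": "1960-01-01", "address": ""},
    {"name": "Anna Bianchi", "birth_date": "1972-05-12", "address": "Corso Italia 7"},
    {"name": "Luca Verdi", "birth_date": "1980-09-30", "address": None},
]

print(duplicate_rate(records, ["name", "birth_date"]))
print(missing_rate(records, "address"))
```

Expressing each problem as a ratio between 0 and 1 is one way to turn a quality improvement goal into a number that can be tracked over time.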



Cite this article

Missier, P., Lalk, G., Verykios, V. et al. Improving Data Quality in Practice: A Case Study in the Italian Public Administration. Distributed and Parallel Databases 13, 135–160 (2003). https://doi.org/10.1023/A:1021548024224


  • DOI: https://doi.org/10.1023/A:1021548024224
