Improving Data Quality in Practice: A Case Study in the Italian Public Administration

Missier, P.; Lalk, G.; Verykios, V.; Grillo, F.; Lorusso, T.; Angeletti, P.

doi:10.1023/A:1021548024224

Published: March 2003

Volume 13, pages 135–160 (2003)
Distributed and Parallel Databases

P. Missier¹, G. Lalk¹, V. Verykios², F. Grillo³, T. Lorusso³ & P. Angeletti⁴

Abstract

Assessing and improving the quality of data stored in information systems are both important and difficult tasks. For an increasing number of companies that rely on information as one of their most important assets, enforcing high data quality levels represents a strategic investment aimed at preserving the value of those assets. For a public administration or a government, good data quality translates into good service and good relationships with citizens. Achieving high quality standards, however, is a major task because of the variety of ways in which errors might be introduced into a system, and the difficulty of correcting them systematically. Problems with data quality tend to fall into two categories. The first category relates to inconsistency among systems, such as format, syntax, and semantic inconsistencies. The second category relates to inconsistency with reality, as exemplified by missing, obsolete, and incorrect data values and by outliers.

In this paper, we describe a real-life case study on assessing and improving the quality of data in the Italian Public Administration. The domain of study is taxpayer data maintained by the Italian Ministry of Finances. In this context, we provide the Administration with a quantitative assessment of specific problems such as record duplication and address mismatch and obsolescence; we suggest a set of guidelines for setting precise quality improvement goals; and we illustrate analysis techniques for achieving those goals. Our guidelines emphasize the importance of data flow analysis and of defining measurable quality indicators. The quality indicators that we propose are generic and can be used to describe a variety of data quality problems, and thus offer a possible reference framework for practitioners. Finally, we investigate ways to partially automate the analysis of the causes of poor data quality.

Author information

Authors and Affiliations

Applied Research, Telcordia Technologies, Morristown, NJ, USA
P. Missier & G. Lalk
College of Information Science and Technology, Drexel University, Philadelphia, PA, USA
V. Verykios
Italian Ministry of Finances, Italy
F. Grillo & T. Lorusso
SO.GE.I, Roma, Italy
P. Angeletti

About this article

Cite this article

Missier, P., Lalk, G., Verykios, V. et al. Improving Data Quality in Practice: A Case Study in the Italian Public Administration. Distributed and Parallel Databases 13, 135–160 (2003). https://doi.org/10.1023/A:1021548024224

Issue Date: March 2003
DOI: https://doi.org/10.1023/A:1021548024224
