The data quality concept of accuracy in the context of publicly shared data sets

Kuchler, Carsten; Spiess, Martin

doi:10.1007/s11943-009-0056-0

Original publication
Published: 28 May 2009

Volume 3, pages 67–80, (2009)

AStA Wirtschafts- und Sozialstatistisches Archiv

Carsten Kuchler¹ &
Martin Spiess²

Abstract

Along with other data quality dimensions, the concept of accuracy is often used to describe the quality of a particular data set. However, its basic definition refers to the statistical properties of estimators, which can hardly be verified by means of just a single survey. This ambiguity can be resolved by assigning “accuracy” to survey processes that are known to affect these properties. In this contribution, we consider the sub-process of imputation as one important step in setting up a data set and argue that criteria like the so-called “hit-rate” criterion, which is intended to measure the accuracy of a data set by some distance function between “true” but unobserved values and imputed values, are neither required nor desirable. In contrast, the so-called “inference” criterion allows statements on the validity of inferences based on a suitably completed data set under rather general conditions. The underlying theoretical concepts are illustrated by means of a simulation study. It is emphasised that the same arguments apply to other survey processes that introduce uncertainty into an edited data set.
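The contrast between the two criteria can be sketched with a small simulation. The following is an illustrative sketch only and does not reproduce the simulation study of the article: the linear data-generating model, the completely-at-random missingness, the function name `simulate`, and the use of a single stochastic regression imputation as a stand-in for an inference-oriented method are all assumptions made for this example. It compares a distance-to-truth measure (root mean squared error between imputed and true values) with how well the completed data reproduce the variance of the full data.

```python
import numpy as np


def simulate(n=5000, miss_rate=0.4, seed=0):
    """Compare two imputation methods on a hit-rate-like and an estimate-based measure.

    Hypothetical setup: y = 1 + 2x + e with x, e ~ N(0, 1), and values of y
    deleted completely at random with probability `miss_rate`.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    missing = rng.random(n) < miss_rate
    observed = ~missing

    # Fit a simple linear regression of y on x using the observed cases only.
    slope, intercept = np.polyfit(x[observed], y[observed], 1)
    residuals = y[observed] - (intercept + slope * x[observed])
    resid_sd = np.std(residuals, ddof=2)

    prediction = intercept + slope * x[missing]
    imputations = {
        # Deterministic: impute the estimated conditional mean.
        "conditional_mean": prediction,
        # Stochastic: add a random residual draw to each prediction.
        "stochastic": prediction + rng.normal(scale=resid_sd, size=prediction.size),
    }

    full_data_var = float(np.var(y, ddof=1))
    results = {}
    for name, values in imputations.items():
        completed = y.copy()
        completed[missing] = values
        results[name] = {
            # Distance between imputed and true values (smaller looks "more accurate").
            "rmse_to_truth": float(np.sqrt(np.mean((values - y[missing]) ** 2))),
            # Variance estimate an analyst would obtain from the completed data.
            "var_completed": float(np.var(completed, ddof=1)),
        }
    return results, full_data_var


if __name__ == "__main__":
    results, full_data_var = simulate()
    print("variance of full data:", round(full_data_var, 3))
    for name, stats in results.items():
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

Under these assumptions, conditional-mean imputation comes closer to the true values but shrinks the variance of the completed data, whereas the stochastic variant is further from the true values yet leaves the variance estimate close to its full-data counterpart. A fuller treatment of the inference criterion would also look at variance estimation and interval coverage across repeated samples, for example with multiple imputation.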

Zusammenfassung (English translation)

The concept of accuracy is regularly used to describe the quality of a data set. All definitions of this concept, however, refer to the properties of estimators and cannot be reconstructed from the concrete data set at hand. This contradiction can be overcome by applying the concept of accuracy to the processes that underlie the creation of a data set and that affect the corresponding properties of estimators. In this contribution, we consider the sub-process of imputation as an important step in providing a survey data set and argue that “hit-rate” criteria, which aim to capture the accuracy of a data set by means of a distance function on “true” but unobserved values and imputed values, are neither meaningful nor necessary. In contrast, the “inference” criterion permits, under rather general conditions, statements about the validity of inferences based on a suitably completed data set. The underlying theoretical concepts are illustrated by means of a simulation study. It is emphasised that the same arguments apply to other survey processes that are subject to uncertainty.

Author information

Authors and Affiliations

Santander Consumer Bank, Risk Management Vehicles, Santander Platz 1, 41061, Mönchengladbach, Germany
Carsten Kuchler
University of Hamburg, Psychological Methods and DIW Berlin, Von-Melle-Park 5, 20146, Hamburg, Germany
Martin Spiess

Corresponding author

Correspondence to Martin Spiess.

About this article

Cite this article

Kuchler, C., Spiess, M. The data quality concept of accuracy in the context of publicly shared data sets. AStA Wirtsch Sozialstat Arch 3, 67–80 (2009). https://doi.org/10.1007/s11943-009-0056-0

Accepted: 12 May 2009
Published: 28 May 2009
Issue Date: June 2009
DOI: https://doi.org/10.1007/s11943-009-0056-0

CR Subject Classification

C42, C81, C11, C13, C15
