Summary
The goal of privacy protection in statistical databases is to balance society's right to know against the individual's right to privacy. When microdata (i.e. data on individual respondents) are released, they should remain analytically useful, yet be protected so that it cannot be determined whether a published record corresponds to a specific individual. Both requirements are hard to assess. Data utility is uncertain because the data protector cannot always anticipate how the released data will be used. Disclosure risk is uncertain because the data protector cannot foresee what background information potential intruders will hold. Generating synthetic microdata is an alternative to the usual approach of distorting the original data. Its main advantage is that no original data are released, so no disclosure can happen. However, subdomains (i.e. subsets of records) of a synthetic dataset do not resemble the corresponding subdomains of the original dataset. Hybrid microdata, which mix original and synthetic microdata, overcome this lack of analytical validity. We present a fast method for generating numerical hybrid microdata that preserves attribute means, variances and covariances, as well as (to some extent) record similarity and subdomain analyses. We also overcome the uncertainty in assessing data utility by using newly defined probabilistic information loss measures.
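To make the idea of preserving attribute means, variances and covariances concrete, the following is a minimal sketch of one standard way to draw random numerical data whose sample means and sample covariance matrix exactly match those of an original dataset. It is not the chapter's method, which the summary does not specify. The function name `synthetic_with_same_moments` and the toy data are assumptions made for illustration, and the sketch assumes at least two attributes, more records than attributes, and a positive definite covariance matrix.

```python
import numpy as np

def synthetic_with_same_moments(X, rng):
    """Illustrative helper (hypothetical name): return random data with exactly
    the same column means and sample covariance matrix as X."""
    n, d = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)            # d x d sample covariance

    # Draw random noise, centre it, and whiten it so that its
    # sample covariance is exactly the identity matrix.
    A = rng.standard_normal((n, d))
    A -= A.mean(axis=0)
    L_a = np.linalg.cholesky(np.cov(A, rowvar=False))
    Z = A @ np.linalg.inv(L_a).T

    # Re-colour with the Cholesky factor of the original covariance
    # and restore the original means.
    L_x = np.linalg.cholesky(cov)
    return Z @ L_x.T + mean

# Toy data standing in for an original numerical microdata file (assumption).
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(200, 3)) @ mixing
Y = synthetic_with_same_moments(X, rng)
```

Applied to the whole file, a construction like this matches global first and second moments only. Subsets of `Y` need not resemble the corresponding subsets of `X`, which is the limitation of fully synthetic data that the summary says hybrid microdata address.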
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Domingo-Ferrer, J., Mateo-Sanz, J.M., Sebé, F. (2006). Information Loss in Continuous Hybrid Microdata: Subdomain-Level Probabilistic Measures. In: Herrera-Viedma, E., Pasi, G., Crestani, F. (eds) Soft Computing in Web Information Retrieval. Studies in Fuzziness and Soft Computing, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31590-X_14
DOI: https://doi.org/10.1007/3-540-31590-X_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31588-9
Online ISBN: 978-3-540-31590-2
eBook Packages: Engineering, Engineering (R0)