Information Loss in Continuous Hybrid Microdata: Subdomain-Level Probabilistic Measures

Domingo-Ferrer, Josep; Mateo-Sanz, Josep Maria; Sebé, Francesc

doi:10.1007/3-540-31590-X_14

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 197)

Summary

The goal of privacy protection in statistical databases is to balance the social right to know against the individual right to privacy. When microdata (i.e. data on individual respondents) are released, they should remain analytically useful, yet be protected so that it cannot be determined whether a published record matches a specific individual. However, assessing data utility involves some uncertainty, because the data protector cannot always anticipate how the released data will be used. Assessing disclosure risk is also uncertain, because the data protector cannot foresee the information context of potential intruders. Generating synthetic microdata is an alternative to the usual approach of distorting the original data. Its main advantage is that no original data are released, so no disclosure can happen. However, subdomains (i.e. subsets of records) of synthetic datasets do not resemble the corresponding subdomains of the original dataset. Hybrid microdata, which mix original and synthetic microdata, overcome this lack of analytical validity. We present a fast method for generating numerical hybrid microdata in a way that preserves attribute means, variances and covariances, as well as (to some extent) record similarity and subdomain analyses. We also overcome the uncertainty in assessing data utility by using newly defined probabilistic information loss measures.
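
To make the idea of preserving means, variances and covariances concrete, here is a minimal Python sketch. It is not the chapter's algorithm, whose details are not given on this page. The function names `match_moments` and `hybrid_microdata`, the mixing weight `alpha`, and the choice of Cholesky-based whitening and recoloring are all assumptions made for illustration. The sketch also assumes at least two numerical attributes and more records than attributes, so that the sample covariance matrices are positive definite.

```python
import numpy as np


def match_moments(data, target):
    """Linearly transform `data` so that its column means and sample
    covariance matrix equal those of `target` (hypothetical helper)."""
    centered = data - data.mean(axis=0)
    # Lower-triangular Cholesky factors of both sample covariance matrices.
    chol_data = np.linalg.cholesky(np.cov(centered, rowvar=False))
    chol_target = np.linalg.cholesky(np.cov(target, rowvar=False))
    # Whiten with the data's own factor (covariance becomes the identity) ...
    whitened = np.linalg.solve(chol_data, centered.T).T
    # ... then recolor with the target's factor and restore the target means.
    return whitened @ chol_target.T + target.mean(axis=0)


def hybrid_microdata(original, alpha=0.5, seed=0):
    """Mix original records with synthetic ones, then re-impose the original
    means and covariances. `alpha` (an assumed parameter) is the weight given
    to the original records: 1.0 returns data close to the original,
    0.0 returns purely synthetic data."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(original.shape)
    # Synthetic records with exactly the original means and covariances.
    synthetic = match_moments(noise, original)
    # Record-wise mixture keeps each hybrid record near its original record.
    mixed = alpha * original + (1.0 - alpha) * synthetic
    # Mixing shrinks the covariances, so match the moments once more.
    return match_moments(mixed, original)


# Illustrative usage on made-up data (not from the chapter).
demo_rng = np.random.default_rng(123)
demo_original = demo_rng.standard_normal((100, 3)) * [1.0, 5.0, 0.5] + [20.0, 0.0, -3.0]
demo_hybrid = hybrid_microdata(demo_original, alpha=0.7, seed=1)
```

In this sketch, a larger `alpha` keeps hybrid records closer to the originals, which helps subdomain analyses at the price of higher disclosure risk, while the final moment-matching step keeps the overall means and covariances exact regardless of `alpha`.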

Author information

Authors and Affiliations

Rovira i Virgili University of Tarragona, Av. Països Catalans 26, E-43007, Tarragona, Catalonia
Josep Domingo-Ferrer, Josep Maria Mateo-Sanz & Francesc Sebé

Editor information

Editors and Affiliations

Department of Computer Science and A.I., E.T.S.I. Informática, University of Granada, C/ Periodista Daniel Saucedo Aranda s/n, Granada, Spain
Enrique Herrera-Viedma
Department of Informatics Systems and Communication (DISCo), Università degli Studi di Milano Bicocca, Via Bicocca degli Arcimboldi, 8 (Edificio U7), 20126, Milano, Italy
Gabriella Pasi
Department of Computer and Information Sciences, University of Strathclyde, Livingstone Tower, 26 Richmond Street, Glasgow, G1 1XH, Scotland, UK
Fabio Crestani

About this chapter

Cite this chapter

Domingo-Ferrer, J., Mateo-Sanz, J.M., Sebé, F. (2006). Information Loss in Continuous Hybrid Microdata: Subdomain-Level Probabilistic Measures. In: Herrera-Viedma, E., Pasi, G., Crestani, F. (eds) Soft Computing in Web Information Retrieval. Studies in Fuzziness and Soft Computing, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31590-X_14

DOI: https://doi.org/10.1007/3-540-31590-X_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31588-9
Online ISBN: 978-3-540-31590-2
eBook Packages: Engineering, Engineering (R0)
