Information Loss in Continuous Hybrid Microdata: Subdomain-Level Probabilistic Measures

  • Chapter
Soft Computing in Web Information Retrieval

Summary

The goal of privacy protection in statistical databases is to balance the social right to know and the individual right to privacy. When microdata (i.e. data on individual respondents) are released, they should stay analytically useful but should be protected so that it cannot be determined whether a published record matches a specific individual. However, there is some uncertainty in the assessment of data utility, since the data protector cannot always anticipate the specific uses of the released data. There is also uncertainty in assessing disclosure risk, because the data protector cannot foresee the information context of potential intruders. Generating synthetic microdata is an alternative to the usual approach based on distorting the original data. The main advantage is that no original data are released, so no disclosure can happen. However, subdomains (i.e. subsets of records) of synthetic datasets do not resemble the corresponding subdomains of the original dataset. Hybrid microdata, which mix original and synthetic microdata, overcome this lack of analytical validity. We present a fast method for generating numerical hybrid microdata in a way that preserves attribute means, variances and covariances, as well as (to some extent) record similarity and subdomain analyses. We also overcome the uncertainty in assessing data utility by using newly defined probabilistic information loss measures.
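
To make the idea of preserving attribute means, variances and covariances concrete, the following is a minimal sketch in Python. It is not the method of the chapter, whose details are not given in this summary. It assumes one standard way of producing synthetic numerical data whose sample mean vector and sample covariance matrix exactly match those of the original: whitening random noise and recoloring it with a Cholesky factor. The function name `synthesize_preserving_moments` is hypothetical, and the sketch assumes at least two attributes, more records than attributes, and a positive definite covariance matrix.

```python
import numpy as np

def synthesize_preserving_moments(X, rng):
    """Illustrative only: return synthetic rows whose column means and
    sample covariance matrix match those of X (assumes d >= 2, n > d)."""
    n, d = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)            # sample covariance of the original

    # Random noise, centered so that its column means are exactly zero.
    Z = rng.standard_normal((n, d))
    Z -= Z.mean(axis=0)

    # Whiten the noise: W has sample covariance exactly equal to the identity.
    Lz = np.linalg.cholesky(np.cov(Z, rowvar=False))
    W = np.linalg.solve(Lz, Z.T).T

    # Recolor with the Cholesky factor of the original covariance, then shift.
    L = np.linalg.cholesky(cov)
    return mean + W @ L.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) + np.array([10.0, -2.0, 5.0])   # made-up data
Y = synthesize_preserving_moments(X, rng)
print(np.allclose(Y.mean(axis=0), X.mean(axis=0)))
print(np.allclose(np.cov(Y, rowvar=False), np.cov(X, rowvar=False)))
```

Applied to a whole dataset, such a procedure matches only the global moments, so an arbitrary subset of the synthetic rows generally has different means and covariances from the same subset of the original rows. This is the subdomain problem that the summary says hybrid microdata are meant to address.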


Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Domingo-Ferrer, J., Mateo-Sanz, J.M., Sebé, F. (2006). Information Loss in Continuous Hybrid Microdata: Subdomain-Level Probabilistic Measures. In: Herrera-Viedma, E., Pasi, G., Crestani, F. (eds) Soft Computing in Web Information Retrieval. Studies in Fuzziness and Soft Computing, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31590-X_14

  • DOI: https://doi.org/10.1007/3-540-31590-X_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31588-9

  • Online ISBN: 978-3-540-31590-2

  • eBook Packages: Engineering, Engineering (R0)
