Maximum entropy simulation for microdata protection

Polettini, Silvia

doi:10.1023/A:1025606604377

Published: October 2003

Statistics and Computing, Volume 13, pages 307–320 (2003)

Abstract

The paper proposes a new disclosure limitation procedure based on simulation. The key feature of the proposal is to protect actual microdata by drawing artificial units from a probability model that is estimated from the observed data. The model is designed to maintain selected characteristics of the empirical distribution, and thus provides a partial representation of it. The characteristics we focus on are the expected values of a set of functions, which are constrained to equal the corresponding sample averages; the simulated data then reproduce the sample characteristics on average. If the set of constraints covers the parameters of interest to a user, information loss is controlled; at the same time, because the model does not preserve individual values, re-identification attempts are impaired: synthetic individuals correspond to actual respondents only with very low probability.
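As a sketch of the constraint described above, the notation below is assumed for illustration and is not taken from the paper: $p$ is the simulation model, $g_1, \dots, g_K$ are the chosen functions, $x_1, \dots, x_n$ are the observed units, and $X^{*}_1, \dots, X^{*}_m$ are synthetic units drawn from $p$. The first line states the moment constraints; the second is the sense in which the simulated data reproduce the sample characteristics on average.

```latex
\begin{align*}
\mathbb{E}_{p}\bigl[g_k(X)\bigr] &= \frac{1}{n}\sum_{i=1}^{n} g_k(x_i),
  \qquad k = 1, \dots, K, \\
\mathbb{E}\Bigl[\frac{1}{m}\sum_{j=1}^{m} g_k(X^{*}_j)\Bigr] &= \frac{1}{n}\sum_{i=1}^{n} g_k(x_i),
  \qquad X^{*}_1, \dots, X^{*}_m \sim p .
\end{align*}
```

The second line follows from the first by linearity of expectation; it says nothing about any single synthetic sample, only about its average behaviour.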

Disclosure is mainly discussed from the viewpoint of record re-identification. Under this definition, since the pledge of confidentiality only involves the actual respondents, the release of synthetic units should in principle remove the confidentiality concern.

The simulation model is built on the Italian sample from the Community Innovation Survey (CIS). The approach can be applied more generally, and is especially suited to quantitative traits. The model has a semi-parametric component, based on the maximum entropy principle, and, here, a parametric component, based on regression. The maximum entropy principle is exploited to match data traits; moreover, entropy measures the uncertainty of a distribution: its maximisation leads to a distribution which is consistent with the given information but is maximally noncommittal with regard to missing information.
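The maximum entropy idea can be illustrated with a minimal, hedged sketch. It is not the paper's model (which is built on CIS data and also includes a regression component); it is a toy, univariate instance under stated assumptions: a hypothetical data vector, a hypothetical finite grid as support, and two constraint functions (the mean and the mean of squares). On a finite support, the entropy-maximising distribution under expectation constraints has the exponential form p(x) ∝ exp(Σ_k λ_k g_k(x)), and the multipliers λ_k can be found by minimising the convex dual log Z(λ) − λ·ḡ. The names `solve`, `fit_maxent`, `data` and `support` are illustrative only.

```python
import math
import random


def solve(A, b):
    """Solve a small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * d for a, d in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_maxent(support, features, targets, iters=200, tol=1e-9):
    """Fit p(x) proportional to exp(sum_k lam_k * g_k(x)) on a finite support
    so that E_p[g_k] = targets[k], using damped Newton steps on the dual."""
    G = [[g(x) for g in features] for x in support]  # feature values per point
    K = len(features)

    def state(lam):
        # Probabilities and dual objective log Z(lam) - lam . targets,
        # computed with the usual max-subtraction for numerical stability.
        s = [sum(l * v for l, v in zip(lam, row)) for row in G]
        top = max(s)
        w = [math.exp(v - top) for v in s]
        z = sum(w)
        obj = top + math.log(z) - sum(l * t for l, t in zip(lam, targets))
        return [v / z for v in w], obj

    lam = [0.0] * K          # lam = 0 is the uniform distribution on the grid
    p, obj = state(lam)
    for _ in range(iters):
        mean = [sum(pi * row[k] for pi, row in zip(p, G)) for k in range(K)]
        grad = [m - t for m, t in zip(mean, targets)]  # model minus sample moments
        if max(abs(g) for g in grad) < tol:
            break
        # Hessian of the dual is the covariance matrix of the features under p.
        H = [[sum(pi * (row[a] - mean[a]) * (row[b] - mean[b])
                  for pi, row in zip(p, G)) for b in range(K)] for a in range(K)]
        step = solve(H, grad)
        t = 1.0
        while True:  # halve the step until the dual objective does not increase
            cand = [l - t * d for l, d in zip(lam, step)]
            p_new, obj_new = state(cand)
            if obj_new <= obj + 1e-12 or t < 1e-8:
                break
            t /= 2
        lam, p, obj = cand, p_new, obj_new
    return lam, p


# Hypothetical observed values (made up for illustration, not CIS data).
data = [1.2, 2.0, 2.4, 3.1, 3.3, 4.0, 5.5, 7.8]
features = [lambda x: x, lambda x: x * x]          # constrain E[X] and E[X^2]
targets = [sum(g(x) for x in data) / len(data) for g in features]

support = [i / 10 for i in range(101)]             # assumed grid on [0, 10]
lam, p = fit_maxent(support, features, targets)
model_moments = [sum(pi * g(x) for pi, x in zip(p, support)) for g in features]

# Draw synthetic units from the fitted model instead of releasing `data`.
rng = random.Random(0)
synthetic = rng.choices(support, weights=p, k=1000)

print("sample moments:", targets)
print("model moments: ", model_moments)
```

The fitted model matches the two sample moments by construction, while the synthetic draws are grid points sampled from the model rather than copies of the observed records. Anything not covered by the constraints (for example, skewness) is left as uninformative as the support allows, which is why the choice of constraint functions matters.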

Application results reveal that the fixed characteristics are sustained, and other features such as marginal distributions are well represented. Model specification is clearly a major point; related issues are selection of characteristics, goodness of fit and strength of dependence relations.

Author information

Authors and Affiliations

ISTAT, Servizio della Metodologia di Base per la Produzione Statistica, Via Cesare Balbo 16, 00184, Roma, Italy
Silvia Polettini

About this article

Cite this article

Polettini, S. Maximum entropy simulation for microdata protection. Statistics and Computing 13, 307–320 (2003). https://doi.org/10.1023/A:1025606604377

Issue Date: October 2003
DOI: https://doi.org/10.1023/A:1025606604377
