Maximum entropy simulation for microdata protection

Statistics and Computing

Abstract

The paper proposes a new disclosure limitation procedure based on simulation. The key feature of the proposal is to protect actual microdata by drawing artificial units from a probability model that is estimated from the observed data. Such a model is designed to maintain selected characteristics of the empirical distribution, thus providing a partial representation of the latter. The characteristics we focus on are the expected values of a set of functions; these are constrained to be equal to their corresponding sample averages, so that the simulated data reproduce the sample characteristics on average. If the set of constraints covers the parameters of interest to a user, information loss is controlled. At the same time, because the model does not preserve individual values, re-identification attempts are impaired: synthetic individuals correspond to actual respondents with very low probability.
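In standard maximum entropy notation, the constrained problem described above can be written as follows. The symbols $p$, $f_k$, $\bar f_k$, $\lambda_k$ and $Z$ are introduced here for illustration and are not taken from the paper; this is the textbook formulation, not necessarily the exact model the paper fits.

```latex
\begin{aligned}
&\max_{p}\; H(p) = -\int p(x)\,\log p(x)\,dx \\
&\text{subject to}\quad \int f_k(x)\,p(x)\,dx = \bar f_k = \frac{1}{n}\sum_{i=1}^{n} f_k(x_i),
\qquad k = 1,\dots,K, \qquad \int p(x)\,dx = 1 .
\end{aligned}
\qquad\Longrightarrow\qquad
p_\lambda(x) = \frac{\exp\!\bigl(\sum_{k=1}^{K}\lambda_k f_k(x)\bigr)}{Z(\lambda)},
\quad
Z(\lambda) = \int \exp\!\Bigl(\sum_{k=1}^{K}\lambda_k f_k(x)\Bigr)\,dx .
```

When a solution exists, it has the exponential-family form shown on the right, and the multipliers $\lambda_k$ can be found by minimising the convex function $\log Z(\lambda) - \sum_k \lambda_k \bar f_k$, whose gradient is the difference between the model expectations and the sample averages.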

Disclosure is mainly discussed from the viewpoint of record re-identification. Under this definition, since the pledge of confidentiality only involves the actual respondents, releasing synthetic units should in principle remove the concern for confidentiality.

The simulation model is built on the Italian sample from the Community Innovation Survey (CIS). The approach can be applied more generally, and is especially suited to quantitative traits. The model has a semi-parametric component, based on the maximum entropy principle, and, here, a parametric component, based on regression. The maximum entropy principle is exploited to match data traits; moreover, entropy measures the uncertainty of a distribution: maximising it leads to a distribution that is consistent with the given information but is maximally noncommittal with regard to missing information.
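The maximum entropy component can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy sample, the two constraint functions (first and second moments), the discrete grid standing in for a continuous support, and the plain gradient descent on the dual, which is swapped in for whatever optimiser the paper actually uses. The regression component is not represented at all.

```python
import math
import random

def maxent_probs(feature_rows, lam):
    """Probabilities proportional to exp(sum_k lam[k] * f_k(x)) on the grid."""
    scores = [sum(l * f for l, f in zip(lam, row)) for row in feature_rows]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def model_means(feature_rows, probs):
    """Expected value of each constraint function under the model."""
    k = len(feature_rows[0])
    return [sum(p * row[j] for p, row in zip(probs, feature_rows)) for j in range(k)]

def fit_maxent(grid, features, targets, steps=5000, lr=0.02):
    """Gradient descent on the dual log Z(lam) - lam . targets."""
    rows = [[f(x) for f in features] for x in grid]
    lam = [0.0] * len(features)
    for _ in range(steps):
        means = model_means(rows, maxent_probs(rows, lam))
        # Dual gradient = model expectations minus sample averages.
        lam = [l - lr * (m - t) for l, m, t in zip(lam, means, targets)]
    probs = maxent_probs(rows, lam)
    return lam, probs, model_means(rows, probs)

# Toy "confidential" sample (made-up numbers, not CIS data).
observed = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5, 2.0]
features = [lambda x: x, lambda x: x * x]  # hypothetical constraint functions
targets = [sum(f(x) for x in observed) / len(observed) for f in features]

grid = [i / 10 for i in range(-40, 41)]  # assumed support: -4.0 to 4.0
lam, probs, fitted = fit_maxent(grid, features, targets)

# Draw artificial units from the fitted model instead of releasing the data.
rng = random.Random(0)
synthetic = rng.choices(grid, weights=probs, k=20000)
synthetic_means = [sum(f(x) for x in synthetic) / len(synthetic) for f in features]

print("sample averages:   ", targets)
print("model expectations:", fitted)
print("synthetic averages:", synthetic_means)
```

The fitted model reproduces the constrained averages essentially exactly, while the synthetic draws match them only up to sampling error, which is the sense in which the simulated data reproduce the sample characteristics on average.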

Application results reveal that the fixed characteristics are sustained, and other features such as marginal distributions are well represented. Model specification is clearly a major point; related issues are selection of characteristics, goodness of fit and strength of dependence relations.

Cite this article

Polettini, S. Maximum entropy simulation for microdata protection. Statistics and Computing 13, 307–320 (2003). https://doi.org/10.1023/A:1025606604377

  • DOI: https://doi.org/10.1023/A:1025606604377
