Abstract
The paper proposes a new disclosure limitation procedure based on simulation. The key feature of the proposal is to protect actual microdata by drawing artificial units from a probability model that is estimated from the observed data. Such a model is designed to maintain selected characteristics of the empirical distribution, thus providing a partial representation of the latter. The characteristics we focus on are the expected values of a set of functions; these are constrained to equal the corresponding sample averages, so that the simulated data reproduce the sample characteristics on average. If the set of constraints covers the parameters of interest to a user, information loss is controlled, while re-identification attempts are impaired because the model does not preserve individual values: synthetic individuals correspond to actual respondents with very low probability.
Disclosure is mainly discussed from the viewpoint of record re-identification. Under this definition, since the pledge of confidentiality involves only the actual respondents, releasing synthetic units should in principle remove the confidentiality concern.
The simulation model is built on the Italian sample from the Community Innovation Survey (CIS). The approach can be applied more generally, and is especially suited to quantitative traits. The model has a semi-parametric component, based on the maximum entropy principle, and, here, a parametric component, based on regression. The maximum entropy principle is exploited to match data traits; moreover, since entropy measures the uncertainty of a distribution, its maximisation leads to a distribution that is consistent with the given information but is maximally noncommittal with regard to missing information.
Application results reveal that the fixed characteristics are sustained, and other features such as marginal distributions are well represented. Model specification is clearly a major point; related issues are selection of characteristics, goodness of fit and strength of dependence relations.
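The constraint-matching idea described above can be illustrated with a minimal sketch. It is not the paper's model: it assumes a single variable discretised on a finite grid, two hypothetical constraint functions (x and x squared), and toy gamma-distributed data in place of the CIS sample. Under these assumptions the maximum entropy distribution has weights proportional to exp(lambda . f(x)), and the multipliers lambda can be found by minimising the convex dual, whose gradient is the gap between model expectations and sample averages. All names (`fit_maxent`, `grid`, `targets`) are illustrative.

```python
import numpy as np
from scipy.optimize import minimize


def fit_maxent(F, targets):
    """Return weights p_i proportional to exp(F[i] @ lam) such that F.T @ p = targets."""

    def dual(lam):
        z = F @ lam
        zmax = z.max()                      # stabilise the exponentials
        w = np.exp(z - zmax)
        p = w / w.sum()
        value = zmax + np.log(w.sum()) - lam @ targets   # log-partition minus lam . targets
        grad = F.T @ p - targets            # model expectation minus sample average
        return value, grad

    res = minimize(dual, np.zeros(F.shape[1]), jac=True, method="L-BFGS-B",
                   options={"ftol": 1e-12, "gtol": 1e-8, "maxiter": 1000})
    z = F @ res.x
    p = np.exp(z - z.max())
    return p / p.sum(), res.x


# Toy "observed" data (an assumption; stands in for confidential microdata).
rng = np.random.default_rng(0)
observed = rng.gamma(shape=4.0, scale=0.5, size=500)

# Finite support for the synthetic values (an assumption of this sketch).
grid = np.linspace(0.0, 1.2 * observed.max(), 400)

# Constraint functions evaluated on the grid, and their sample averages.
F = np.column_stack([grid, grid**2])
targets = np.array([observed.mean(), (observed**2).mean()])

p, lam = fit_maxent(F, targets)

# Synthetic units are drawn from the fitted model, not copied from the data.
synthetic = rng.choice(grid, size=5000, p=p)

print("sample averages   :", targets)
print("model expectations:", F.T @ p)
print("synthetic averages:", synthetic.mean(), (synthetic**2).mean())
```

The model expectations should agree with the sample averages up to optimiser tolerance, while the synthetic averages agree only up to simulation error, which is the sense in which the simulated data reproduce the constrained characteristics on average.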
Cite this article
Polettini, S. Maximum entropy simulation for microdata protection. Statistics and Computing 13, 307–320 (2003). https://doi.org/10.1023/A:1025606604377