Efficient multivariate data-oriented microaggregation

  • Special Issue Paper
The VLDB Journal

Abstract

Microaggregation is a family of methods for statistical disclosure control (SDC) of microdata (records on individuals and/or companies), that is, for masking microdata so that they can be released while preserving the privacy of the underlying individuals. The principle of microaggregation is to aggregate original database records into small groups prior to publication. Each group should contain at least k records to prevent disclosure of individual information, where k is a constant value preset by the data protector. Recently, microaggregation has been shown to be useful for achieving k-anonymity, in addition to being a good masking method. Optimal microaggregation (with minimum within-groups variability loss) can be computed in polynomial time for univariate data. Unfortunately, for multivariate data it is an NP-hard problem. Several heuristic approaches to microaggregation have been proposed in the literature. Heuristics yielding groups with fixed size k tend to be more efficient, whereas data-oriented heuristics yielding variable group size tend to result in lower information loss. This paper presents new data-oriented heuristics which improve on the trade-off between computational complexity and information loss and are thus usable for large datasets.
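
To make the principle concrete, the following is a minimal sketch of a generic fixed-size-style microaggregation step. It is not one of the heuristics proposed in the paper, whose details are not given in this abstract. The function names (`microaggregate`, `sse`, `centroid`, `sq_dist`), the seeding rule (start each group from the record farthest from the centroid of the remaining records) and the use of squared Euclidean distance are all assumptions made for illustration. Each record is replaced by the centroid of its group, and the within-groups sum of squared errors serves as a simple stand-in for information loss.

```python
from math import fsum


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length numeric tuples."""
    n = len(rows)
    return tuple(fsum(col) / n for col in zip(*rows))


def sq_dist(a, b):
    """Squared Euclidean distance between two records."""
    return fsum((x - y) ** 2 for x, y in zip(a, b))


def microaggregate(records, k):
    """Illustrative heuristic (an assumption, not the paper's algorithm).

    Returns (groups, masked): groups is a list of lists of record indices,
    each of size >= k; masked replaces every record by its group centroid.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    remaining = list(range(len(records)))
    groups = []
    # Carve off groups of exactly k records while at least 2k remain,
    # so that the leftover records can still form a valid group.
    while len(remaining) >= 2 * k:
        c = centroid([records[i] for i in remaining])
        seed = max(remaining, key=lambda i: sq_dist(records[i], c))
        nearest = sorted(
            remaining, key=lambda i: sq_dist(records[i], records[seed])
        )[:k]
        groups.append(nearest)
        chosen = set(nearest)
        remaining = [i for i in remaining if i not in chosen]
    groups.append(remaining)  # final group holds between k and 2k - 1 records
    masked = [None] * len(records)
    for group in groups:
        c = centroid([records[i] for i in group])
        for i in group:
            masked[i] = c
    return groups, masked


def sse(records, masked):
    """Within-groups sum of squared errors between original and masked data."""
    return fsum(sq_dist(r, m) for r, m in zip(records, masked))


if __name__ == "__main__":
    data = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1),
            (8.0, 9.0), (8.1, 9.2), (7.9, 8.8), (8.3, 9.1)]
    groups, masked = microaggregate(data, k=3)
    print(groups)
    print(sse(data, masked))
```

Because every released record coincides with at least k - 1 others, no record can be singled out on the aggregated attributes. A data-oriented heuristic in the sense of the abstract would additionally let group sizes vary with the structure of the data; this sketch only lets the last group grow.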

Author information

Corresponding author

Correspondence to Josep Domingo-Ferrer.

About this article

Cite this article

Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J.M. et al. Efficient multivariate data-oriented microaggregation. The VLDB Journal 15, 355–369 (2006). https://doi.org/10.1007/s00778-006-0007-0

  • DOI: https://doi.org/10.1007/s00778-006-0007-0
