Efficient multivariate data-oriented microaggregation

Domingo-Ferrer, Josep; Martínez-Ballesté, Antoni; Mateo-Sanz, Josep Maria; Sebé, Francesc

doi:10.1007/s00778-006-0007-0

Special Issue Paper
Published: 29 August 2006

Volume 15, pages 355–369, (2006)
The VLDB Journal

Josep Domingo-Ferrer¹, Antoni Martínez-Ballesté¹, Josep Maria Mateo-Sanz² & Francesc Sebé¹

Abstract

Microaggregation is a family of methods for statistical disclosure control (SDC) of microdata (records on individuals and/or companies), that is, for masking microdata so that they can be released while preserving the privacy of the underlying individuals. The principle of microaggregation is to aggregate original database records into small groups prior to publication. Each group should contain at least k records to prevent disclosure of individual information, where k is a constant value preset by the data protector. Recently, microaggregation has been shown to be useful for achieving k-anonymity, in addition to being a good masking method. Optimal microaggregation (with minimum within-groups variability loss) can be computed in polynomial time for univariate data. Unfortunately, for multivariate data it is an NP-hard problem. Several heuristic approaches to microaggregation have been proposed in the literature. Heuristics yielding groups with fixed size k tend to be more efficient, whereas data-oriented heuristics yielding variable group size tend to result in lower information loss. This paper presents new data-oriented heuristics which improve on the trade-off between computational complexity and information loss and are thus usable for large datasets.
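
The grouping principle described in the abstract can be illustrated with a minimal Python sketch. This is not one of the heuristics presented in the paper; it is a naive fixed-size scheme written only to show what "groups of at least k records" and "within-groups variability" mean in practice. The function names and the ordering rule (sorting records by the sum of their attribute values) are assumptions made for this illustration.

```python
def fixed_size_groups(records, k):
    """Partition record indices into groups of exactly k records.

    The last group absorbs any remainder, so every group holds
    between k and 2k-1 records. The ordering rule (sum of attribute
    values) is a hypothetical choice made for this sketch only.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    order = sorted(range(len(records)), key=lambda i: sum(records[i]))
    n_groups = len(records) // k
    groups = [order[g * k:(g + 1) * k] for g in range(n_groups)]
    groups[-1].extend(order[n_groups * k:])  # leftover records join the last group
    return groups


def centroid(records, group):
    """Attribute-wise mean of the records whose indices are in `group`."""
    dims = len(records[0])
    return tuple(sum(records[i][d] for i in group) / len(group) for d in range(dims))


def microaggregate(records, k):
    """Replace each record by its group centroid.

    Returns the masked records, the groups, and the within-groups
    sum of squared errors (SSE), a common measure of information loss.
    """
    groups = fixed_size_groups(records, k)
    masked = [None] * len(records)
    sse = 0.0
    for group in groups:
        c = centroid(records, group)
        for i in group:
            masked[i] = c
            sse += sum((x - m) ** 2 for x, m in zip(records[i], c))
    return masked, groups, sse


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 1.0), (10.0, 10.0), (11.0, 9.0), (12.0, 11.0)]
    masked, groups, sse = microaggregate(data, k=2)
    print(groups)
    print(masked)
    print(sse)
```

A fixed-size scheme like this one is cheap because it only sorts and slices, but it ignores the natural clusters in the data. A data-oriented heuristic would instead let group sizes vary (while staying at or above k) in order to lower the SSE.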

Author information

Authors and Affiliations

Department of Computer Engineering & Maths, Rovira i Virgili University of Tarragona, Av. Països Catalans 26, Tarragona, Catalonia
Josep Domingo-Ferrer, Antoni Martínez-Ballesté & Francesc Sebé
Statistics Group, Rovira i Virgili University of Tarragona, Av. Països Catalans 26, Tarragona, Catalonia
Josep Maria Mateo-Sanz

Corresponding author

Correspondence to Josep Domingo-Ferrer.

About this article

Cite this article

Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J.M. et al. Efficient multivariate data-oriented microaggregation. The VLDB Journal 15, 355–369 (2006). https://doi.org/10.1007/s00778-006-0007-0

Received: 30 September 2005
Accepted: 25 May 2006
Published: 29 August 2006
Issue Date: November 2006
DOI: https://doi.org/10.1007/s00778-006-0007-0
