Abstract
Microaggregation is a family of methods for statistical disclosure control (SDC) of microdata (records on individuals and/or companies), that is, for masking microdata so that they can be released while preserving the privacy of the underlying individuals. The principle of microaggregation is to aggregate original database records into small groups prior to publication. Each group should contain at least k records to prevent disclosure of individual information, where k is a constant value preset by the data protector. Recently, microaggregation has been shown to be useful for achieving k-anonymity, in addition to being a good masking method. Optimal microaggregation (with minimum within-groups variability loss) can be computed in polynomial time for univariate data. Unfortunately, for multivariate data it is an NP-hard problem. Several heuristic approaches to microaggregation have been proposed in the literature. Heuristics yielding groups with fixed size k tend to be more efficient, whereas data-oriented heuristics yielding variable group size tend to result in lower information loss. This paper presents new data-oriented heuristics which improve on the trade-off between computational complexity and information loss and are thus usable for large datasets.
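To make the principle concrete, the following minimal Python sketch shows a simple fixed-size grouping on univariate data. It is an illustration only, not one of the heuristics presented in the paper: the function name `microaggregate_fixed`, the choice to replace each record by its group mean, and the rule of merging a short trailing group into the previous one are all assumptions made for this example. Information loss is measured here as the within-groups sum of squared errors (SSE).

```python
from statistics import mean


def microaggregate_fixed(values, k):
    """Illustrative univariate fixed-size microaggregation (hypothetical helper).

    Sorts the records, cuts them into consecutive groups of k, merges a
    trailing group smaller than k into the previous group, and replaces
    every record by the mean of its group.
    Returns (masked_values, groups_of_indices, within_groups_sse).
    """
    n = len(values)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k records")

    # Indices of the records in increasing order of value.
    order = sorted(range(n), key=lambda i: values[i])

    # Consecutive chunks of size k over the sorted records.
    groups = [order[start:start + k] for start in range(0, n, k)]

    # Assumption for this sketch: a short last chunk is absorbed by the
    # previous group, so every group ends up with between k and 2k-1 records.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)

    masked = [0.0] * n
    sse = 0.0
    for group in groups:
        centroid = mean(values[i] for i in group)
        for i in group:
            masked[i] = centroid                    # publish the group mean
            sse += (values[i] - centroid) ** 2      # within-groups variability

    return masked, groups, sse


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
    masked, groups, sse = microaggregate_fixed(data, k=3)
    print(masked)
    print([len(g) for g in groups], sse)
```

Each published value is shared by at least k records, which is the property the abstract relies on. A data-oriented heuristic would instead let group sizes vary with the data in order to lower the SSE.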
Cite this article
Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J.M. et al. Efficient multivariate data-oriented microaggregation. The VLDB Journal 15, 355–369 (2006). https://doi.org/10.1007/s00778-006-0007-0
DOI: https://doi.org/10.1007/s00778-006-0007-0