Efficient microaggregation techniques for large numerical data volumes

Solé, Marc; Muntés-Mulero, Victor; Nin, Jordi

doi:10.1007/s10207-012-0158-5

Regular Contribution
Published: 08 February 2012

Volume 11, pages 253–267, (2012)
International Journal of Information Security

Marc Solé¹,², Victor Muntés-Mulero³ & Jordi Nin¹

Abstract

The contradictory requirements of data privacy and data analysis have fostered the development of statistical disclosure control techniques. In this context, microaggregation is one of the most frequently used methods, since it offers a good trade-off between simplicity and quality. Unfortunately, most currently available microaggregation algorithms were devised to work with small datasets, while the size of current databases is constantly increasing. The usual way to tackle this problem is to partition large data volumes into smaller fragments that available algorithms can process in reasonable time. This solution comes at the cost of lower quality. In this paper, we revisit the computational needs of microaggregation and show that they can be reduced to two steps: sorting the dataset with respect to a vantage point, and performing a set of k-nearest neighbors searches. From this new point of view, we propose three new efficient, quality-preserving microaggregation algorithms based on k-nearest neighbors search techniques. We compare our approaches with the most significant strategies presented in the literature using three very large real datasets. Experimental results show that our proposals outperform previous techniques by striking a better balance between performance and the quality of the anonymized dataset.
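The two-step view described in the abstract (sort with respect to a vantage point, then run k-nearest neighbors searches) can be illustrated with a minimal sketch. This is not one of the three algorithms proposed in the paper, whose details are not given on this page. The function name `microaggregate`, the choice of vantage point (the record farthest from the global centroid), and the brute-force neighbor search are all assumptions made for illustration; an efficient implementation would presumably replace the brute-force search with an index-based k-nearest neighbors structure.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def microaggregate(records, k):
    """Hypothetical fixed-size microaggregation sketch.

    Returns (anonymized, groups): every record is replaced by the
    centroid of its group, and every group has at least k members.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")

    # Step 1: sort record indices by distance to a vantage point.
    # Assumption: the vantage point is the record farthest from the centroid.
    center = centroid(records)
    vantage = max(records, key=lambda r: sq_dist(r, center))
    order = sorted(range(len(records)),
                   key=lambda i: sq_dist(records[i], vantage))

    # Step 2: walk the sorted order; each unassigned record seeds a group
    # made of itself plus its k-1 nearest unassigned neighbors.
    unassigned = set(order)
    groups = []
    for i in order:
        if i not in unassigned:
            continue
        if len(unassigned) < 2 * k:
            # Too few records left for two groups: close with one final group.
            groups.append(sorted(unassigned))
            break
        others = unassigned - {i}
        # Brute-force k-NN search (stand-in for an indexed search).
        neighbors = sorted(others,
                           key=lambda j: (sq_dist(records[j], records[i]), j))
        group = [i] + neighbors[:k - 1]
        groups.append(group)
        unassigned.difference_update(group)

    # Replace each record by the centroid of its group.
    anonymized = list(records)
    for group in groups:
        c = centroid([records[j] for j in group])
        for j in group:
            anonymized[j] = c
    return anonymized, groups
```

Calling `microaggregate(records, k)` on a list of numeric tuples returns the anonymized records in their original order together with the index groups. Because the last group absorbs any leftover records, group sizes stay between k and 2k-1.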

Author information

Authors and Affiliations

Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Campus Nord, C/Jordi Girona 1-3, 08034, Barcelona, Spain
Marc Solé & Jordi Nin
Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Campus Nord, C/Jordi Girona 1-3, 08034, Barcelona, Spain
Marc Solé
CA Labs, CA Technologies, WTC Almeda Park. Pl. de la Pau, s/n, 08940, Cornellà de Llobregat, Catalonia, Spain
Victor Muntés-Mulero

Corresponding author

Correspondence to Jordi Nin.

About this article

Cite this article

Solé, M., Muntés-Mulero, V. & Nin, J. Efficient microaggregation techniques for large numerical data volumes. Int. J. Inf. Secur. 11, 253–267 (2012). https://doi.org/10.1007/s10207-012-0158-5

Published: 08 February 2012
Issue Date: August 2012
DOI: https://doi.org/10.1007/s10207-012-0158-5
