
Efficient microaggregation techniques for large numerical data volumes

  • Regular Contribution
International Journal of Information Security

Abstract

The contradictory requirements of data privacy and data analysis have fostered the development of statistical disclosure control techniques. In this context, microaggregation is one of the most frequently used methods, since it offers a good trade-off between simplicity and quality. Unfortunately, most currently available microaggregation algorithms were devised to work with small datasets, while the size of current databases is constantly increasing. The usual way to tackle this problem is to partition large data volumes into smaller fragments that available algorithms can process in reasonable time, but this solution comes at the cost of quality. In this paper, we revisit the computational needs of microaggregation and show that it can be reduced to two steps: sorting the dataset with respect to a vantage point, and a set of k-nearest-neighbor searches. From this point of view, we propose three new efficient, quality-preserving microaggregation algorithms based on k-nearest-neighbor search techniques. We compare our approaches with the most significant strategies in the literature using three very large real datasets. Experimental results show that our proposals outperform previous techniques by striking a better balance between performance and the quality of the anonymized dataset.
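
The two-step view described in the abstract can be illustrated with a minimal sketch. This is not one of the three algorithms proposed in the paper, whose details the abstract does not give; it is a generic fixed-size microaggregation heuristic written under stated assumptions. The function name `microaggregate`, the choice of the dataset centroid as vantage point, and the brute-force neighbor search (standing in for an index structure such as a k-d tree) are all illustrative choices.

```python
import math


def microaggregate(records, k):
    """Illustrative sketch (not the paper's method): group numerical records
    into clusters of at least k and replace each record by its cluster centroid.
    """
    n = len(records)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k records")
    dim = len(records[0])

    # Assumption: use the dataset centroid as the vantage point.
    vantage = [sum(r[j] for r in records) / n for j in range(dim)]

    # Step 1: sort record indices by decreasing distance to the vantage point.
    order = sorted(range(n),
                   key=lambda i: math.dist(records[i], vantage),
                   reverse=True)

    unassigned = set(range(n))
    groups = []
    for seed in order:
        if seed not in unassigned:
            continue
        if len(unassigned) < 2 * k:
            # Fewer than 2k records remain: they form the final group,
            # so that no group ends up with fewer than k members.
            groups.append(sorted(unassigned))
            unassigned.clear()
            break
        # Step 2: k-nearest-neighbor search around the seed.
        # Brute force here; a spatial index would replace this line at scale.
        others = sorted(unassigned - {seed},
                        key=lambda i: (math.dist(records[i], records[seed]), i))
        group = [seed] + others[:k - 1]
        groups.append(group)
        unassigned.difference_update(group)

    # Replace every record by the centroid of its group.
    anonymized = [None] * n
    for group in groups:
        centroid = tuple(sum(records[i][j] for i in group) / len(group)
                         for j in range(dim))
        for i in group:
            anonymized[i] = centroid
    return anonymized, groups


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    anonymized, groups = microaggregate(data, k=3)
    print(groups)
    print(anonymized)
```

In this sketch the brute-force search makes the total cost quadratic in the number of records, which is exactly the part that an efficient k-nearest-neighbor technique would be expected to speed up.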



Author information

Corresponding author

Correspondence to Jordi Nin.


About this article

Cite this article

Solé, M., Muntés-Mulero, V. & Nin, J. Efficient microaggregation techniques for large numerical data volumes. Int. J. Inf. Secur. 11, 253–267 (2012). https://doi.org/10.1007/s10207-012-0158-5


  • DOI: https://doi.org/10.1007/s10207-012-0158-5
