Abstract
Research in medicine, economics and the social sciences needs data about individuals. Such data should therefore be publicly available, but publishing it must not violate the privacy of the individuals concerned. Microaggregation applied to databases is a standard technique to protect privacy. It clusters similar individuals into larger groups to achieve so-called k-anonymity: every individual is hidden in a cluster of size at least k. The data can then be made public for all kinds of analysis, whereas other concepts such as differential privacy keep the database secret and allow outsiders to ask only specific questions about the data.
The modification of a database to achieve anonymity should be as small as possible to preserve its utility, that is, the loss of information should be minimized. In this respect microaggregation typically performs much better than other anonymization techniques such as generalization or suppression. However, minimizing the information loss of k-anonymous microaggregation is an NP-hard optimization problem for \(k \ge 3\). Not only is it unlikely that optimal solutions can be computed efficiently, but nontrivial approximation algorithms are lacking as well. Therefore, a number of heuristics, all with at least quadratic time complexity, have been developed.
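To make the two notions above concrete, the following minimal sketch checks k-anonymity of a given clustering and measures its information loss. It assumes numerical records and uses the ratio SSE/SST (within-cluster squared error over total squared error), a measure common in the microaggregation literature; the abstract itself does not fix a measure, and all function names are illustrative.

```python
from collections import Counter


def is_k_anonymous(labels, k):
    """Return True if every cluster label occurs at least k times."""
    return all(count >= k for count in Counter(labels).values())


def centroid(points):
    """Component-wise mean of a non-empty list of numerical records."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sq_dist(p, q):
    """Squared Euclidean distance between two records."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def information_loss(points, labels):
    """Assumed loss measure SSE / SST; 0 means no loss, 1 means total loss."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    # SSE: squared distance of every record to its own cluster centroid.
    sse = 0.0
    for members in groups.values():
        c = centroid(members)
        sse += sum(sq_dist(p, c) for p in members)
    # SST: squared distance of every record to the overall centroid.
    overall = centroid(points)
    sst = sum(sq_dist(p, overall) for p in points)
    return sse / sst if sst > 0 else 0.0


records = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
loss = information_loss(records, labels)
```

Publishing each record as its cluster centroid then hides every individual among at least k records, and the loss value quantifies how much the published data deviates from the original.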
This paper improves microaggregation significantly and provides a tradeoff between computational effort and utility. First, we analyse and tune in detail the maximum distance methodology, the common approach for generating a clustering that provides k-anonymity. We review the methods proposed so far and design a new algorithm \(\texttt{MDAV}^{*}_\gamma \) that gives better utility on standard benchmarks.
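As a rough illustration of the maximum distance idea, the sketch below repeatedly picks the record farthest from the centroid of the unassigned records and groups it with its nearest neighbours. This is a simplified heuristic written for this summary under the assumption of numerical records and Euclidean distance; it is not \(\texttt{MDAV}^{*}_\gamma \) and not any specific published variant, and the function names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two records."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of records."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def max_distance_clustering(points, k):
    """Simplified maximum-distance heuristic (illustrative only).

    Returns a list of clusters, each a list of record indices.
    If len(points) >= k, every cluster has between k and 2k-1 members.
    """
    remaining = list(range(len(points)))
    clusters = []
    while len(remaining) >= 2 * k:
        c = centroid([points[i] for i in remaining])
        # The most extreme unassigned record seeds the next cluster.
        far = max(remaining, key=lambda i: sq_dist(points[i], c))
        # Take the k unassigned records closest to that seed.
        nearest = sorted(
            remaining, key=lambda i: sq_dist(points[i], points[far])
        )[:k]
        clusters.append(nearest)
        chosen = set(nearest)
        remaining = [i for i in remaining if i not in chosen]
    if remaining:
        # Fewer than 2k records are left; they form the final cluster.
        clusters.append(remaining)
    return clusters


data = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
groups = max_distance_clustering(data, 3)
```

Each round scans and sorts all unassigned records, and there are about n/k rounds, which is where the at least quadratic running time of this family of heuristics comes from.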
A different approach of quadratic time complexity, based on Lloyd’s algorithm, has been proposed under the name ONA, but has not been completely analysed. This paper fills this gap and improves several steps, resulting in a new algorithm \(\texttt{ONA}^{*}\) with better utility.
Mondrian is another approach for clustering data that can be adapted for microaggregation. It is quite fast, but typically achieves very poor utility. We improve on this and design an almost linear time algorithm that gives acceptable utility, although worse than that of the quadratic time algorithms.
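The Mondrian idea of recursive splitting can be sketched as follows. This is a simplified, rank-based variant assuming numerical records: it splits on the attribute with the widest value range at the median position and stops as soon as a part can no longer be divided into two parts of size at least k. It is neither the original Mondrian procedure nor the improved algorithm of this paper, and the function name is hypothetical.

```python
def mondrian_split(points, k):
    """Recursive median splitting (illustrative sketch).

    Returns a list of clusters, each a list of record indices.
    If len(points) >= k, every cluster has between k and 2k-1 members.
    """
    num_dims = len(points[0]) if points else 0

    def value_range(idx, j):
        values = [points[i][j] for i in idx]
        return max(values) - min(values)

    def split(idx):
        # A part with fewer than 2k records cannot be halved safely.
        if len(idx) < 2 * k:
            return [idx]
        # Split along the attribute with the widest spread of values.
        d = max(range(num_dims), key=lambda j: value_range(idx, j))
        ordered = sorted(idx, key=lambda i: points[i][d])
        mid = len(ordered) // 2  # both halves keep at least k records
        return split(ordered[:mid]) + split(ordered[mid:])

    if not points:
        return []
    return split(list(range(len(points))))


data = [[float(i), 0.0] for i in range(10)]
parts = mondrian_split(data, 3)
```

Every level of the recursion sorts each record once and the depth is logarithmic in the number of records, so a procedure of this shape runs in almost linear time; the price is that splits follow coordinate axes rather than the actual neighbourhood structure of the data.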
Finally, we combine both techniques, ONA and Mondrian, to construct a new class of parameterized algorithms called \(\texttt{MONA}\). They are quite fast with time complexity between almost linear and quadratic, and deliver competitive utility compared to the MDAV approach.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Thaeter, F., Reischuk, R. (2023). Improving Time Complexity and Utility of k-anonymous Microaggregation. In: Samarati, P., van Sinderen, M., Vimercati, S.D.C.d., Wijnhoven, F. (eds) E-Business and Telecommunications. ICETE 2021. Communications in Computer and Information Science, vol 1795. Springer, Cham. https://doi.org/10.1007/978-3-031-36840-0_10
DOI: https://doi.org/10.1007/978-3-031-36840-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36839-4
Online ISBN: 978-3-031-36840-0
eBook Packages: Computer Science (R0)