Anonymization is the modification of data to mask the correspondence between a person and sensitive information in the data. Several anonymization models, such as k-anonymity, have been intensively studied. Recently, a new model with less information loss than existing models was proposed; this is a type of non-homogeneous generalization. In this paper, we present an alternative anonymization algorithm that further reduces the information loss using optimization techniques. We also prove that whether a modified dataset satisfies k-anonymity can be checked by a polynomial-time algorithm. Computational experiments demonstrated the efficiency of our algorithm even on large datasets.

Part of this research is supported by the Funding Program for World-Leading Innovative R&D on Science and Technology, Japan. We thank Professor Wong for providing us with the programs used in our experiments.
The gaps between \(\hbox {GCP}_\mathrm{t}\) and \(\hbox {GCP}_\mathrm{unsort}\) are computed by
where \(\hbox {GCP}_\mathrm{unsort}\) represents the information loss of the anonymized dataset obtained using our algorithm without preliminary sorting. Note that we do not compute \(\hbox {Gap}_\mathrm{sort}\) for the instances with \(|\mathcal{T}|=100\hbox {k}\) because these datasets are not partitioned; thus, \(\hbox {GCP}_\mathrm{unsort}\) equals \(\hbox {GCP}_\mathrm{t}\) (Tables 19, 20).
Murakami, K., Uno, T. Optimization algorithm for k-anonymization of datasets with low information loss. Int. J. Inf. Secur. 17, 631–644 (2018). https://doi.org/10.1007/s10207-017-0392-y
DOI: https://doi.org/10.1007/s10207-017-0392-y