Abstract
Incomplete microdata, i.e., microdata with missing values, is very common in real-world datasets. However, existing anonymization techniques, which were developed for complete datasets, suffer serious information loss on incomplete microdata because of missing value pollution. In this paper, we propose a framework for utility-enhanced anonymization of incomplete microdata to address this issue. First, we study the properties of missing value pollution on generalization. Guided by these properties, we develop two top-down anonymization algorithms that preserve data utility on incomplete microdata. Extensive experiments on real-world datasets show that our techniques outperform state-of-the-art techniques in terms of both information loss and missing value pollution.
Notes
Mondrian, Enhanced Mondrian and semi-partition.
Downloadable at http://archive.ics.uci.edu/ml/datasets/Adult.
Downloadable at https://sites.google.com/site/informsdataminingcontest/.
According to the documentation provided by UCI and INFORMS, ‘?’ in the Adult data and -1, -7, -8, -9 in the INFORMS data are treated as missing values.
We assume that the age range is [1, 100] and the Zipcode range is [10001, 50000].
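The missing-value markers and domain ranges stated in the notes above can be written down as a small preprocessing sketch. The function names `normalize_missing` and `normalized_width` are hypothetical and do not come from the paper. Dividing an interval's width by its domain width is a common way to use such ranges when scoring generalized values; that the paper's information loss metric is computed exactly this way is an assumption.

```python
from typing import Optional, Tuple, Union

Value = Union[int, str]

# Missing-value markers as stated in the notes.
ADULT_MISSING = {"?"}
INFORMS_MISSING = {-1, -7, -8, -9}

# Domain ranges as assumed in the notes.
AGE_RANGE = (1, 100)
ZIPCODE_RANGE = (10001, 50000)


def normalize_missing(value: Value, markers: set) -> Optional[Value]:
    """Return None if value is one of the dataset's missing markers."""
    if isinstance(value, str):
        # Strip surrounding whitespace defensively before comparing.
        value = value.strip()
    return None if value in markers else value


def normalized_width(low: int, high: int, domain: Tuple[int, int]) -> float:
    """Width of a generalized interval [low, high] relative to its domain.

    Hypothetical helper: an assumed, commonly used normalization,
    not necessarily the metric defined in the paper.
    """
    dom_low, dom_high = domain
    return (high - low) / (dom_high - dom_low)
```

For example, `normalize_missing("?", ADULT_MISSING)` yields `None`, while a regular value such as `39` passes through unchanged. An interval covering the whole assumed age range has a normalized width of 1.0.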
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grants No. 61572130, 61632008, 61320106007, 61502100 and 61402104, the Jiangsu Provincial Natural Science Foundation under Grants BK20150628, BK20140648 and BK20150637, the Jiangsu Provincial Key Technology R&D Program under Grant BE2014603, the Qing Lan Project of Jiangsu Province, the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant BM2003201, and the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant 93K-9.
Cite this article
Gong, Q., Yang, M., Chen, Z. et al. A framework for utility enhanced incomplete microdata anonymization. Cluster Comput 20, 1749–1764 (2017). https://doi.org/10.1007/s10586-017-0795-6
DOI: https://doi.org/10.1007/s10586-017-0795-6