A framework for utility enhanced incomplete microdata anonymization

Cluster Computing

Abstract

Incomplete microdata, i.e., microdata with missing values, are very common in real-world datasets. However, existing anonymization techniques, which were developed for complete datasets, suffer serious information loss on incomplete microdata because of missing value pollution. In this paper, we propose a framework for utility-enhanced anonymization of incomplete microdata to address this issue. First, we study the properties of missing value pollution under generalization. Guided by these properties, we develop two top-down anonymization algorithms that preserve data utility on incomplete microdata. Extensive experiments on real-world datasets show that our techniques outperform state-of-the-art techniques in terms of both information loss and missing value pollution.
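
The abstract does not define missing value pollution, so the following is only an illustrative sketch under one plausible reading: when a record with a missing value joins a group, a naive generalization must widen that attribute to its whole domain, which also degrades the known values of the other records. The function names are hypothetical, the second function is a simple contrast and not the paper's algorithm, and the age domain [1, 100] is taken from the article's notes.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[int, int]

def generalize_naive(values: Sequence[Optional[int]], domain: Interval) -> Interval:
    """Generalize a group to one interval; any missing value (None) forces the full domain."""
    if any(v is None for v in values):
        return domain
    return (min(values), max(values))

def generalize_known_only(values: Sequence[Optional[int]], domain: Interval) -> Interval:
    """Hypothetical contrast: generalize over known values only, leaving missing ones missing."""
    known = [v for v in values if v is not None]
    if not known:
        return domain
    return (min(known), max(known))

AGE_DOMAIN: Interval = (1, 100)   # age range assumed in the article's notes
group = [34, 36, None, 38]        # one record has no age

print(generalize_naive(group, AGE_DOMAIN))       # (1, 100)
print(generalize_known_only(group, AGE_DOMAIN))  # (34, 38)
```

In this toy group, a single missing age widens the published interval from (34, 38) to the full domain (1, 100) for every record, which is the kind of utility loss the abstract attributes to existing techniques.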

Notes

  1. Mondrian, Enhanced Mondrian and semi-partition.

  2. Downloadable at http://archive.ics.uci.edu/ml/datasets/Adult.

  3. Downloadable at https://sites.google.com/site/informsdataminingcontest/.

  4. According to the documentation provided by UCI and INFORMS, ‘?’ in the Adult data and -1, -7, -8, -9 in the INFORMS data are treated as missing values.

  5. We assume that the age range is [1, 100] and the Zipcode range is [10001, 50000].
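
Notes 4 and 5 can be turned into a small preprocessing sketch. The code below is a minimal illustration, not the authors' implementation: the function names are hypothetical, stripping whitespace around Adult values is an assumption about the raw file, and the penalty is a plain normalized interval width, which this excerpt does not confirm as the paper's information loss measure.

```python
from typing import Optional

# Missing-value markers as listed in note 4.
ADULT_MISSING = {"?"}
INFORMS_MISSING = {-1, -7, -8, -9}

# Attribute ranges as assumed in note 5.
DOMAINS = {"age": (1, 100), "zipcode": (10001, 50000)}

def normalize_adult(value: str) -> Optional[str]:
    """Map the Adult marker '?' to None (whitespace stripping is an assumption)."""
    v = value.strip()
    return None if v in ADULT_MISSING else v

def normalize_informs(value: int) -> Optional[int]:
    """Map the INFORMS codes -1, -7, -8, -9 to None."""
    return None if value in INFORMS_MISSING else value

def range_penalty(low: int, high: int, attribute: str) -> float:
    """Width of the generalized interval [low, high] relative to the attribute's range."""
    dom_low, dom_high = DOMAINS[attribute]
    return (high - low) / (dom_high - dom_low)

print(normalize_adult(" ?"))                    # None
print(normalize_informs(-7))                    # None
print(range_penalty(1, 100, "age"))             # 1.0
print(range_penalty(10001, 50000, "zipcode"))   # 1.0
```

With these ranges, an interval covering the whole domain has penalty 1.0, while a narrow age interval such as [30, 39] has penalty 9/99.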

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grants No. 61572130, 61632008, 61320106007, 61502100 and 61402104, the Jiangsu Provincial Natural Science Foundation under Grants BK20150628, BK20140648 and BK20150637, the Jiangsu Provincial Key Technology R&D Program under Grant BE2014603, the Qing Lan Project of Jiangsu Province, the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant BM2003201, and the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant 93K-9.

Author information

Corresponding author

Correspondence to Junzhou Luo.

About this article

Cite this article

Gong, Q., Yang, M., Chen, Z. et al. A framework for utility enhanced incomplete microdata anonymization. Cluster Comput 20, 1749–1764 (2017). https://doi.org/10.1007/s10586-017-0795-6

  • DOI: https://doi.org/10.1007/s10586-017-0795-6
