Abstract
Scalable data processing platforms built on cloud computing are becoming increasingly attractive as infrastructure for big data mining and analytics applications. However, privacy concerns remain one of the major obstacles to using public cloud platforms. In practice, data generalisation is a widely adopted anonymisation technique for preserving privacy in data publishing and sharing scenarios. Multidimensional anonymisation, a global-recoding generalisation scheme, has recently attracted attention because it can balance data obfuscation and data usability. Existing approaches handle data sets much larger than main memory by storing data on disk at runtime, which incurs an impractical serial I/O cost. In this paper, we propose a scalable iterative multidimensional anonymisation approach for big data sets based on MapReduce, a state-of-the-art large-scale data processing paradigm. Our basic idea is to partition a large data set recursively into smaller partitions using MapReduce until every partition fits in the memory of a single computing node. A tree indexing structure is proposed to support this recursive partitioning on MapReduce. Experimental results on real-life data sets demonstrate that the proposed approach significantly improves the scalability and time-efficiency of multidimensional anonymisation over existing approaches, and is therefore applicable to big data applications.
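The partitioning idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names, the `capacity` parameter (a stand-in for "fits in the memory of a node"), the path-string keys (a stand-in for the tree index), and the widest-range median split (borrowed from the Mondrian heuristic) are all assumptions made for illustration. Each pass of the loop plays the role of one MapReduce job, with the routing step as the map phase and the grouping by child path as the reduce phase.

```python
from collections import defaultdict
from statistics import median_low


def choose_split(records, k):
    """Pick (dimension, median) for the widest attribute whose median split
    leaves at least k records on each side; return None if no split works."""
    def value_range(d):
        values = [r[d] for r in records]
        return max(values) - min(values)

    # Try attributes from the widest value range to the narrowest.
    for d in sorted(range(len(records[0])), key=value_range, reverse=True):
        m = median_low([r[d] for r in records])
        left = sum(1 for r in records if r[d] <= m)
        if left >= k and len(records) - left >= k:
            return d, m
    return None


def partition_until_fit(records, capacity, k):
    """Iteratively split partitions holding more than `capacity` records.
    Partitions are keyed by a tree path: "r" is the root, "r0" and "r1"
    are its children, and so on (hypothetical naming)."""
    pending = {"r": list(records)}
    leaves = {}  # path -> records of a finished partition
    tree = {}    # path -> (dimension, split value) of an internal node
    while pending:
        # Decide which pending partitions still need a split.
        for path, recs in list(pending.items()):
            split = choose_split(recs, k) if len(recs) > capacity else None
            if split is None:
                # Small enough, or not splittable without violating k.
                leaves[path] = pending.pop(path)
            else:
                tree[path] = split
        # "Map": route each record to a child path; "reduce": group by path.
        next_round = defaultdict(list)
        for path, recs in pending.items():
            d, m = tree[path]
            for r in recs:
                next_round[path + ("0" if r[d] <= m else "1")].append(r)
        pending = dict(next_round)
    return leaves, tree


# Toy data: (age, zip-like code) pairs, invented for this example.
records = [(25, 100), (30, 110), (35, 120), (40, 130),
           (45, 140), (50, 150), (55, 160), (60, 170)]
leaves, tree = partition_until_fit(records, capacity=2, k=2)
```

In this toy run, `leaves` maps each leaf path to the records that would then be generalised in memory, and `tree` records the split chosen at each internal node. A partition that cannot be split without dropping below `k` records on one side is kept as a leaf even if it exceeds `capacity`.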
Acknowledgments
This paper is partially supported by the Open Project of the State Key Laboratory for Novel Software Technology (Nos. KFKT2015A03 and KFKT2016B22), the Natural Science Foundation of China (No. 61402258), and the China Postdoctoral Science Foundation (No. 2015M571739).
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Zhang, X., Qi, L., He, Q., Dou, W. (2016). Scalable Iterative Implementation of Mondrian for Big Data Multidimensional Anonymisation. In: Wang, G., Ray, I., Alcaraz Calero, J., Thampi, S. (eds) Security, Privacy and Anonymity in Computation, Communication and Storage. SpaCCS 2016. Lecture Notes in Computer Science, vol 10067. Springer, Cham. https://doi.org/10.1007/978-3-319-49145-5_31
DOI: https://doi.org/10.1007/978-3-319-49145-5_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49144-8
Online ISBN: 978-3-319-49145-5
eBook Packages: Computer Science, Computer Science (R0)