Abstract
One of the main steps in the data lifecycle is publishing data so that analysts can discover hidden patterns. However, data publishing may lead to unwanted disclosure of personal information and cause privacy problems. Data anonymization techniques enforce privacy models to prevent the disclosure of individuals' private information in published data. In this paper, a distributed in-memory method is proposed on the Apache Spark framework to satisfy the ℓ-diversity privacy model. The method anonymizes large-scale data in a three-phase process: seed selection, data clustering for ℓ-diversity, and a finalizing phase. A hierarchical k-means-based data clustering algorithm has been designed for the anonymization. One of the major challenges of anonymization methods is establishing a good trade-off between data utility and privacy. Therefore, to calculate the distance between records and form more cohesive ℓ-diverse clusters, the authors have designed two distance functions, one Manhattan-based and one Euclidean-based, that satisfy the requirements of the ℓ-diversity model. Given Spark's reported speedup of up to 100-fold over MapReduce, the proposed method is implemented with in-memory RDD programming in Apache Spark to address the runtime, scalability, and performance problems that earlier MapReduce-based algorithms exhibit in large-scale data anonymization. The method also offers general guidance on using Spark's parallel in-memory computation for big data anonymization. In experiments, the method obtained lower information loss while losing only about 1% to 2% in accuracy and F-measure; it therefore establishes a better trade-off than the state-of-the-art MapReduce-based Mondrian methods.
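To make the clustering idea concrete, the following is a minimal single-machine sketch in plain Python, not the authors' Spark implementation. It assumes numeric quasi-identifiers, uses the textbook Manhattan and Euclidean distances rather than the paper's modified distance functions, and checks the simplest (distinct) variant of ℓ-diversity, since the abstract does not say which variant is used. The record layout, seed values, and function names are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def manhattan(a, b):
    """Sum of absolute differences between two numeric records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest(records, seeds, distance):
    """Group (quasi_identifiers, sensitive_value) records by nearest seed."""
    clusters = defaultdict(list)
    for qi, sensitive in records:
        nearest = min(range(len(seeds)), key=lambda i: distance(qi, seeds[i]))
        clusters[nearest].append((qi, sensitive))
    return dict(clusters)


def is_l_diverse(cluster, l):
    """Distinct l-diversity (assumed variant): at least l different sensitive values."""
    return len({sensitive for _, sensitive in cluster}) >= l


# Hypothetical toy data: (age, zip prefix) as quasi-identifiers,
# a diagnosis as the sensitive value.
records = [
    ((25, 10), "flu"),
    ((27, 11), "asthma"),
    ((26, 10), "flu"),
    ((60, 40), "diabetes"),
    ((62, 41), "diabetes"),
]
seeds = [(25, 10), (60, 40)]

clusters = assign_to_nearest(records, seeds, manhattan)
diverse = {cid: is_l_diverse(members, 2) for cid, members in clusters.items()}
print(diverse)  # {0: True, 1: False}
```

In this toy run the second cluster contains only one distinct sensitive value, so it fails the check for ℓ = 2. A cluster like this would need further handling, such as merging or reassigning records, before it could be published; the abstract does not describe how the finalizing phase deals with that case.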










Cite this article
Ashkouti, F., Khamforoosh, K., Sheikhahmadi, A. et al. DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark. J Supercomput 78, 2616–2650 (2022). https://doi.org/10.1007/s11227-021-03958-3
DOI: https://doi.org/10.1007/s11227-021-03958-3