DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark

Ashkouti, Farough; Khamforoosh, Keyhan; Sheikhahmadi, Amir; Khamfroush, Hana

doi:10.1007/s11227-021-03958-3

Published: 05 July 2021

The Journal of Supercomputing, volume 78, pages 2616–2650 (2022)

Farough Ashkouti¹, Keyhan Khamforoosh¹ (ORCID: orcid.org/0000-0003-0792-0523), Amir Sheikhahmadi¹ & Hana Khamfroush²

Abstract

Publishing data so that analysts can discover hidden patterns is one of the main steps in the data lifecycle. However, data publishing may lead to unwanted disclosure of personal information and cause privacy problems. Data anonymization techniques enforce privacy models to prevent the disclosure of individuals’ private information in published data. In this paper, a distributed in-memory method is proposed on the Apache Spark framework to satisfy the ℓ-diversity privacy model. The method anonymizes large-scale data in a three-phase process: seed selection, data clustering for ℓ-diversity, and a finalizing phase. A hierarchical k-means-based data clustering algorithm has been designed for the anonymization. One of the major challenges of anonymization methods is to establish a good trade-off between data utility and privacy. Therefore, to calculate the distance between records and form more cohesive ℓ-diverse clusters, the authors have designed two distance functions, one Manhattan-based and one Euclidean-based, that satisfy the requirements of the ℓ-diversity model. Given the reported 100-fold speed advantage of Spark over MapReduce, the proposed method is implemented using in-memory RDD programming in Apache Spark to address the runtime, scalability, and performance problems of previous MapReduce-based algorithms in large-scale data anonymization. The method also provides general guidance on using Spark’s parallel in-memory computation for big data anonymization. In experiments, the method obtained lower information loss and lost about 1% to 2% on the accuracy and F-measure criteria; it therefore establishes a better trade-off than the state-of-the-art MapReduce-based Mondrian methods.
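The abstract does not give the authors' distance functions or clustering procedure, so the following is only a minimal single-machine sketch of the general ideas it names: plain Manhattan and Euclidean distances over numeric quasi-identifiers, assignment of records to their nearest seed, and a check for distinct ℓ-diversity (at least ℓ different sensitive values per cluster). All function names and the record layout are hypothetical, and the sketch does not use Spark or reproduce the paper's ℓ-diversity-aware distances.

```python
from math import sqrt


def manhattan(a, b):
    """Plain Manhattan (L1) distance between two numeric tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Plain Euclidean (L2) distance between two numeric tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest(records, seeds, dist):
    """Assign each (quasi_identifiers, sensitive_value) record to its nearest seed.

    Hypothetical helper: a single k-means-style assignment step,
    not the hierarchical procedure described in the paper.
    """
    clusters = [[] for _ in seeds]
    for quasi, sensitive in records:
        nearest = min(range(len(seeds)), key=lambda i: dist(quasi, seeds[i]))
        clusters[nearest].append((quasi, sensitive))
    return clusters


def is_distinct_l_diverse(sensitive_values, l):
    """Distinct l-diversity: the group holds at least l different sensitive values."""
    return len(set(sensitive_values)) >= l
```

With records such as `((25, 50000), "flu")` and two seeds, `assign_to_nearest(records, seeds, manhattan)` returns one list of records per seed, and each cluster's sensitive values can then be passed to `is_distinct_l_diverse` to see whether it would need to be merged or enlarged before publication.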

Author information

Authors and Affiliations

Department of Computer Engineering, Sanandaj Branch, Islamic Azad University, Sanandaj, Iran
Farough Ashkouti, Keyhan Khamforoosh & Amir Sheikhahmadi
Department of Computer Science, University of Kentucky, Lexington, USA
Hana Khamfroush

Corresponding author

Correspondence to Keyhan Khamforoosh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ashkouti, F., Khamforoosh, K., Sheikhahmadi, A. et al. DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark. J Supercomput 78, 2616–2650 (2022). https://doi.org/10.1007/s11227-021-03958-3

Accepted: 17 June 2021
Published: 05 July 2021
Issue Date: February 2022
DOI: https://doi.org/10.1007/s11227-021-03958-3
