Abstract
Privacy concerns about Resource Description Framework (RDF) data have grown with the use of open government data and individuals' healthcare data. Because such data may include personal information, it must undergo a de-identification process that deletes or replaces parts of the original data. To provide this protection, a method for applying k-anonymization to RDF data has previously been developed. However, sensitive RDF information protected only by k-anonymization is not completely secure and remains vulnerable to attacks. In this paper, we propose an l-diversity anatomy de-identification method that overcomes the limitations of k-anonymity and guarantees stronger privacy protection than k-anonymization. Because this anonymization process is computationally intensive, we use Spark distributed computing to perform de-identification quickly and so improve its practical utility. We also propose a way to preserve l-diversity for dynamically evolving RDF datasets. Experimental results show that our proposed distributed l-diversity algorithm processes the data more efficiently than conventional approaches.
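To make the anatomy idea concrete, the following is a minimal single-machine sketch in the spirit of the Anatomy grouping step: records are bucketed by sensitive value, and each group takes one record from each of the l currently largest buckets. It is an illustration under stated assumptions, not the paper's Spark-based algorithm; the function name `anatomize`, the `(record_id, sensitive_value)` input shape, and the example values are all hypothetical.

```python
from collections import defaultdict


def anatomize(records, l):
    """Group records so that every group holds at least l distinct sensitive values.

    records: list of (record_id, sensitive_value) pairs (hypothetical input shape).
    Returns a list of groups, each a dict mapping sensitive_value -> [record_id].
    """
    # Bucket record ids by their sensitive value.
    buckets = defaultdict(list)
    for rid, value in records:
        buckets[value].append(rid)

    groups = []
    # Group creation: while at least l buckets are non-empty, draw one record
    # from each of the l largest buckets (ties broken by value for determinism).
    while sum(1 for ids in buckets.values() if ids) >= l:
        largest = sorted(
            (v for v in buckets if buckets[v]),
            key=lambda v: (-len(buckets[v]), v),
        )[:l]
        groups.append({v: [buckets[v].pop()] for v in largest})

    # Residue assignment: each leftover record joins a group that does not
    # yet contain its sensitive value, so no value repeats within a group.
    for value, leftover in buckets.items():
        for rid in leftover:
            target = next((g for g in groups if value not in g), None)
            if target is None:
                raise ValueError("l-diversity not achievable for this input")
            target[value] = [rid]
    return groups
```

Publication in the anatomy style would then release quasi-identifiers and sensitive values in two separate tables linked only by a group identifier, so that a record can be tied to a sensitive value with probability at most 1/l. A distributed version would need to partition the buckets across workers, which this sketch does not attempt.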
Acknowledgments
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01417) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation); and by the Basic Science Research Program through the NRF, funded by the Ministry of Education (No. NRF-2018R1D1A1B07048380).
Author information
Contributions
MinHyuk Jeon and Dong-Hyuk Im conceived the problem and supervised the overall research; Jinhyun Ahn clarified some points that helped Dong-Hyuk Im to develop the algorithm; Odsuren Temuujin and MinHyuk Jeon implemented the algorithm and performed the experiments; MinHyuk Jeon, Odsuren Temuujin, and Dong-Hyuk Im wrote the paper.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest.
About this article
Cite this article
Jeon, M., Temuujin, O., Ahn, J. et al. Distributed L-diversity using spark-based algorithm for large resource description frameworks data. J Supercomput 77, 7270–7286 (2021). https://doi.org/10.1007/s11227-020-03583-6
DOI: https://doi.org/10.1007/s11227-020-03583-6