Optimizing data privacy: an RFD-based approach to anonymization strategy selection

Sadeghi-Nasab, Alireza; Rahmani, Mohsen

doi:10.1007/s11227-024-06642-4

Published: 06 November 2024

The Journal of Supercomputing
Volume 81, article number 134 (2025)


Abstract

This paper presents a novel Anonymization Strategy Selection Framework that combines relaxed functional dependencies (RFDs) and particle swarm optimization (PSO) to balance data privacy and utility. Our approach extracts RFDs from datasets, generates diverse anonymization strategies using domain generalization hierarchies (DGHs), and employs PSO to optimize the choice of strategy. We introduce a fitness function that balances k-anonymity and information loss. The framework's innovation lies in using RFDs to capture fine-grained data dependencies, enabling more nuanced anonymization. Evaluation on widely used datasets from the UCI Machine Learning Repository shows that our framework outperforms existing techniques, achieving higher k-anonymity levels with lower information loss. Our adaptive approach generates hybrid strategies that combine elements from multiple RFDs, resulting in superior privacy-utility trade-offs. This research advances privacy-preserving data publishing by providing a flexible, effective tool for generating anonymized datasets that maintain high utility for downstream analysis.
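
The abstract does not give the form of the fitness function, so the following is only a minimal sketch of how a score balancing k-anonymity against information loss might look. Everything in it is an assumption made for illustration: the toy records, the two hypothetical generalization hierarchies (`generalize_age`, `generalize_zip`), the level-based loss measure, and the `target_k` and `weight` parameters. An exhaustive search stands in for PSO because the toy search space has only 18 strategies, and the RFD extraction step is not modeled at all.

```python
from collections import Counter
from itertools import product

# Toy quasi-identifier records (age, ZIP code); purely illustrative data.
RECORDS = [(23, "13053"), (27, "13058"), (25, "13051"),
           (41, "14850"), (45, "14853"), (48, "14857")]

# Highest generalization level of each hypothetical hierarchy.
MAX_LEVELS = {"age": 2, "zip": 5}


def generalize_age(age, level):
    """Level 0: exact age, level 1: ten-year band, level 2: suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"


def generalize_zip(zip_code, level):
    """Mask the last `level` digits of a five-digit ZIP code."""
    kept = len(zip_code) - level
    return zip_code[:kept] + "*" * level


def anonymize(records, strategy):
    """Apply one generalization level per attribute to every record."""
    return [(generalize_age(age, strategy["age"]),
             generalize_zip(zip_code, strategy["zip"]))
            for age, zip_code in records]


def k_anonymity(rows):
    """Size of the smallest group of identical quasi-identifier rows."""
    return min(Counter(rows).values())


def information_loss(strategy):
    """Assumed loss measure: mean of level / max level, in [0, 1]."""
    return sum(strategy[a] / MAX_LEVELS[a] for a in strategy) / len(strategy)


def fitness(records, strategy, target_k=2, weight=0.5):
    """Weighted sum of a capped privacy score and a utility score."""
    k = k_anonymity(anonymize(records, strategy))
    privacy = min(k / target_k, 1.0)
    utility = 1.0 - information_loss(strategy)
    return weight * privacy + (1.0 - weight) * utility


def best_strategy(records):
    """Exhaustive search over all level combinations (stand-in for PSO)."""
    candidates = [{"age": a, "zip": z}
                  for a, z in product(range(MAX_LEVELS["age"] + 1),
                                      range(MAX_LEVELS["zip"] + 1))]
    return max(candidates, key=lambda s: fitness(records, s))


print(best_strategy(RECORDS))
```

On these toy records the search selects one level of generalization for each attribute: that is the least generalized strategy that reaches the assumed target of k = 2, and it in fact yields groups of three. Releasing the raw values leaves every record unique, while full suppression maximizes k at the cost of all utility.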

Data availability

No datasets were generated or analyzed during the current study.

Notes

https://github.com/AlirezaSN/data-anonymization-framework

Author information

Authors and Affiliations

Computer Engineering Group, Faculty of Engineering, Arak University, Arak, Iran
Alireza Sadeghi-Nasab & Mohsen Rahmani

Contributions

Alireza Sadeghi-Nasab contributed to conceptualization, methodology, software, data curation, visualization, writing—original draft preparation. Mohsen Rahmani was involved in supervision, writing—reviewing and editing, validation.

Corresponding author

Correspondence to Alireza Sadeghi-Nasab.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

DGHs for adult dataset

See Figs. 4, 5, 6, 7, 8, 9 and 10.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sadeghi-Nasab, A., Rahmani, M. Optimizing data privacy: an RFD-based approach to anonymization strategy selection. J Supercomput 81, 134 (2025). https://doi.org/10.1007/s11227-024-06642-4

Accepted: 21 October 2024
Published: 06 November 2024
DOI: https://doi.org/10.1007/s11227-024-06642-4
