Abstract
This paper presents a novel Anonymization Strategy Selection Framework that combines relaxed functional dependencies (RFDs) and particle swarm optimization (PSO) to balance data privacy and utility. Our approach extracts RFDs from datasets, generates diverse anonymization strategies using domain generalization hierarchies, and employs PSO to optimize the choice of strategy. We introduce a fitness function that balances k-anonymity against information loss. The framework's innovation lies in using RFDs to capture fine-grained data dependencies, enabling more nuanced anonymization. Evaluation on widely used datasets from the UCI Machine Learning Repository shows that our framework outperforms existing techniques, achieving higher k-anonymity levels with lower information loss. Our adaptive approach generates hybrid strategies that combine elements from multiple RFDs, resulting in superior privacy-utility trade-offs. This research advances privacy-preserving data publishing by providing a flexible, effective tool for generating anonymized datasets that maintain high utility for downstream analysis.
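As an illustration of the kind of fitness function described above, the following minimal Python sketch scores a generalization strategy by combining the achieved k-anonymity with a simple information-loss measure. Everything in it is an assumption made for illustration and is not taken from the paper: the toy records, the two generalization hierarchies (generalize_age, generalize_zip), the loss measure (average normalized generalization level), the target_k and weight parameters, and the weighted-sum form of the fitness. For brevity, an exhaustive search over a tiny strategy space stands in for PSO.

```python
from collections import Counter
from itertools import product

# Hypothetical generalization hierarchy for age:
# level 0 = exact value, level 1 = 10-year band, level 2 = suppressed.
def generalize_age(age, level):
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

# Hypothetical hierarchy for a ZIP-like string: mask `level` trailing digits.
def generalize_zip(zip_code, level):
    keep = max(len(zip_code) - level, 0)
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

# Highest generalization level available per attribute (assumed).
MAX_LEVELS = {"age": 2, "zip": 5}

def apply_strategy(records, strategy):
    """Generalize every record's quasi-identifiers at the chosen levels."""
    return [
        (generalize_age(r["age"], strategy["age"]),
         generalize_zip(r["zip"], strategy["zip"]))
        for r in records
    ]

def k_anonymity(generalized):
    """Size of the smallest equivalence class among the generalized rows."""
    return min(Counter(generalized).values())

def information_loss(strategy):
    """Average normalized level: 0 = no generalization, 1 = full suppression."""
    return sum(strategy[a] / MAX_LEVELS[a] for a in strategy) / len(strategy)

def fitness(records, strategy, target_k=2, weight=0.5):
    """Weighted sum of a capped privacy score and a utility score (assumed form)."""
    k = k_anonymity(apply_strategy(records, strategy))
    privacy = min(k / target_k, 1.0)
    utility = 1.0 - information_loss(strategy)
    return weight * privacy + (1 - weight) * utility

# Toy data, invented for this example.
records = [
    {"age": 23, "zip": "47906"},
    {"age": 27, "zip": "47905"},
    {"age": 35, "zip": "47302"},
    {"age": 38, "zip": "47301"},
]

# Exhaustive search over all level combinations (a stand-in for PSO).
candidates = [
    {"age": a, "zip": z}
    for a, z in product(range(MAX_LEVELS["age"] + 1), range(MAX_LEVELS["zip"] + 1))
]
best = max(candidates, key=lambda s: fitness(records, s))
print(best, round(fitness(records, best), 3))
```

On this toy data, leaving the attributes ungeneralized keeps every record unique (k = 1), while suppressing everything maximizes k at the cost of all utility; the search settles on an intermediate strategy. A real strategy space grows too quickly for exhaustive enumeration, which is the motivation for a metaheuristic such as PSO.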



Data availability
No datasets were generated or analyzed during the current study.
Author information
Contributions
Alireza Sadeghi-Nasab contributed to conceptualization, methodology, software, data curation, visualization, writing—original draft preparation. Mohsen Rahmani was involved in supervision, writing—reviewing and editing, validation.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
About this article
Cite this article
Sadeghi-Nasab, A., Rahmani, M. Optimizing data privacy: an RFD-based approach to anonymization strategy selection. J Supercomput 81, 134 (2025). https://doi.org/10.1007/s11227-024-06642-4
DOI: https://doi.org/10.1007/s11227-024-06642-4