Optimizing data privacy: an RFD-based approach to anonymization strategy selection

The Journal of Supercomputing

Abstract

This paper presents a novel Anonymization Strategy Selection Framework that combines relaxed functional dependencies (RFDs) and particle swarm optimization (PSO) to balance data privacy and utility. The approach extracts RFDs from a dataset, generates diverse anonymization strategies using domain generalization hierarchies (DGHs), and uses PSO to optimize the choice of strategy. We introduce a fitness function that balances k-anonymity against information loss. The framework's innovation lies in using RFDs to capture fine-grained data dependencies, which enables more nuanced anonymization. Evaluation on widely used datasets from the UCI Machine Learning Repository shows that the framework outperforms existing techniques, achieving higher k-anonymity levels with lower information loss. The adaptive approach generates hybrid strategies that combine elements from multiple RFDs, resulting in better privacy-utility trade-offs. This research advances privacy-preserving data publishing by providing a flexible, effective tool for generating anonymized datasets that retain high utility for downstream analysis.
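The trade-off described in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the age hierarchy, the loss measure, the penalty-style fitness, the target `k_target`, and all function names are hypothetical and are not taken from the paper, whose actual fitness function is not given in this preview. An exhaustive search over three generalization levels stands in for the PSO search, which would only be needed when the strategy space is too large to enumerate.

```python
from collections import Counter

MAX_LEVEL = 2  # height of the hypothetical hierarchy below


def generalize_age(age: int, level: int) -> str:
    """Hypothetical DGH for age: 0 = exact value, 1 = decade, 2 = suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"


def anonymize(ages, level):
    """Apply the same generalization level to every record."""
    return [generalize_age(a, level) for a in ages]


def k_anonymity(records):
    """Size of the smallest equivalence class among the generalized values."""
    return min(Counter(records).values())


def information_loss(level):
    """Assumed loss measure: fraction of the hierarchy height that was climbed."""
    return level / MAX_LEVEL


def fitness(ages, level, k_target=3):
    """Assumed fitness: minimize loss once k_target is met, penalize otherwise."""
    k = k_anonymity(anonymize(ages, level))
    if k >= k_target:
        return -information_loss(level)
    # Penalty is always worse than any feasible strategy (loss is at most 1).
    return -(1 + (k_target - k) / k_target)


# Toy data, made up for this example.
ages = [23, 27, 25, 31, 36, 38]
scores = {level: fitness(ages, level) for level in range(MAX_LEVEL + 1)}
best_level = max(scores, key=scores.get)
```

With this toy data, level 0 leaves every record unique, level 2 suppresses everything, and level 1 is the least lossy level that still reaches the assumed target of k = 3, so it receives the highest score.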


Data availability

No datasets were generated or analyzed during the current study.

Notes

  1. https://github.com/AlirezaSN/data-anonymization-framework.


Author information

Contributions

Alireza Sadeghi-Nasab contributed to conceptualization, methodology, software, data curation, visualization, and original draft preparation. Mohsen Rahmani was involved in supervision, reviewing and editing, and validation.

Corresponding author

Correspondence to Alireza Sadeghi-Nasab.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

DGHs for adult dataset

See Figs. 4, 5, 6, 7, 8, 9 and 10.

Fig. 4 DGH for age attribute

Fig. 5 DGH for working-class attribute

Fig. 6 DGH for relationship attribute

Fig. 7 DGH for gender attribute

Fig. 8 DGH for education-level attribute

Fig. 9 DGH for race attribute

Fig. 10 DGH for occupation attribute

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sadeghi-Nasab, A., Rahmani, M. Optimizing data privacy: an RFD-based approach to anonymization strategy selection. J Supercomput 81, 134 (2025). https://doi.org/10.1007/s11227-024-06642-4

  • DOI: https://doi.org/10.1007/s11227-024-06642-4
