Utility-aware Privacy Perturbation for Training Data

Published: 13 February 2024

Abstract

Data perturbation under a differential privacy constraint is an important approach to protecting data privacy. However, as the dimensionality of the data increases, the privacy budget allocated to each dimension shrinks and the amount of added noise grows, which ultimately degrades data utility in training tasks. To protect the privacy of training data while enhancing data utility, we propose a Utility-aware training data Privacy Perturbation scheme based on attribute Partition and budget Allocation (UPPPA). UPPPA comprises three procedures: quantification of attribute privacy and attribute importance, attribute partition, and budget allocation. The quantification of attribute privacy and attribute importance, based on information entropy and attribute correlation, provides an arithmetic basis for attribute partition and budget allocation. During attribute partition, all attributes of the training data are classified into high- and low-privacy classes to achieve privacy amplification and utility enhancement. During budget allocation, a γ-privacy model is proposed to balance data privacy and data utility, providing the privacy constraint that guides budget allocation. Three comprehensive real-world datasets are used to evaluate the performance of UPPPA. Experiments and privacy analysis show that our scheme achieves a tradeoff between privacy and utility.
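To make the quantify/partition/allocate pipeline concrete, the following is a minimal Python sketch of the idea. The entropy-based privacy score, the correlation-based importance score, the median split into high/low classes, the equal budget split between classes, and the per-attribute Laplace perturbation are all simplifying assumptions for illustration only; UPPPA's actual quantification formulas and the γ-privacy constraint are defined in the paper body and differ in detail.

```python
import numpy as np
import pandas as pd


def attribute_entropy(col: pd.Series) -> float:
    """Shannon entropy (bits) of the attribute's empirical distribution."""
    p = col.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())


def attribute_importance(col: pd.Series, label: pd.Series) -> float:
    """Absolute Pearson correlation with the label, as an importance proxy."""
    return float(abs(np.corrcoef(col.to_numpy(dtype=float),
                                 label.to_numpy(dtype=float))[0, 1]))


def upppa_sketch(df: pd.DataFrame, label: pd.Series, eps_total: float):
    """Illustrative pipeline: quantify -> partition -> allocate -> perturb."""
    privacy = {a: attribute_entropy(df[a]) for a in df.columns}
    importance = {a: attribute_importance(df[a], label) for a in df.columns}

    # Partition (assumption): attributes at or above the median entropy form
    # the high-privacy class, the rest the low-privacy class.
    median_h = float(np.median(list(privacy.values())))
    high = [a for a in df.columns if privacy[a] >= median_h]
    low = [a for a in df.columns if privacy[a] < median_h]

    # Allocation (assumption): split eps_total equally between the classes,
    # then apportion within each class by normalized importance, so more
    # important attributes receive more budget and hence less noise.
    budgets = {}
    for group in (high, low):
        if not group:
            continue
        group_eps = eps_total / 2.0
        total_imp = sum(importance[a] for a in group) or 1.0
        for a in group:
            budgets[a] = group_eps * importance[a] / total_imp

    # Perturbation: per-attribute Laplace mechanism; sensitivity is taken
    # as the attribute's empirical range, a common simplification.
    rng = np.random.default_rng(0)
    noisy = df.astype(float).copy()
    for a, eps in budgets.items():
        sens = float(noisy[a].max() - noisy[a].min())
        noisy[a] += rng.laplace(0.0, sens / max(eps, 1e-9), size=len(noisy))
    return noisy, budgets


if __name__ == "__main__":
    # Toy usage: perturb a small tabular dataset under total budget eps = 1.0.
    df = pd.DataFrame({"age": [23, 45, 31, 52, 37, 29],
                       "income": [30, 85, 47, 90, 60, 38],
                       "region": [7, 7, 3, 3, 7, 3]})
    label = pd.Series([0, 1, 0, 1, 1, 0])
    noisy_df, budgets = upppa_sketch(df, label, eps_total=1.0)
    print(budgets)
```

The design point the sketch illustrates is the one the abstract argues: rather than dividing the budget uniformly across all dimensions, partitioning attributes and weighting the per-attribute budget by importance concentrates the limited budget where it most benefits the downstream training task.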


Published In

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 4
May 2024, 707 pages
EISSN: 1556-472X
DOI: 10.1145/3613622

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 February 2024
Online AM: 02 January 2024
Accepted: 15 December 2023
Received: 31 July 2022
Published in TKDD Volume 18, Issue 4


Author Tags

  1. Training data privacy
  2. Data perturbation
  3. Data utility

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Research Foundation of the Key Laboratory of Spaceborne Information Intelligent Interpretation
