Utility-aware Privacy Perturbation for Training Data

Published: 13 February 2024

Abstract

Data perturbation under a differential privacy constraint is an important approach to protecting data privacy. However, as the dimensionality of the data increases, the privacy budget allocated to each dimension shrinks and the amount of added noise grows, which ultimately degrades data utility in training tasks. To protect the privacy of training data while enhancing data utility, we propose a Utility-aware training data Privacy Perturbation scheme based on attribute Partition and budget Allocation (UPPPA). UPPPA comprises three procedures: quantification of attribute privacy and attribute importance, attribute partition, and budget allocation. The quantification of attribute privacy and attribute importance, based on information entropy and attribute correlation, provides an arithmetic basis for attribute partition and budget allocation. During attribute partition, all attributes of the training data are classified into high- and low-privacy classes to achieve privacy amplification and utility enhancement. During budget allocation, a γ-privacy model is proposed to balance data privacy and data utility, providing the privacy constraint that guides budget allocation. Three comprehensive real-world datasets are used to evaluate the performance of UPPPA. Experiments and privacy analysis show that our scheme achieves a tradeoff between privacy and utility.
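To make the quantify/partition/allocate pipeline concrete, the following is a minimal Python sketch of the idea. The entropy-based privacy score, the correlation-based importance score, the median split into high/low classes, the equal budget split between classes, and the per-attribute Laplace perturbation are all simplifying assumptions for illustration only; UPPPA's actual quantification formulas and the γ-privacy constraint are defined in the paper body and differ in detail.

```python
import numpy as np
import pandas as pd


def attribute_entropy(col: pd.Series) -> float:
    """Shannon entropy (bits) of the attribute's empirical distribution."""
    p = col.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())


def attribute_importance(col: pd.Series, label: pd.Series) -> float:
    """Absolute Pearson correlation with the label, as an importance proxy."""
    return float(abs(np.corrcoef(col.to_numpy(dtype=float),
                                 label.to_numpy(dtype=float))[0, 1]))


def upppa_sketch(df: pd.DataFrame, label: pd.Series, eps_total: float):
    """Illustrative pipeline: quantify -> partition -> allocate -> perturb."""
    privacy = {a: attribute_entropy(df[a]) for a in df.columns}
    importance = {a: attribute_importance(df[a], label) for a in df.columns}

    # Partition (assumption): attributes at or above the median entropy form
    # the high-privacy class, the rest the low-privacy class.
    median_h = float(np.median(list(privacy.values())))
    high = [a for a in df.columns if privacy[a] >= median_h]
    low = [a for a in df.columns if privacy[a] < median_h]

    # Allocation (assumption): split eps_total equally between the classes,
    # then apportion within each class by normalized importance, so more
    # important attributes receive more budget and hence less noise.
    budgets = {}
    for group in (high, low):
        if not group:
            continue
        group_eps = eps_total / 2.0
        total_imp = sum(importance[a] for a in group) or 1.0
        for a in group:
            budgets[a] = group_eps * importance[a] / total_imp

    # Perturbation: per-attribute Laplace mechanism; sensitivity is taken
    # as the attribute's empirical range, a common simplification.
    rng = np.random.default_rng(0)
    noisy = df.astype(float).copy()
    for a, eps in budgets.items():
        sens = float(noisy[a].max() - noisy[a].min())
        noisy[a] += rng.laplace(0.0, sens / max(eps, 1e-9), size=len(noisy))
    return noisy, budgets


if __name__ == "__main__":
    # Toy usage: perturb a small tabular dataset under total budget eps = 1.0.
    df = pd.DataFrame({"age": [23, 45, 31, 52, 37, 29],
                       "income": [30, 85, 47, 90, 60, 38],
                       "region": [7, 7, 3, 3, 7, 3]})
    label = pd.Series([0, 1, 0, 1, 1, 0])
    noisy_df, budgets = upppa_sketch(df, label, eps_total=1.0)
    print(budgets)
```

The design point the sketch illustrates is the one the abstract argues: rather than dividing the budget uniformly across all dimensions, partitioning attributes and weighting the per-attribute budget by importance concentrates the limited budget where it most benefits the downstream training task.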


Published In

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 4
May 2024, 707 pages
EISSN: 1556-472X
DOI: 10.1145/3613622

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 February 2024
Online AM: 02 January 2024
Accepted: 15 December 2023
Received: 31 July 2022
Published in TKDD Volume 18, Issue 4


Author Tags

  1. Training data privacy
  2. Data perturbation
  3. Data utility

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Research Foundation of the Key Laboratory of Spaceborne Information Intelligent Interpretation
