Abstract
In local differential privacy (LDP), a challenging problem is generating high-dimensional synthetic data while efficiently capturing the correlations between attributes in a dataset. Existing solutions for low-dimensional data synthesis partition the privacy budget among all attributes; they become ineffective in high-dimensional scenarios because of the large noise and communication cost that the high dimension incurs. However, high dimensionality not only poses challenges but also makes it possible to apply techniques that break this bottleneck. This paper presents SamPrivSyn, a method for high-dimensional data synthesis under LDP that is composed of a marginal sampling module and a data generation module. The marginal sampling module samples two-way marginals from the original data. The sampling process is based on mutual information, which is updated iteratively to retain as much of the correlation between attributes as possible. The data generation module reconstructs the synthetic dataset from the sampled two-way marginals. Comparison experiments on real-world datasets demonstrate the effectiveness and efficiency of the proposed method, with results indicating that SamPrivSyn both protects privacy and retains the correlation information between attributes.
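To make the role of mutual information concrete, the following minimal sketch computes the mutual information of a two-way marginal for a pair of attributes. It is an illustration only, not the SamPrivSyn algorithm: the function names `two_way_marginal` and `mutual_information` and the toy records are assumptions introduced here, and no LDP perturbation is applied (under LDP the marginals would be estimated from perturbed user reports rather than counted directly).

```python
import math
from collections import Counter


def two_way_marginal(records, i, j):
    """Empirical joint distribution of attributes i and j as {(a, b): probability}.

    Hypothetical helper: counts raw records directly, with no privacy noise.
    """
    counts = Counter((r[i], r[j]) for r in records)
    n = len(records)
    return {pair: c / n for pair, c in counts.items()}


def mutual_information(joint):
    """Mutual information (in bits) of a two-way marginal {(a, b): probability}."""
    px, py = Counter(), Counter()
    for (a, b), p in joint.items():
        px[a] += p  # one-way marginal of the first attribute
        py[b] += p  # one-way marginal of the second attribute
    return sum(
        p * math.log2(p / (px[a] * py[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


# Toy data (assumed): attribute 1 copies attribute 0; attribute 2 is independent of it.
records = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
mi_01 = mutual_information(two_way_marginal(records, 0, 1))
mi_02 = mutual_information(two_way_marginal(records, 0, 2))
print(mi_01, mi_02)
```

In this toy dataset the fully dependent pair (0, 1) scores higher than the independent pair (0, 2), which is the kind of signal a mutual-information-based sampler could use to prefer strongly correlated attribute pairs.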
Acknowledgements
This work was supported by the Strategic Research and Consulting Project of the Chinese Academy of Engineering (Grant No. 2022-XY-107).
Cite this article
Chen, X., Wang, C., Yang, Q. et al. Locally differentially private high-dimensional data synthesis. Sci. China Inf. Sci. 66, 112101 (2023). https://doi.org/10.1007/s11432-022-3583-x
DOI: https://doi.org/10.1007/s11432-022-3583-x