
Improving Parameter Estimation and Defensive Ability of Latent Dirichlet Allocation Model Training Under Rényi Differential Privacy

  • Regular Paper
  • Published:
Journal of Computer Science and Technology

Abstract

Latent Dirichlet allocation (LDA) is a topic model widely used for discovering hidden semantics in massive text corpora. Collapsed Gibbs sampling (CGS), a widely used algorithm for learning the parameters of LDA, carries a risk of privacy leakage. Specifically, the word count statistics and latent topic updates in CGS, which are essential for parameter estimation, can be exploited by adversaries to mount effective membership inference attacks (MIAs). To date, two kinds of methods have been used in CGS to defend against MIAs: adding noise to word count statistics and exploiting inherent privacy. Both have limitations. Noise sampled from the Laplacian distribution sometimes produces negative word count statistics, which severely degrade parameter estimation in CGS, while inherent privacy provides only a weak privacy guarantee against MIAs. An effective framework that obtains accurate parameter estimates with guaranteed differential privacy is therefore desirable. The key to accurate parameter estimation when introducing differential privacy into CGS is spending the privacy budget well, so that a precise noise scale can be derived. We introduce Rényi differential privacy (RDP) into CGS for the first time and propose RDP-LDA, an effective framework for analyzing the privacy loss of any differentially private CGS. RDP-LDA derives a tighter upper bound on privacy loss than the overestimates that ε-DP yields for existing differentially private CGS. Within RDP-LDA, we propose a novel truncated-Gaussian mechanism that keeps word count statistics non-negative, and a distribution perturbation method that provides a more rigorous privacy guarantee than inherent privacy.
Experiments validate that our proposed methods produce more accurate parameter estimates under the JS-divergence metric and reduce the precision and recall of MIAs.
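The abstract contrasts Laplacian noise, which can drive word counts negative, with a truncated-Gaussian mechanism that keeps them non-negative. The paper's exact mechanism is not given here; the sketch below is a minimal illustration of the general idea, assuming a rejection-sampling implementation of truncation at a lower bound of zero (both assumptions, not the authors' specification):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(count, scale):
    # Plain Laplace mechanism: the noisy count can go negative,
    # which is the failure mode the abstract describes.
    return count + rng.laplace(0.0, scale)

def truncated_gaussian_count(count, sigma, lower=0.0):
    # Gaussian noise, resampled (rejection sampling) until the
    # perturbed count is at least `lower`, so the noisy word
    # count statistic stays non-negative by construction.
    while True:
        noisy = count + rng.normal(0.0, sigma)
        if noisy >= lower:
            return noisy

counts = [3.0, 0.0, 7.0, 1.0]           # toy per-word counts
noisy = [truncated_gaussian_count(c, sigma=1.0) for c in counts]
assert all(n >= 0.0 for n in noisy)      # never negative
```

Note that truncation biases the noise distribution upward near zero; any privacy analysis of such a mechanism must account for the truncation rather than reuse the plain Gaussian bound.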
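The tighter bound RDP-LDA obtains comes from RDP's additive composition across sampling iterations followed by a single conversion to (ε, δ)-DP. A minimal sketch of that accounting for a plain Gaussian mechanism, using the standard formulas from Mironov (2017) with illustrative parameter values (σ, iteration count, δ are assumptions, not the paper's settings):

```python
import math

def rdp_gaussian(alpha, sigma, sensitivity=1.0):
    # alpha-RDP of the Gaussian mechanism:
    # eps(alpha) = alpha * sensitivity^2 / (2 * sigma^2)
    return alpha * sensitivity**2 / (2.0 * sigma**2)

def rdp_to_dp(rdp_eps, alpha, delta):
    # Standard conversion from alpha-RDP to (eps, delta)-DP:
    # eps = eps_RDP(alpha) + log(1/delta) / (alpha - 1)
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def composed_epsilon(alpha, sigma, iterations, delta):
    # RDP composes additively over iterations; converting the
    # summed RDP once is what yields a tighter bound than
    # composing per-iteration eps-DP guarantees directly.
    total_rdp = iterations * rdp_gaussian(alpha, sigma)
    return rdp_to_dp(total_rdp, alpha, delta)

# Optimize the Renyi order alpha over an integer grid.
eps = min(composed_epsilon(a, sigma=4.0, iterations=100, delta=1e-5)
          for a in range(2, 64))
```

With these toy values the minimum lands at α = 3, giving ε ≈ 15.13; the point is the accounting pattern, not the numbers.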



Author information


Corresponding author

Correspondence to Hong Chen.

Supplementary Information

ESM 1

(PDF 158 kb)


About this article


Cite this article

Huang, T., Zhao, SY., Chen, H. et al. Improving Parameter Estimation and Defensive Ability of Latent Dirichlet Allocation Model Training Under Rényi Differential Privacy. J. Comput. Sci. Technol. 37, 1382–1397 (2022). https://doi.org/10.1007/s11390-022-2425-x

