
Improving Parameter Estimation and Defensive Ability of Latent Dirichlet Allocation Model Training Under Rényi Differential Privacy

  • Regular Paper
  • Published:
Journal of Computer Science and Technology

Abstract

Latent Dirichlet allocation (LDA) is a topic model widely used for discovering hidden semantics in massive text corpora. Collapsed Gibbs sampling (CGS), a widely used algorithm for learning the parameters of LDA, carries a risk of privacy leakage. Specifically, the word count statistics and latent topic updates in CGS, which are essential for parameter estimation, can be exploited by adversaries to mount effective membership inference attacks (MIAs). To date, two kinds of methods have been used in CGS to defend against MIAs: adding noise to word count statistics and exploiting inherent privacy. Both have limitations. Noise sampled from the Laplacian distribution sometimes produces negative word count statistics, which severely degrade parameter estimation in CGS, while inherent privacy provides only a weak privacy guarantee against MIAs. An effective framework that obtains accurate parameter estimates with guaranteed differential privacy is therefore desirable. The key to accurate parameter estimation when introducing differential privacy into CGS is spending the privacy budget well, so that a precise noise scale can be derived. We introduce Rényi differential privacy (RDP) into CGS for the first time and propose RDP-LDA, an effective framework for analyzing the privacy loss of any differentially private CGS. RDP-LDA derives a tighter upper bound on privacy loss than the overestimates that ε-DP yields for existing differentially private CGS. Within RDP-LDA, we propose a novel truncated-Gaussian mechanism that keeps word count statistics non-negative, and a distribution perturbation method that provides a more rigorous privacy guarantee than inherent privacy.
Experiments validate that our proposed methods produce more accurate parameter estimates under the JS-divergence metric and reduce the precision and recall of MIAs.
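The abstract contrasts Laplacian noise, which can drive word counts negative, with a truncated-Gaussian mechanism that keeps them non-negative. The paper's exact mechanism is not given here; the sketch below is a minimal illustration of the general idea, assuming a rejection-sampling implementation of truncation at a lower bound of zero (both assumptions, not the authors' specification):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(count, scale):
    # Plain Laplace mechanism: the noisy count can go negative,
    # which is the failure mode the abstract describes.
    return count + rng.laplace(0.0, scale)

def truncated_gaussian_count(count, sigma, lower=0.0):
    # Gaussian noise, resampled (rejection sampling) until the
    # perturbed count is at least `lower`, so the noisy word
    # count statistic stays non-negative by construction.
    while True:
        noisy = count + rng.normal(0.0, sigma)
        if noisy >= lower:
            return noisy

counts = [3.0, 0.0, 7.0, 1.0]           # toy per-word counts
noisy = [truncated_gaussian_count(c, sigma=1.0) for c in counts]
assert all(n >= 0.0 for n in noisy)      # never negative
```

Note that truncation biases the noise distribution upward near zero; any privacy analysis of such a mechanism must account for the truncation rather than reuse the plain Gaussian bound.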
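The tighter bound RDP-LDA obtains comes from RDP's additive composition across sampling iterations followed by a single conversion to (ε, δ)-DP. A minimal sketch of that accounting for a plain Gaussian mechanism, using the standard formulas from Mironov (2017) with illustrative parameter values (σ, iteration count, δ are assumptions, not the paper's settings):

```python
import math

def rdp_gaussian(alpha, sigma, sensitivity=1.0):
    # alpha-RDP of the Gaussian mechanism:
    # eps(alpha) = alpha * sensitivity^2 / (2 * sigma^2)
    return alpha * sensitivity**2 / (2.0 * sigma**2)

def rdp_to_dp(rdp_eps, alpha, delta):
    # Standard conversion from alpha-RDP to (eps, delta)-DP:
    # eps = eps_RDP(alpha) + log(1/delta) / (alpha - 1)
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def composed_epsilon(alpha, sigma, iterations, delta):
    # RDP composes additively over iterations; converting the
    # summed RDP once is what yields a tighter bound than
    # composing per-iteration eps-DP guarantees directly.
    total_rdp = iterations * rdp_gaussian(alpha, sigma)
    return rdp_to_dp(total_rdp, alpha, delta)

# Optimize the Renyi order alpha over an integer grid.
eps = min(composed_epsilon(a, sigma=4.0, iterations=100, delta=1e-5)
          for a in range(2, 64))
```

With these toy values the minimum lands at α = 3, giving ε ≈ 15.13; the point is the accounting pattern, not the numbers.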



Author information


Corresponding author

Correspondence to Hong Chen.

Supplementary Information

ESM 1

(PDF 158 kb)


About this article


Cite this article

Huang, T., Zhao, SY., Chen, H. et al. Improving Parameter Estimation and Defensive Ability of Latent Dirichlet Allocation Model Training Under Rényi Differential Privacy. J. Comput. Sci. Technol. 37, 1382–1397 (2022). https://doi.org/10.1007/s11390-022-2425-x

