Industrial Federated Topic Modeling

Published: 17 February 2021

Abstract

Probabilistic topic modeling has been applied in a variety of industrial applications. Training a high-quality model usually requires a massive amount of data to provide comprehensive co-occurrence information for the model to learn. However, industrial data such as medical or financial records are often proprietary or sensitive, which precludes uploading to data centers. Hence, training topic models in industrial scenarios using conventional approaches faces a dilemma: A party (i.e., a company or institute) has to either tolerate data scarcity or sacrifice data privacy. In this article, we propose a framework named Industrial Federated Topic Modeling (iFTM), in which multiple parties collaboratively train a high-quality topic model by simultaneously alleviating data scarcity and maintaining immunity to privacy adversaries. iFTM is inspired by federated learning, supports two representative topic models (i.e., Latent Dirichlet Allocation and SentenceLDA) in industrial applications, and consists of novel techniques such as private Metropolis-Hastings, topic-wise normalization, and heterogeneous model integration. We conduct quantitative evaluations to verify the effectiveness of iFTM and deploy iFTM in two real-life applications to demonstrate its utility. Experimental results verify iFTM’s superiority over conventional topic modeling.
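The collaborative setup described in the abstract can be illustrated with a toy sketch. This is not iFTM itself: the abstract gives no algorithmic detail, so the code below only shows the general federated pattern, in which each party keeps its documents local and shares perturbed topic-word counts that an aggregator sums and normalizes per topic. Laplace noise is a stand-in privacy mechanism chosen for illustration, not the paper's private Metropolis-Hastings. The function names, the fixed topic assignments, and the `scale` and `smoothing` values are all hypothetical.

```python
import random


def local_topic_word_counts(docs, assignments, num_topics, vocab):
    """Count, on one party's local data, how often each word is assigned to each topic."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = [[0.0] * len(vocab) for _ in range(num_topics)]
    for doc, topics in zip(docs, assignments):
        for word, k in zip(doc, topics):
            counts[k][index[word]] += 1.0
    return counts


def add_laplace_noise(counts, scale, rng):
    """Perturb every count with Laplace(0, scale) noise (assumed stand-in mechanism).

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return [
        [c + rng.expovariate(rate) - rng.expovariate(rate) for c in row]
        for row in counts
    ]


def aggregate(party_counts, smoothing=0.01):
    """Sum the parties' noisy counts, clip negatives, and normalize each topic row."""
    num_topics = len(party_counts[0])
    vocab_size = len(party_counts[0][0])
    merged = []
    for k in range(num_topics):
        row = [
            max(sum(party[k][v] for party in party_counts), 0.0) + smoothing
            for v in range(vocab_size)
        ]
        total = sum(row)
        merged.append([x / total for x in row])
    return merged


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["loan", "bank", "dose", "drug"]
    # Hypothetical parties; in practice the topic assignments would come
    # from each party's own local sampler rather than being fixed by hand.
    party_a = local_topic_word_counts([["loan", "bank", "loan"]], [[0, 0, 0]], 2, vocab)
    party_b = local_topic_word_counts([["dose", "drug"]], [[1, 1]], 2, vocab)
    shared = [add_laplace_noise(p, 0.5, rng) for p in (party_a, party_b)]
    for k, row in enumerate(aggregate(shared)):
        print(k, [round(x, 3) for x in row])
```

Only the noisy count matrices leave each party in this sketch; the raw documents never do. A larger `scale` gives stronger perturbation at the cost of a noisier merged topic-word distribution.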

Published In

ACM Transactions on Intelligent Systems and Technology  Volume 12, Issue 1
Regular Papers
February 2021
280 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/3436534

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 February 2021
Accepted: 01 July 2020
Revised: 01 May 2020
Received: 01 November 2019
Published in TIST Volume 12, Issue 1

Author Tags

  1. Topic models
  2. differential privacy
  3. federated learning

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • State Key Laboratory of Software Development Environment (Beihang University) Open Program
  • National Science Foundation of China (NSFC)
  • National Key Research and Development Program of China

Article Metrics

  • Downloads (Last 12 months): 30
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 05 Mar 2025

Cited By

  • (2024) A survey on federated learning: a perspective from multi-party computation. Frontiers of Computer Science: Selected Publications from Chinese Universities 18:1. DOI: 10.1007/s11704-023-3282-7. Online publication date: 1-Feb-2024
  • (2023) Federated User Modeling from Hierarchical Information. ACM Transactions on Information Systems 41:2 (1-33). DOI: 10.1145/3560485. Online publication date: 9-Feb-2023
  • (2023) Federated Learning in Smart Cities: Privacy and Security Survey. Information Sciences. DOI: 10.1016/j.ins.2023.03.033. Online publication date: Mar-2023
  • (2023) Assisted driving system based on federated reinforcement learning. Displays 80 (102547). DOI: 10.1016/j.displa.2023.102547. Online publication date: Dec-2023
  • (2023) EAGS: An extracting auxiliary knowledge graph model in multi-turn dialogue generation. World Wide Web 26:4 (1545-1566). DOI: 10.1007/s11280-022-01100-8. Online publication date: 1-Jul-2023
  • (2022) Research on medical data security sharing scheme based on homomorphic encryption. Mathematical Biosciences and Engineering 20:2 (2261-2279). DOI: 10.3934/mbe.2023106. Online publication date: 2022
  • (2022) Survey of recommender systems based on federated learning. SCIENTIA SINICA Informationis 52:5 (713). DOI: 10.1360/SSI-2021-0329. Online publication date: 12-May-2022
  • (2022) Dynamic-Aware Federated Learning for Face Forgery Video Detection. ACM Transactions on Intelligent Systems and Technology 13:4 (1-25). DOI: 10.1145/3501814. Online publication date: 4-Feb-2022
  • (2022) Efficient and secure pedestrian detection in intelligent vehicles based on federated learning. 2022 IEEE 95th Vehicular Technology Conference: (VTC2022-Spring) (1-5). DOI: 10.1109/VTC2022-Spring54318.2022.9860748. Online publication date: Jun-2022
