DOI: 10.1145/3357384.3357909
research-article

Federated Topic Modeling

Published: 03 November 2019

Abstract

Topic modeling has been widely applied in a variety of industrial applications. Training a high-quality model usually requires a massive amount of in-domain data to provide comprehensive co-occurrence information for the model to learn from. However, industrial data such as medical or financial records are often proprietary or sensitive, which precludes uploading them to data centers. Hence, training topic models in industrial scenarios with conventional approaches faces a dilemma: a party (i.e., a company or institute) must either tolerate data scarcity or sacrifice data privacy. In this paper, we propose a novel framework named Federated Topic Modeling (FTM), in which multiple parties collaboratively train a high-quality topic model, simultaneously alleviating data scarcity and remaining immune to privacy adversaries. FTM is inspired by federated learning and consists of novel techniques such as private Metropolis-Hastings, topic-wise normalization, and heterogeneous model integration. We conduct a series of quantitative evaluations to verify the effectiveness of FTM and deploy FTM in an Automatic Speech Recognition (ASR) system to demonstrate its utility in real-life applications. Experimental results verify FTM's superiority over conventional topic modeling.
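
For readers unfamiliar with the sampler the abstract names, the following is a minimal sketch of plain (non-private) Metropolis-Hastings for drawing a topic index from unnormalized weights. It is not the paper's private Metropolis-Hastings, whose details the abstract does not give; the function name `mh_sample_topic`, its parameters, and the uniform proposal are assumptions made for illustration only.

```python
import random


def mh_sample_topic(weights, steps=200, seed=0):
    """Draw an index i with probability proportional to weights[i].

    Illustrative only: standard Metropolis-Hastings with a uniform
    (symmetric) proposal, not the private variant used in FTM.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")
    rng = random.Random(seed)
    k = len(weights)
    current = rng.randrange(k)  # arbitrary starting topic
    for _ in range(steps):
        proposal = rng.randrange(k)
        # The proposal is symmetric, so the acceptance probability
        # reduces to the ratio of the unnormalized weights.
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal
    return current


if __name__ == "__main__":
    # Hypothetical unnormalized topic weights for a single word.
    print(mh_sample_topic([1.0, 3.0, 6.0], steps=100, seed=42))
```

The appeal of this kind of sampler for topic models is that each step needs only a ratio of two weights, so the full distribution over topics never has to be normalized.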

Published In

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
November 2019
3373 pages
ISBN: 9781450369763
DOI: 10.1145/3357384
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bayesian networks
  2. text semantics
  3. topic model

Qualifiers

  • Research-article

Funding Sources

  • National Science Foundation of China (NSFC)

Conference

CIKM '19

Acceptance Rates

CIKM '19 Paper Acceptance Rate 202 of 1,031 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (last 12 months): 50
  • Downloads (last 6 weeks): 2
Reflects downloads up to 05 Mar 2025.

Cited By

  • (2023) Efficient and Secure Skyline Queries Over Vertical Data Federation. IEEE Transactions on Knowledge and Data Engineering 35:9 (9269-9280). DOI: 10.1109/TKDE.2022.3222415. Online publication date: 1-Sep-2023
  • (2023) A survey on federated learning: a perspective from multi-party computation. Frontiers of Computer Science 18:1. DOI: 10.1007/s11704-023-3282-7. Online publication date: 2-Dec-2023
  • (2023) Federated learning for feature-fusion based requirement classification. Cluster Computing 27:3 (3397-3416). DOI: 10.1007/s10586-023-04147-y. Online publication date: 9-Oct-2023
  • (2023) Improving Open-Domain Answer Sentence Selection by Distributed Clients with Privacy Preservation. Advanced Data Mining and Applications (15-29). DOI: 10.1007/978-3-031-46677-9_2. Online publication date: 5-Nov-2023
  • (2022) Intrinsic Performance Influence-based Participant Contribution Estimation for Horizontal Federated Learning. ACM Transactions on Intelligent Systems and Technology 13:6 (1-24). DOI: 10.1145/3523059. Online publication date: 22-Sep-2022
  • (2022) FedBERT: When Federated Learning Meets Pre-training. ACM Transactions on Intelligent Systems and Technology 13:4 (1-26). DOI: 10.1145/3510033. Online publication date: 24-Aug-2022
  • (2022) FedCTR: Federated Native Ad CTR Prediction with Cross-platform User Behavior Data. ACM Transactions on Intelligent Systems and Technology 13:4 (1-19). DOI: 10.1145/3506715. Online publication date: 29-Jun-2022
  • (2022) Federated Dynamic Graph Neural Networks with Secure Aggregation for Video-based Distributed Surveillance. ACM Transactions on Intelligent Systems and Technology 13:4 (1-23). DOI: 10.1145/3501808. Online publication date: 3-May-2022
  • (2022) Federated Non-negative Matrix Factorization for Short Texts Topic Modeling with Mutual Information. 2022 International Joint Conference on Neural Networks (IJCNN) (1-7). DOI: 10.1109/IJCNN55064.2022.9892602. Online publication date: 18-Jul-2022
  • (2022) Manufacturing Data Privacy Protection System for Secure Predictive Maintenance. 2022 5th International Conference on Data Science and Information Technology (DSIT) (1-5). DOI: 10.1109/DSIT55514.2022.9943852. Online publication date: 22-Jul-2022
