research-article

Federated Topic Modeling

Authors:

Qiang Yang

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management

Pages 1071 - 1080

https://doi.org/10.1145/3357384.3357909

Published: 03 November 2019

Abstract

Topic modeling has been widely applied in a variety of industrial applications. Training a high-quality model usually requires a massive amount of in-domain data to provide comprehensive co-occurrence information for the model to learn from. However, industrial data such as medical or financial records are often proprietary or sensitive, which precludes uploading them to data centers. Hence, training topic models in industrial scenarios with conventional approaches faces a dilemma: a party (i.e., a company or institute) has to either tolerate data scarcity or sacrifice data privacy. In this paper, we propose a novel framework named Federated Topic Modeling (FTM), in which multiple parties collaboratively train a high-quality topic model while simultaneously alleviating data scarcity and remaining immune to privacy adversaries. FTM is inspired by federated learning and consists of novel techniques such as private Metropolis-Hastings, topic-wise normalization, and heterogeneous model integration. We conduct a series of quantitative evaluations to verify the effectiveness of FTM and deploy FTM in an Automatic Speech Recognition (ASR) system to demonstrate its utility in real-life applications. Experimental results verify FTM's superiority over conventional topic modeling.
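The abstract names Metropolis-Hastings as one building block but does not describe the private variant used in FTM. As background only, the following is a minimal sketch of a generic, non-private Metropolis-Hastings step for drawing a topic index from unnormalized weights. The function name `mh_sample_topic`, its parameters, and the uniform proposal are assumptions made for illustration; none of them are taken from the paper.

```python
import random


def mh_sample_topic(weights, steps=200, rng=None):
    """Draw a topic index approximately proportional to `weights`.

    Generic Metropolis-Hastings with a uniform (symmetric) proposal.
    This is an illustrative sketch, NOT the private sampler from the paper.
    `weights` are unnormalized, non-negative scores, one per topic.
    """
    rng = rng or random.Random()
    k = len(weights)
    current = rng.randrange(k)  # arbitrary starting topic
    for _ in range(steps):
        proposal = rng.randrange(k)  # uniform proposal over all topics
        if weights[current] <= 0:
            # Always leave a zero-weight state.
            current = proposal
            continue
        # Symmetric proposal, so the acceptance ratio is just the weight ratio.
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal
    return current


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [1.0, 2.0, 7.0]  # hypothetical unnormalized topic scores
    draws = [mh_sample_topic(weights, steps=30, rng=rng) for _ in range(5000)]
    for topic in range(len(weights)):
        print(topic, draws.count(topic) / len(draws))
```

The sampler never needs the normalizing constant of the weights, which is one reason Metropolis-Hastings schemes are attractive for topic models with many topics.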

Index Terms

Federated Topic Modeling
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Topic modeling
2. Information systems
  1. Information systems applications
    1. Data mining

Published In

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management

November 2019

3373 pages

ISBN:9781450369763

DOI:10.1145/3357384

General Chairs: Wenwu Zhu (Tsinghua University, China), Dacheng Tao (University of Massachusetts, USA), Xueqi Cheng (Institute of Computing Technology, CAS, China)

Program Chairs: Peng Cui (Tsinghua University, China), Elke Rundensteiner (Worcester Polytechnic Institute, USA), David Carmel (Amazon Research, USA), Qi He (LinkedIn, USA), Jeffrey Xu Yu (Chinese University of Hong Kong, China)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation of China (NSFC)

Conference

CIKM '19

CIKM '19: The 28th ACM International Conference on Information and Knowledge Management

November 3 - 7, 2019

Beijing, China

Acceptance Rates

CIKM '19 Paper Acceptance Rate 202 of 1,031 submissions, 20%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
