research-article

Industrial Federated Topic Modeling

Authors:

Rongzhong Lian,

Qiang Yang

ACM Transactions on Intelligent Systems and Technology (TIST), Volume 12, Issue 1

Article No.: 2, Pages 1 - 22

https://doi.org/10.1145/3418283

Published: 17 February 2021

Abstract

Probabilistic topic modeling has been applied in a variety of industrial applications. Training a high-quality model usually requires a massive amount of data to provide comprehensive co-occurrence information for the model to learn. However, industrial data such as medical or financial records are often proprietary or sensitive, which precludes uploading them to data centers. Hence, training topic models in industrial scenarios using conventional approaches faces a dilemma: a party (i.e., a company or institute) has to either tolerate data scarcity or sacrifice data privacy. In this article, we propose a framework named Industrial Federated Topic Modeling (iFTM), in which multiple parties collaboratively train a high-quality topic model while simultaneously alleviating data scarcity and maintaining immunity to privacy adversaries. iFTM is inspired by federated learning, supports two representative topic models (i.e., Latent Dirichlet Allocation and SentenceLDA) in industrial applications, and consists of novel techniques such as private Metropolis-Hastings, topic-wise normalization, and heterogeneous model integration. We conduct quantitative evaluations to verify the effectiveness of iFTM and deploy iFTM in two real-life applications to demonstrate its utility. Experimental results verify iFTM’s superiority over conventional topic modeling.
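
For readers unfamiliar with the sampling step named in the abstract, the following is a minimal sketch of the standard (non-private) Metropolis-Hastings algorithm, used here to draw topic indices from unnormalized weights. It is not the private Metropolis-Hastings technique of iFTM, whose details the abstract does not give; the function name `mh_sample_topics`, the uniform proposal, and the example weights are assumptions chosen for illustration only.

```python
import random
from collections import Counter


def mh_sample_topics(weights, n_steps, seed=0):
    """Draw topic indices with probability proportional to `weights`.

    Illustrative, textbook Metropolis-Hastings with a symmetric uniform
    proposal. All weights must be strictly positive. This is NOT the
    private variant proposed in the article.
    """
    rng = random.Random(seed)
    k = len(weights)
    current = rng.randrange(k)  # arbitrary starting topic
    samples = []
    for _ in range(n_steps):
        proposal = rng.randrange(k)  # uniform proposal, hence symmetric
        # Symmetric proposal: the acceptance ratio reduces to a weight ratio.
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal
        samples.append(current)
    return samples


# Hypothetical unnormalized topic weights for a single word.
weights = [1.0, 2.0, 7.0]
samples = mh_sample_topics(weights, n_steps=20000)
counts = Counter(samples)
freqs = [counts[t] / len(samples) for t in range(len(weights))]
print(freqs)  # empirical frequencies approach 0.1, 0.2, 0.7
```

The sampler only ever needs ratios of weights, never the normalizing constant, which is what makes Metropolis-Hastings attractive when there are many topics.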

Index Terms

Computing methodologies
  Machine learning
    Learning paradigms
      Unsupervised learning
        Topic modeling
Information systems
  Information systems applications
    Data mining

Published In

ACM Transactions on Intelligent Systems and Technology Volume 12, Issue 1

Regular Papers

February 2021

280 pages

ISSN:2157-6904

EISSN:2157-6912

DOI:10.1145/3436534

Editor:
Yu Zheng
JD Digits, China

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 February 2021

Accepted: 01 July 2020

Revised: 01 May 2020

Received: 01 November 2019

Qualifiers

Research-article
Research
Refereed

Funding Sources

State Key Laboratory of Software Development Environment (Beihang University) Open Program
National Science Foundation of China (NSFC)
National Key Research and Development Program of China
