A Distributed Topic Model for Large-Scale Streaming Text

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11776)

Abstract

Learning topic information from large-scale unstructured text has attracted extensive attention from both academia and industry. Topic models, such as LDA and its variants, are a popular machine learning technique for discovering such latent structure. Among them, the online variational hierarchical Dirichlet process (onlineHDP) is a promising candidate for dynamically processing streaming text: instead of being fixed in advance, the number of topics in onlineHDP is inferred from the corpus as training proceeds. However, when dealing with large-scale streaming data, onlineHDP still suffers from limited model capacity. To this end, we propose a distributed version of the onlineHDP algorithm, named DistHDP, in which the training task is split into many sub-batch tasks and distributed across multiple worker nodes, so that the whole training process is accelerated. Model convergence is guaranteed through a distributed variational inference algorithm. Extensive experiments on several real-world datasets demonstrate the usability and scalability of the proposed algorithm.
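
As a rough illustration of the scheme the abstract describes (split each mini-batch into per-worker sub-batches, run local inference against a snapshot of the corpus-level parameters, then merge the collected statistics with a decaying natural-gradient step), here is a minimal Python sketch. The driver/worker structure, the toy E-step, and all function names are assumptions modeled on generic online variational inference, not the paper's actual DistHDP implementation.

import numpy as np

def local_e_step(docs, lam):
    """Worker-side E-step sketch: score each word of each document against a
    fixed snapshot of the corpus-level topic-word parameters `lam` and return
    the sufficient statistics this sub-batch contributes. The responsibility
    computation is a toy stand-in for the paper's per-document inference."""
    stats = np.zeros_like(lam)
    for doc in docs:                      # doc: list of (word_id, count) pairs
        for word_id, count in doc:
            phi = lam[:, word_id] / lam[:, word_id].sum()  # topic responsibilities
            stats[:, word_id] += count * phi
    return stats

def train(stream, num_topics, vocab_size, num_workers, corpus_size,
          eta=0.01, tau0=64.0, kappa=0.7):
    """Driver-side loop: scatter each mini-batch across workers, sum their
    statistics, and take a natural-gradient step on the corpus-level
    parameters with a decaying step size."""
    lam = np.random.gamma(1.0, 1.0, (num_topics, vocab_size))
    for t, batch in enumerate(stream):
        # One sub-batch per worker; in a real deployment these E-steps run
        # in parallel on separate worker nodes.
        sub_batches = [batch[i::num_workers] for i in range(num_workers)]
        stats = sum(local_e_step(sb, lam) for sb in sub_batches)
        # Intermediate estimate as if the whole corpus looked like this batch,
        # then blend it in with step size rho_t = (tau0 + t)^(-kappa).
        lam_hat = eta + (corpus_size / len(batch)) * stats
        rho = (tau0 + t) ** (-kappa)
        lam = (1.0 - rho) * lam + rho * lam_hat
    return lam

The decaying step-size schedule with kappa in (0.5, 1] satisfies the usual Robbins-Monro conditions, which is what underpins convergence guarantees of this general form of stochastic variational inference.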

Change history

  • 20 August 2019

    The book was inadvertently published with an uncorrected version. The following corrections should have been carried out before publication:

    1. Page 42: The sentence “By computing the natural gradient, corpus-level parameters are updated according to Eq. (Error! Reference source not found)–Eq. (17).” should correctly read “By computing the natural gradient, corpus-level parameters are updated according to Eqs. (17)–(19).”
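
For context, corpus-level updates in online variational inference follow a standard stochastic natural-gradient form; the generic version below is given as background based on that standard form, not a reproduction of the paper's Eqs. (17)–(19).

% Generic stochastic natural-gradient step for a corpus-level variational
% parameter \lambda: blend the current value with the intermediate estimate
% \hat{\lambda}_t computed from mini-batch t, using a decaying step size.
\[
\lambda_{t+1} = (1 - \rho_t)\,\lambda_t + \rho_t\,\hat{\lambda}_t,
\qquad \rho_t = (\tau_0 + t)^{-\kappa}, \quad \kappa \in (0.5, 1].
\]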

Notes

  1. https://webhose.io/.

  2. https://www.ncbi.nlm.nih.gov/pubmed/.

  3. http://www.wanfangdata.com.cn/index.html.

  4. https://www.cnki.net/.

  5. https://github.com/blei-lab/lda-c.

Acknowledgments

This work is sponsored by the National Key R&D Program of China [grant number 2018YFB0204300].

Author information

Corresponding author

Correspondence to Dawei Feng.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Y., Feng, D., Lu, M., Li, D. (2019). A Distributed Topic Model for Large-Scale Streaming Text. In: Douligeris, C., Karagiannis, D., Apostolou, D. (eds) Knowledge Science, Engineering and Management. KSEM 2019. Lecture Notes in Computer Science, vol 11776. Springer, Cham. https://doi.org/10.1007/978-3-030-29563-9_4

  • DOI: https://doi.org/10.1007/978-3-030-29563-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29562-2

  • Online ISBN: 978-3-030-29563-9

  • eBook Packages: Computer Science, Computer Science (R0)
