A Distributed Topic Model for Large-Scale Streaming Text

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11776)

Abstract

Learning topic information from large-scale unstructured text has attracted extensive attention from both academia and industry. Topic models, such as LDA and its variants, are a popular machine learning technique for discovering such latent structure. Among them, the online variational hierarchical Dirichlet process (onlineHDP) is a promising candidate for dynamically processing streaming text: instead of being fixed in advance, the number of topics in onlineHDP is inferred from the corpus as training proceeds. However, when dealing with large-scale streaming data, onlineHDP still suffers from limited model capacity. To this end, we propose a distributed version of the onlineHDP algorithm, named DistHDP, in which the training task is split into many sub-batch tasks and distributed across multiple worker nodes, so that the whole training process is accelerated. Model convergence is guaranteed through a distributed variational inference algorithm. Extensive experiments on several real-world datasets demonstrate the usability and scalability of the proposed algorithm.
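
As a rough illustration of the scheme the abstract describes (split each mini-batch into per-worker sub-batches, run local inference against a snapshot of the corpus-level parameters, then merge the collected statistics with a decaying natural-gradient step), here is a minimal Python sketch. The driver/worker structure, the toy E-step, and all function names are assumptions modeled on generic online variational inference, not the paper's actual DistHDP implementation.

import numpy as np

def local_e_step(docs, lam):
    """Worker-side E-step sketch: score each word of each document against a
    fixed snapshot of the corpus-level topic-word parameters `lam` and return
    the sufficient statistics this sub-batch contributes. The responsibility
    computation is a toy stand-in for the paper's per-document inference."""
    stats = np.zeros_like(lam)
    for doc in docs:                      # doc: list of (word_id, count) pairs
        for word_id, count in doc:
            phi = lam[:, word_id] / lam[:, word_id].sum()  # topic responsibilities
            stats[:, word_id] += count * phi
    return stats

def train(stream, num_topics, vocab_size, num_workers, corpus_size,
          eta=0.01, tau0=64.0, kappa=0.7):
    """Driver-side loop: scatter each mini-batch across workers, sum their
    statistics, and take a natural-gradient step on the corpus-level
    parameters with a decaying step size."""
    lam = np.random.gamma(1.0, 1.0, (num_topics, vocab_size))
    for t, batch in enumerate(stream):
        # One sub-batch per worker; in a real deployment these E-steps run
        # in parallel on separate worker nodes.
        sub_batches = [batch[i::num_workers] for i in range(num_workers)]
        stats = sum(local_e_step(sb, lam) for sb in sub_batches)
        # Intermediate estimate as if the whole corpus looked like this batch,
        # then blend it in with step size rho_t = (tau0 + t)^(-kappa).
        lam_hat = eta + (corpus_size / len(batch)) * stats
        rho = (tau0 + t) ** (-kappa)
        lam = (1.0 - rho) * lam + rho * lam_hat
    return lam

The decaying step-size schedule with kappa in (0.5, 1] satisfies the usual Robbins-Monro conditions, which is what underpins convergence guarantees of this general form of stochastic variational inference.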

Change history

  • 20 August 2019

    The book was inadvertently published with an uncorrected version. The following corrections should have been carried out before publication:

    1. Page 42: The sentence “By computing the natural gradient, corpus-level parameters are updated according to Eq. (Error! Reference source not found)–Eq. (17).” should correctly read “By computing the natural gradient, corpus-level parameters are updated according to Eqs. (17)–(19).”
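
For context, corpus-level updates in online variational inference follow a standard stochastic natural-gradient form; the generic version below is given as background based on that standard form, not a reproduction of the paper's Eqs. (17)–(19).

% Generic stochastic natural-gradient step for a corpus-level variational
% parameter \lambda: blend the current value with the intermediate estimate
% \hat{\lambda}_t computed from mini-batch t, using a decaying step size.
\[
\lambda_{t+1} = (1 - \rho_t)\,\lambda_t + \rho_t\,\hat{\lambda}_t,
\qquad \rho_t = (\tau_0 + t)^{-\kappa}, \quad \kappa \in (0.5, 1].
\]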

Notes

  1. https://webhose.io/.

  2. https://www.ncbi.nlm.nih.gov/pubmed/.

  3. http://www.wanfangdata.com.cn/index.html.

  4. https://www.cnki.net/.

  5. https://github.com/blei-lab/lda-c.

Acknowledgments

This work is sponsored by the National Key R&D Program of China [grant number 2018YFB0204300].

Author information

Corresponding author

Correspondence to Dawei Feng.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Y., Feng, D., Lu, M., Li, D. (2019). A Distributed Topic Model for Large-Scale Streaming Text. In: Douligeris, C., Karagiannis, D., Apostolou, D. (eds) Knowledge Science, Engineering and Management. KSEM 2019. Lecture Notes in Computer Science, vol 11776. Springer, Cham. https://doi.org/10.1007/978-3-030-29563-9_4

  • DOI: https://doi.org/10.1007/978-3-030-29563-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29562-2

  • Online ISBN: 978-3-030-29563-9

  • eBook Packages: Computer Science, Computer Science (R0)
