BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections

  • Conference paper
Analysis of Images, Social Networks and Texts (AIST 2015)

Abstract

Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. In this paper we announce the BigARTM open source project (http://bigartm.org) for regularized multimodal topic modeling of large collections. Several experiments on the Wikipedia corpus show that BigARTM runs faster and achieves better perplexity than other popular packages, such as Vowpal Wabbit and Gensim. We also demonstrate several unique BigARTM features, such as additive combination of regularizers, topic sparsing and decorrelation, and multimodal and multilanguage modeling, which are not available in other software packages for topic modeling.
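
For readers unfamiliar with the perplexity measure used in such comparisons, the following is a minimal sketch of the standard definition for a topic model, in which the probability of word w in document d is the sum over topics of p(w|t) p(t|d). It is not BigARTM code and does not use the BigARTM API; the function name `perplexity`, the argument layout and the toy data are all assumptions made for illustration.

```python
import math


def perplexity(counts, phi, theta):
    """Perplexity of a topic model on a bag-of-words collection.

    Hypothetical layout (an assumption of this sketch, not a library API):
      counts[d][w] = number of times word w occurs in document d
      phi[w][t]    = p(w | t), word-in-topic probabilities
      theta[t][d]  = p(t | d), topic-in-document probabilities
    """
    num_topics = len(theta)
    log_likelihood = 0.0
    total_tokens = 0
    for d, doc in enumerate(counts):
        for w, n_dw in enumerate(doc):
            if n_dw == 0:
                continue  # unseen words do not contribute to the likelihood
            # Model probability of word w in document d.
            p_wd = sum(phi[w][t] * theta[t][d] for t in range(num_topics))
            log_likelihood += n_dw * math.log(p_wd)
            total_tokens += n_dw
    # Exponent of the negative average log-likelihood per token.
    return math.exp(-log_likelihood / total_tokens)


# Toy collection: 2 documents, 3 words, 2 topics (made-up numbers).
counts = [[2, 1, 0], [0, 1, 2]]
phi = [[0.6, 0.1], [0.3, 0.3], [0.1, 0.6]]
theta = [[0.9, 0.1], [0.1, 0.9]]

print(perplexity(counts, phi, theta))
```

Lower values are better: a model that assigns every one of V words the same probability has perplexity exactly V, so a useful model on this toy collection should score below 3.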

Notes

  1. http://stackoverflow.com/questions/2538070/atomic-operation-cost

  2. http://en.wikipedia.org/wiki/Denormal_number#Performance_issues

  3. http://code.google.com/p/protobuf/

  4. https://github.com/JohnLangford/vowpal_wabbit/

  5. http://dumps.wikimedia.org/enwiki/20141208/

  6. https://github.com/piskvorky/gensim/tree/develop/gensim/scripts/

  7. http://dumps.wikimedia.org/ruwiki/20141203/

  8. https://tech.yandex.ru/mystem/

References

  1. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)

  2. Blei, D.M., Jordan, M.I.: Modeling annotated data. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 127–134. ACM, New York (2003)

  3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  4. Daud, A., Li, J., Zhou, L., Muhammad, F.: Knowledge discovery through directed probabilistic topic models: a survey. Front. Comput. Sci. China 4(2), 280–301 (2010)

  5. Hoffman, M.D., Blei, D.M., Bach, F.R.: Online learning for latent Dirichlet allocation. In: NIPS, pp. 856–864. Curran Associates Inc. (2010)

  6. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57. ACM, New York (1999)

  7. Liu, Z., Zhang, Y., Chang, E.Y., Sun, M.: PLDA+: parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Trans. Intell. Syst. Technol. 2(3), 26:1–26:18 (2011)

  8. Newman, D., Asuncion, A., Smyth, P., Welling, M.: Distributed algorithms for topic models. J. Mach. Learn. Res. 10, 1801–1828 (2009)

  9. Rubin, T.N., Chambers, A., Smyth, P., Steyvers, M.: Statistical topic models for multi-label document classification. Mach. Learn. 88(1–2), 157–208 (2012)

  10. Smola, A., Narayanamurthy, S.: An architecture for parallel topic models. Proc. VLDB Endow. 3(1–2), 703–710 (2010)

  11. Vorontsov, K.V.: Additive regularization for topic models of text collections. Dokl. Math. 89(3), 301–304 (2014)

  12. Vorontsov, K.V., Potapenko, A.A.: Additive regularization of topic models. Mach. Learn. 101(1–3), 303–323 (2015)

  13. Vorontsov, K., Potapenko, A.: Tutorial on probabilistic topic modeling: additive regularization for stochastic matrix factorization. In: Ignatov, D.I., Khachay, M.Y., Panchenko, A., Konstantinova, N., Yavorsky, R.E. (eds.) AIST 2014. CCIS, vol. 436, pp. 29–46. Springer, Heidelberg (2014)

  14. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. ELRA, Valletta (2010)

  15. Wang, Y., Bai, H., Stanton, M., Chen, W.-Y., Chang, E.Y.: PLDA: parallel latent Dirichlet allocation for large-scale applications. In: Goldberg, A.V., Zhou, Y. (eds.) AAIM 2009. LNCS, vol. 5564, pp. 301–314. Springer, Heidelberg (2009)

Acknowledgements

The work was supported by the Russian Foundation for Basic Research grants 14-07-00847, 14-07-00908, 14-07-31176 and by Skolkovo Institute of Science and Technology (project 081-R).

Author information

Corresponding author

Correspondence to Konstantin Vorontsov.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Vorontsov, K., Frei, O., Apishev, M., Romov, P., Dudarenko, M. (2015). BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections. In: Khachay, M., Konstantinova, N., Panchenko, A., Ignatov, D., Labunets, V. (eds) Analysis of Images, Social Networks and Texts. AIST 2015. Communications in Computer and Information Science, vol 542. Springer, Cham. https://doi.org/10.1007/978-3-319-26123-2_36

  • DOI: https://doi.org/10.1007/978-3-319-26123-2_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26122-5

  • Online ISBN: 978-3-319-26123-2

  • eBook Packages: Computer Science, Computer Science (R0)
