Renormalization Approach to the Task of Determining the Number of Topics in Topic Modeling

Conference paper
Intelligent Computing (SAI 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1228)


Abstract

Topic modeling is a widely used approach for clustering text documents; however, it requires the user to specify a set of parameters, such as the number of topics. In this paper, we propose a novel approach for fast approximation of the optimal number of topics that corresponds well to human judgment. Our method combines renormalization theory with the Rényi entropy approach. Its main advantage is computational speed, which is crucial when dealing with big data. We apply our method to the Latent Dirichlet Allocation model with the Gibbs sampling procedure and test it on two datasets in different languages. Numerical results and a comparison of computational speed demonstrate a significant gain in time with respect to standard grid-search methods.
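The paper's exact entropy functional and renormalization procedure are given in the full text; as an illustration only, the sketch below shows a generic Rényi entropy of order q, together with the entropy-minimization heuristic the abstract alludes to (a topic model whose distributions are more peaked, i.e. lower in entropy, is taken as better organized). The function and toy distributions here are illustrative assumptions, not the authors' exact criterion.

```python
import numpy as np

def renyi_entropy(p, q=2.0):
    """Generic Rényi entropy of order q for a probability vector p.

    For q -> 1 this reduces to the Shannon entropy; for q != 1 it is
    log(sum(p_i^q)) / (1 - q).
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # ignore zero-probability entries
    if abs(q - 1.0) < 1e-12:
        return -np.sum(p * np.log(p))  # Shannon limit
    return np.log(np.sum(p ** q)) / (1.0 - q)

# Toy example: a peaked (well-separated) distribution has lower Rényi
# entropy than a uniform (unstructured) one, which is the intuition
# behind selecting the topic number that minimizes the entropy.
uniform = np.full(8, 1 / 8)
peaked = np.array([0.9] + [0.1 / 7] * 7)
assert renyi_entropy(peaked) < renyi_entropy(uniform)
```

In a hypothetical grid search one would train an LDA model for each candidate topic number, evaluate such an entropy measure on the resulting topic-word distributions, and keep the minimizer; the renormalization approach of the paper avoids retraining the model at every candidate value, which is the source of the reported speed-up.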



Acknowledgments

The study was implemented in the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE) in 2019.

Author information

Correspondence to Vera Ignatenko.


Copyright information

© 2020 Springer Nature Switzerland AG

Cite this paper

Koltcov, S., Ignatenko, V. (2020). Renormalization Approach to the Task of Determining the Number of Topics in Topic Modeling. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Computing. SAI 2020. Advances in Intelligent Systems and Computing, vol 1228. Springer, Cham. https://doi.org/10.1007/978-3-030-52249-0_16
