Abstract
Topic modeling is a widely used approach for clustering text documents; however, it requires the user to set several parameters, such as the number of topics. In this paper, we propose a novel approach for fast approximation of the optimal number of topics that corresponds well to human judgment. Our method combines renormalization theory with the Rényi entropy approach. Its main advantage is computational speed, which is crucial when dealing with big data. We apply our method to the latent Dirichlet allocation (LDA) model with Gibbs sampling and test it on two datasets in different languages. Numerical results and a comparison of computational speed demonstrate a significant gain in time over standard grid-search methods.
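The Rényi-entropy side of the approach can be sketched as follows. This is a minimal illustration, assuming NumPy: published descriptions of the entropy approach for topic models threshold word probabilities at the uniform level 1/W and use a deformation parameter q = 1/T; the function names and the exact normalization below are illustrative assumptions, not the authors' implementation, and the renormalization speed-up itself is omitted.

```python
import numpy as np

def renyi_entropy(p, q):
    """Standard Rényi entropy of order q (q != 1) for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability outcomes do not contribute
    return np.log(np.sum(p ** q)) / (1.0 - q)

def entropy_of_topic_model(phi, num_topics):
    """Illustrative Rényi entropy of a word-topic matrix `phi` (shape W x T).

    Only entries above the uniform level 1/W are treated as informative,
    in the spirit of the entropy approach; the retained probabilities are
    renormalized and scored with deformation parameter q = 1/T.
    """
    W = phi.shape[0]
    informative = phi[phi > 1.0 / W]
    if informative.size == 0:
        return np.inf  # no word rises above the uniform level
    p = informative / informative.sum()
    q = 1.0 / num_topics
    return renyi_entropy(p, q)
```

In a grid-search baseline, one would fit LDA separately for each candidate T, score each fitted word-topic matrix with a function like `entropy_of_topic_model`, and take the T at the entropy minimum; the renormalization idea is intended to avoid re-running the sampler for every candidate T.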
Acknowledgments
The study was implemented in the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE) in 2019.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Koltcov, S., Ignatenko, V. (2020). Renormalization Approach to the Task of Determining the Number of Topics in Topic Modeling. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Computing. SAI 2020. Advances in Intelligent Systems and Computing, vol 1228. Springer, Cham. https://doi.org/10.1007/978-3-030-52249-0_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-52248-3
Online ISBN: 978-3-030-52249-0