Abstract
This paper proposes reexamining ancestors of modern topic modeling techniques that seem to have been forgotten. We present an experiment comparing results obtained with six contemporary techniques against a factorization technique developed in the early 1960s and a contemporary adaptation of it based on non-negative matrix factorization. Results on internal and external coherence, as well as on topic diversity, suggest that extracting topics by applying factorization methods to a word-by-word correlation matrix, computed on documents segmented into smaller contextual windows, produces topics that are clearly more coherent and more diverse than those produced by topic modeling techniques based on term-document matrices.
Notes
- 6. The two additional datasets are available from https://provalisresearch.com/tm/datasets.zip.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Peladeau, N. (2022). Revisiting the Past to Reinvent the Future: Topic Modeling with Single Mode Factorization. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham. https://doi.org/10.1007/978-3-031-08473-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08472-0
Online ISBN: 978-3-031-08473-7