Revisiting the Past to Reinvent the Future: Topic Modeling with Single Mode Factorization

Conference paper

Natural Language Processing and Information Systems (NLDB 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13286)

Abstract

This paper proposes reexamining ancestors of modern topic modeling techniques that seem to have been forgotten. We present an experiment comparing results obtained with six contemporary techniques against a factorization technique developed in the early sixties and a contemporary adaptation of it based on non-negative matrix factorization. Results on internal and external coherence, as well as on topic diversity, suggest that extracting topics by applying factorization methods to a word-by-word correlation matrix, computed on documents segmented into smaller contextual windows, produces topics that are clearly more coherent and more diverse than those obtained with topic modeling techniques operating on term-document matrices.
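
To make the idea concrete, the sketch below (Python, scikit-learn) factorizes a word-by-word correlation matrix computed from documents split into small contextual windows, rather than a term-document matrix. It is a minimal illustration under assumptions, not the procedure evaluated in the paper: the 40-token window, the use of Pearson correlation, the clipping of negative correlations, and scikit-learn's NMF are all choices made for this sketch.

```python
# Minimal sketch (assumptions noted above): topics from NMF applied to a
# word-by-word correlation matrix built over small contextual windows.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

def topics_from_word_correlations(docs, window=40, n_topics=10, top_n=10):
    # 1. Segment each document into contextual windows of ~`window` tokens.
    segments = []
    for doc in docs:
        tokens = doc.lower().split()
        for i in range(0, len(tokens), window):
            segments.append(" ".join(tokens[i:i + window]))

    # 2. Segment-by-term count matrix, then a word-by-word correlation matrix.
    vec = CountVectorizer(stop_words="english", min_df=2).fit(segments)
    vocab = np.array(vec.get_feature_names_out())
    counts = vec.transform(segments).toarray()
    corr = np.corrcoef(counts, rowvar=False)   # word x word Pearson correlations
    corr = np.nan_to_num(corr)                 # guard against zero-variance terms
    np.fill_diagonal(corr, 0.0)
    corr = np.clip(corr, 0.0, None)            # NMF requires non-negative input

    # 3. Factorize the correlation matrix instead of a term-document matrix.
    W = NMF(n_components=n_topics, init="nndsvda",
            max_iter=500, random_state=0).fit_transform(corr)

    # 4. Describe each factor by its top-loading words.
    return [vocab[np.argsort(W[:, k])[::-1][:top_n]].tolist()
            for k in range(n_topics)]
```

Each returned word list can then be scored with coherence and topic-diversity measures of the kind discussed in the paper.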

Notes

  1. https://radimrehurek.com/gensim/.

  2. https://pypi.org/project/bitermplus/.

  3. https://github.com/MIND-Lab/OCTIS.

  4. https://github.com/MilaNLProc/contextualized-topic-models.

  5. https://provalisresearch.com/products/content-analysis-software/.

  6. The two additional datasets are available from https://provalisresearch.com/tm/datasets.zip.

  7. https://github.com/dice-group/Palmetto.

Author information


Corresponding author

Correspondence to Normand Peladeau.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Peladeau, N. (2022). Revisiting the Past to Reinvent the Future: Topic Modeling with Single Mode Factorization. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham. https://doi.org/10.1007/978-3-031-08473-7_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-08473-7_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08472-0

  • Online ISBN: 978-3-031-08473-7

  • eBook Packages: Computer Science, Computer Science (R0)
