Hierarchical Topic Model Inference by Community Discovery on Word Co-occurrence Networks

  • Conference paper
Data Mining (AusDM 2022)

Abstract

The most popular topic modelling algorithm, Latent Dirichlet Allocation, produces a flat set of topics. However, topics naturally exist in a hierarchy, with larger, more general super-topics and smaller, more specific sub-topics. We develop a novel topic modelling algorithm, Community Topic, that mines communities from word co-occurrence networks to produce topics. The fractal structure of networks provides a natural topic hierarchy: sub-topics can be found by iteratively mining the sub-graph formed by a single topic, and super-topics can be found by mining the network of topic hyper-nodes. We compare the topic hierarchies discovered by Community Topic to those produced by two probabilistic graphical topic models and find that Community Topic uncovers a topic hierarchy with a more coherent structure and a tighter relationship between parent and child topics. Community Topic also finds this hierarchy more quickly than the probabilistic models and allows for on-demand sub- and super-topic discovery, facilitating corpus exploration by researchers.
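The hierarchy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names, the document-level co-occurrence counts used as edge weights, and the `min_weight` thresholds are all assumptions made for illustration. Community discovery is replaced by a deliberately crude stand-in (connected components over sufficiently heavy edges) so that the example needs only the standard library.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_network(docs):
    """Edge weight = number of documents in which both words appear (assumed weighting)."""
    weights = Counter()
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            weights[(u, v)] += 1
    return weights


def communities(weights, nodes, min_weight):
    """Stand-in for community discovery: connected components of the graph
    induced on `nodes`, keeping only edges with weight >= min_weight."""
    adj = defaultdict(set)
    for (u, v), w in weights.items():
        if w >= min_weight and u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        result.append(comp)
    return result


def hypernode_network(weights, topics):
    """Collapse each topic into one hyper-node; the weight between two
    hyper-nodes is the total weight of the word edges that cross them."""
    label = {word: i for i, topic in enumerate(topics) for word in topic}
    hyper = Counter()
    for (u, v), w in weights.items():
        i, j = label[u], label[v]
        if i != j:
            hyper[(min(i, j), max(i, j))] += w
    return hyper


# Hypothetical toy corpus: two sports sub-themes, one finance theme.
docs = [
    ["goal", "striker"], ["goal", "striker"],
    ["serve", "racket"], ["serve", "racket"],
    ["goal", "serve"],   # weak link between the two sports sub-themes
    ["stock", "bond"], ["stock", "bond"],
]
weights = cooccurrence_network(docs)
vocab = {word for pair in weights for word in pair}

# Top-down: find topics, then mine the sub-graph of a single topic.
topics = communities(weights, vocab, min_weight=1)
sub_topics = communities(weights, topics[1], min_weight=2)

# Bottom-up: find fine-grained topics, then mine the hyper-node network.
fine = communities(weights, vocab, min_weight=2)
hyper = hypernode_network(weights, fine)
groups = communities(hyper, set(range(len(fine))), min_weight=1)
super_topics = [set().union(*(fine[i] for i in g)) for g in groups]

print(topics, sub_topics, super_topics)
```

In this sketch, sub-topic discovery reuses the same routine on the sub-graph of one topic, and super-topic discovery reuses it on the hyper-node network, which mirrors the on-demand exploration the abstract describes. A modularity-based method such as Leiden would be a more realistic choice than thresholded connected components.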

Notes

  1. https://www.kaggle.com/competitions/learn-ai-bbc/data.
  2. https://bab2min.github.io/tomotopy/v0.12.2/en/.
  3. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html.
  4. https://huggingface.co/datasets/reuters21578.
  5. https://www.kaggle.com/competitions/learn-ai-bbc/data.
  6. https://spacy.io/.
  7. https://radimrehurek.com/gensim/.
  8. https://radimrehurek.com/gensim/models/coherencemodel.html.

Author information

Corresponding author

Correspondence to Eric Austin.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Austin, E., Trabelsi, A., Largeron, C., Zaïane, O.R. (2022). Hierarchical Topic Model Inference by Community Discovery on Word Co-occurrence Networks. In: Park, L.A.F., et al. Data Mining. AusDM 2022. Communications in Computer and Information Science, vol 1741. Springer, Singapore. https://doi.org/10.1007/978-981-19-8746-5_11

  • DOI: https://doi.org/10.1007/978-981-19-8746-5_11

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8745-8

  • Online ISBN: 978-981-19-8746-5

  • eBook Packages: Computer Science (R0)
