Probabilistic topic modeling for short text based on word embedding networks

Published in Applied Intelligence.

Abstract

Uncovering topics in short texts can be an arduous task. The inadequacy of general-purpose topic models for handling short documents may be explained by the difficulty of dealing with scarce context information. A variety of strategies have been proposed to address this problem, such as using application-specific information, generating larger pseudo-documents, or modeling a single topic per document. This paper introduces a novel strategy, named Vec2Graph Topic Model (VGTM), to solve this problem. It creates a graph-based representation of the analyzed corpus using word embeddings, named Vec2Graph, and infers topics from overlapping community patterns on this graph. Vec2Graph leverages the semantics of word embeddings to create a dense word-similarity graph, mitigating the lack of context in short text documents. Experiments evaluating topic coherence on four benchmarks and two real-world datasets show that VGTM achieves the best overall results (statistically better in 10 out of 18 experiments) in comparison with standard and state-of-the-art short text topic models. We also analyze the relationship between NPMI, one of our evaluated metrics, and structural patterns in the Vec2Graph representation. We found that networks with a strong community structure tend to present higher NPMI values, suggesting the possibility of directly measuring, and potentially controlling, topic coherence through these network features.
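The core idea of linking words whose embeddings are semantically close into a dense similarity graph can be pictured with a small sketch. This is not the authors' code; the cosine-similarity measure and the 0.5 threshold are illustrative assumptions, and the paper's actual construction may differ:

```python
# Illustrative sketch of a Vec2Graph-style construction: connect words whose
# embedding vectors have cosine similarity above a threshold. The toy
# embeddings and the 0.5 threshold are hypothetical.
import numpy as np

def build_similarity_graph(embeddings, threshold=0.5):
    """Return a symmetric, weighted adjacency matrix over the vocabulary."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T                     # pairwise cosine similarities
    adj = np.where(sim > threshold, sim, 0.0)
    np.fill_diagonal(adj, 0.0)              # no self-loops
    return adj

# Toy vocabulary of three words with 2-d embeddings; the first two are similar.
vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
adj = build_similarity_graph(vecs)
```

On this toy input only the first two words end up connected; in the paper's setting the same idea is applied to real word embeddings over the corpus vocabulary, producing the dense graph from which topics are inferred.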


Availability of Data and Material

Links to the datasets appear as footnotes in the main body of the paper. Two datasets (Courses and CSM) can be made available upon request, as they may include sensitive information.

Code Availability

The Vec2Graph code is available at https://github.com/marcelopita/vec2graph_paper. The VGTM code is available at https://github.com/marcelopita/vgtm.

Notes

  1. There are other approaches, such as visual qualitative analysis, indirect evaluation (e.g., document classification) and topic diversity [23].

  2. Code available at: https://github.com/marcelopita/vec2graph_paper (2021/01/01).

  3. Dataset available at: https://github.com/marcelopita/datasets/blob/master/sanders.csv (2021/01/01)

  4. An interactive version of this graph is available at: https://homepages.dcc.ufmg.br/~marcelo.pita/vec2graph/corpus_graph.html (2021/01/01)

  5. Code available at: https://github.com/marcelopita/vgtm (2021/01/01).

  6. The original label in [47] for this type of regularizer is semantic regularizer, but we consider latent space regularizer a more appropriate nomenclature, as used in [48].

  7. It is important to note that the use of NMF as an overlapping community detector is essentially different from the traditional use of NMF as a topic discovery method. In our case, we have a word-word adjacency matrix derived from \(\mathcal {G}_{C}\), while in the traditional case we usually have a document-term TF-IDF matrix.

  8. An interactive version of this graph with topic information is available at: https://homepages.dcc.ufmg.br/~marcelo.pita/vgtm/sanders.html (2021/09/15)

  9. Benchmark datasets available at: https://github.com/marcelopita/datasets/ (2021/01/01)

  10. In this case, a sample of 15 million documents from the WMT11 news corpus, available at http://www.statmt.org/wmt11/training-monolingual.tgz (2021/09/10)

  11. The NPMI score was calculated using the Palmetto tool [57]. Code available at: https://github.com/dice-group/Palmetto (2021/09/10).

  12. Code available at: https://github.com/dice-group/Palmetto (2021/09/10).

  13. Code available at: https://github.com/gabrielmip/LDAOpt (2021/01/01).

  14. Code available at: https://github.com/xiaohuiyan/BTM (2021/01/01).

  15. Code available at: https://github.com/marcelopita/drex_published (2021/01/01).

  16. GPU-DMM implementation of the STTM tool [13, 61]. Code available at: https://github.com/qiang2100/STTM (2021/01/01).

  17. Code available at: https://github.com/tshi04/SeaNMF (2021/01/01).

  18. Code available at: https://github.com/feliperviegas/cluwords (2021/01/01).

  19. Implemented according to the original paper [10].

  20. Data extracted from an XML file containing 8,102,107 articles and 2,120,659 words.

  21. Data extracted from 17 Brazilian and European Portuguese corpora in a total of 1,395,926,282 words. Available at: http://www.nilc.icmc.usp.br/embeddings (2021/01/01).
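Note 7's distinction can be made concrete: VGTM-style topic inference runs NMF on a word-word adjacency matrix rather than on a document-term matrix. A minimal sketch, assuming a symmetric factorization A ≈ HHᵀ with damped multiplicative updates (a generic SymNMF variant, not the paper's implementation), where row i of H holds word i's overlapping membership strengths across k communities:

```python
# Hedged sketch: overlapping community detection on a word-word adjacency
# matrix A via symmetric NMF (A ≈ H Hᵀ). Damped multiplicative updates keep
# H strictly positive; this is an illustrative variant, not the paper's code.
import numpy as np

def symmetric_nmf(A, k, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k)) + 1e-9
    for _ in range(iters):
        ratio = (A @ H) / (H @ (H.T @ H) + 1e-9)
        H *= 0.5 + 0.5 * ratio          # damped multiplicative update
    return H

# Toy graph with two obvious word communities: {0, 1} and {2, 3}.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
H = symmetric_nmf(A, k=2)
memberships = H / H.sum(axis=1, keepdims=True)  # per-word community weights
```

Because each row of H can put weight on several columns at once, a word may belong to more than one community, which is what allows topics discovered this way to overlap.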

References

  1. Boyd-Graber JL, Hu Y, Mimno D et al (2017) Applications of topic models. Now Publishers Incorporated, 11

  2. Rosso P, Errecalde M, Pinto D (2013) Analysis of short texts on the web: introduction to special issue. Lang Resour Eval 47(1):123–126

  3. Zhang H, Zhong G (2016) Improving short text classification by learning vector representations of both words and hidden topics. Knowl-Based Syst 102:76–86

  4. Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. JMLR 3:993–1022

  5. Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: SIGIR, ACM, pp 267–273

  6. Tang J, Meng Z, Nguyen X, Mei Q, Zhang M (2014) Understanding the limiting factors of topic modeling via posterior contraction analysis. In: ICML, pp 190–198

  7. Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: WWW, ACM, pp 1445–1456

  8. Yin J, Wang J (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In: KDD, ACM, pp 233–242

  9. Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: A pseudo-document view. In: KDD, ACM, pp 2105–2114

  10. Zuo Y, Zhao J, Xu K (2016) Word network topic model: a simple but general solution for short and imbalanced texts. Knowl Inf Syst 48(2):379–398

  11. Bicalho P, Pita M, Pedrosa G, Lacerda A, Pappa GL (2017) A general framework to expand short text for topic modeling. Inf Sci 393:66–81

  12. Nguyen HT, Duong PH, Cambria E (2019) Learning short-text semantic similarity with word embeddings and external knowledge sources. Knowl-Based Syst 182:104842

  13. Qiang J, Qian Z, Li Y, Yuan Y, Wu X (2020) Short text topic modeling techniques, applications, and performance: A survey. TKDE, pp 1–1

  14. Mikolov T, Corrado G, Chen K, Dean J (2013) Efficient Estimation of Word Representations in Vector Space. In: ICLR, pp 1–12

  15. Pennington J, Socher R, Manning CD (2014) GloVe: Global Vectors for Word Representation. In: EMNLP, pp 1532–1543

  16. Nguyen DQ, Billingsley R, Du L, Johnson M (2015) Improving topic models with latent feature word representations. TACL 3:299–313

  17. Li C, Wang H, Zhang Z, Sun A, Ma Z (2016) Topic modeling for short texts with auxiliary word embeddings. In: SIGIR, ACM, pp 165–174

  18. Shi T, Kang K, Choo J, Reddy CK (2018) Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations. In: WWW, pp 1105–1114

  19. Viegas F, Canuto S, Gomes C, Luiz W, Rosa T, Ribas S, Rocha L, Gonçalves MA (2019) Cluwords: exploiting semantic word clustering representation for enhanced topic modeling. In: WSDM, pp 753–761

  20. Hong L, Davison BD (2010) Empirical study of topic modeling in Twitter. In: KDD Workshops, ACM, pp 80–88

  21. Mikolov T, Chen K, Corrado G, Dean J (2013) Distributed Representations of Words and Phrases and their Compositionality. In: NeurIPS, pp 1–9

  22. Xie J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys 45(4):43

  23. Dieng AB, Ruiz FJR, Blei DM (2020) Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics 8:439–453

  24. Srivastava A, Sutton C (2017) Autoencoding variational inference for topic models. In: ICLR, pp 1–12

  25. Zhang H, Chen B, Cong Y, Guo D, Liu H, Zhou M (2020) Deep autoencoding topic model with scalable hybrid bayesian inference. IEEE TPAMI

  26. Zhang H, Chen B, Guo D, Zhou M (2018) WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In: ICLR

  27. Gupta P, Chaudhary Y, Buettner F, Schütze H (2019) Document informed neural autoregressive topic models with distributional prior. In: AAAI, vol 33, pp 6505–6512

  28. Quan X, Kit C, Ge Y, Pan SJ (2015) Short and sparse text topic modeling via self-aggregation. In: AAAI, pp 2270–2276

  29. Li X, Li C, Chi J, Ouyang J (2018) Short text topic modeling by exploring original documents. Knowl Inf Syst 56(2):443–462

  30. Mahmoud H (2008) Pólya urn models. Chapman and Hall/CRC

  31. Das R, Zaheer M, Dyer C (2015) Gaussian LDA for topic models with word embeddings. In: ACL-IJCNLP, pp 795–804

  32. Shi B, Lam W, Jameel S, Schockaert S, Lai KP (2017) Jointly learning word embeddings and latent topics. In: SIGIR, pp 375–384

  33. Li X, Zhang A, Li C, Guo L, Wang W, Ouyang J (2019) Relational biterm topic model: Short-text topic modeling using word embeddings. The Computer Journal 62(3):359–372

  34. Tuan AP, Bach TX, Nguyen TH, Linh NV, Than K (2020) Bag of biterms modeling for short texts. Knowl. Inf. Syst. 62(10):4055–4090

  35. Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: SIGIR, pp 889–892

  36. Qiang J, Chen P, Wang T, Wu X (2017) Topic modeling over short texts by incorporating word embeddings. In: PAKDD, Springer, pp 363–374

  37. Xie P, Yang D, Xing E (2015) Incorporating word correlation knowledge into topic modeling. In: NAACL, pp 725–734

  38. Gao W, Peng M, Wang H, Zhang Y, Xie Q, Tian G (2019) Incorporating word embeddings into topic modeling of short text. Knowl Inf Syst 61(2):1123–1145

  39. Rashid J, Shah SMA, Irtaza A (2019) Fuzzy topic modeling approach for text mining over short text. Information Processing & Management 56(6):102060

  40. Osman AH, Barukub OM (2020) Graph-based text representation and matching: A review of the state of the art and future challenges. IEEE Access 8:87562–87583

  41. Rousseau F, Kiagias E, Vazirgiannis M (2015) Text categorization as a graph classification problem. In: ACL-IJCNLP, pp 1702–1712

  42. Meladianos P, Tixier A, Nikolentzos I, Vazirgiannis M (2017) Real-time keyword extraction from conversations. In: EACL, pp 462–467

  43. Blanco R, Lioma C (2012) Graph-based term weighting for information retrieval. Information retrieval 15(1):54–92

  44. Rousseau F, Vazirgiannis M (2013) Graph-of-word and TW-IDF: new approach to ad hoc IR. In: CIKM, pp 59–68

  45. Malliaros FD, Vazirgiannis M (2017) Graph-based text representations: Boosting text mining, nlp and information retrieval with graphs. In: EMNLP

  46. David E, Jon K (2010) Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, New York, NY, USA

  47. Skianis K, Rousseau F, Vazirgiannis M (2016) Regularizing text categorization with clusters of words. In: EMNLP, pp 1827–1837

  48. Yang L, Cao X, Jin D, Wang X, Meng D (2014) A unified semi-supervised community detection framework using latent space graph regularization. Transactions on Cybernetics 45(11):2585–2598

  49. Amelio A, Pizzuti C (2014) Overlapping community discovery methods: a survey. In: Social Networks: Analysis and Case Studies. Springer, pp 105–125

  50. Wang F, Li T, Wang X, Zhu S, Ding C (2011) Community discovery using nonnegative matrix factorization. DMKD 22(3):493–521

  51. Zhang Y, Yeung D-Y (2012) Overlapping community detection via bounded nonnegative matrix tri-factorization. In: KDD, ACM, pp 606–614

  52. Févotte C, Idier J (2011) Algorithms for nonnegative matrix factorization with the β-divergence. Neural computation 23(9):2421–2456

  53. Sanders NJ (2011) Sanders-twitter sentiment corpus. Sanders Analytics LLC 242:1–4

  54. Phan X-H, Nguyen L-M, Horiguchi S (2008) Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: WWW, ACM, pp 91–100

  55. Vitale D, Ferragina P, Scaiella U (2012) Classification of short texts by deploying topical annotations. In: ECIR, Springer, pp 376–387

  56. The Writing Center, University of North Carolina at Chapel Hill: Paragraphs (2019)

  57. Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: WSDM, ACM, pp 399–408

  58. Doogan C, Buntine W (2021) Topic model or topic twaddle? re-evaluating semantic interpretability measures. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 3824–3848

  59. Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. GSCL, pp 31–40

  60. Newman D, Noh Y, Talley E, Karimi S, Baldwin T (2010) Evaluating topic models for digital libraries. In: JCDL, pp 215–224

  61. Qiang J, Li Y, Yuan Y, Liu W, Wu X (2018) STTM: A tool for short text topic modeling. CoRR abs/1808.02215

  62. Minka T (2000) Estimating a Dirichlet distribution. Technical report, MIT

  63. Hartmann NS, Fonseca ER, Shulby CD, Treviso MV, Rodrigues JS, Aluísio SM (2017) Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In: STIL. SBC, pp 122–131

  64. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. TACL 5:135–146

  65. Newman MEJ (2003) Mixing patterns in networks. Phys Rev E 67(2):026126

  66. Newman M (2018) Networks. Oxford university press

Acknowledgements

The authors would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) and Serviço Federal de Processamento de Dados (SERPRO), for their financial support.

Funding

Gisele L. Pappa was supported by FAPEMIG (grant no. CEX-PPM-00098-17), MPMG (through the project Analytical Capabilities), and CNPq (grant no. 310833/2019-1). Marcelo Pita was supported by SERPRO (through the Graduate Incentive Program). Matheus Nunes was supported by CNPq (studentship grant).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Marcelo Pita.

Ethics declarations

Conflicts of Interest

None declared.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pita, M., Nunes, M. & Pappa, G.L. Probabilistic topic modeling for short text based on word embedding networks. Appl Intell 52, 17829–17844 (2022). https://doi.org/10.1007/s10489-022-03388-5
