Abstract
Uncovering topics in short texts can be an arduous task. General-purpose topic models handle short documents poorly because they struggle with scarce context information. A variety of strategies have been proposed to address this problem, such as using application-specific information, generating larger pseudo-documents, or modeling a single topic per document. This paper introduces a novel strategy named Vec2Graph Topic Model (VGTM). It creates a graph-based representation of the analyzed corpus using word embeddings, named Vec2Graph, and infers topics from overlapping community patterns on this graph. Vec2Graph leverages the semantics of word embeddings to create a dense word similarity graph, mitigating the lack of context in short text documents. Experiments evaluating topic coherence on four benchmarks and two real-world datasets show that VGTM achieves the best overall results (statistically better in 10 out of 18 experiments) compared with standard and state-of-the-art short text topic models. We also analyze the relationship between one of our evaluated metrics, NPMI, and structural patterns in the Vec2Graph representation. Networks with a strong community structure tend to exhibit higher NPMI values, suggesting that topic coherence may be directly measured, and potentially controlled, through these network features.
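The central Vec2Graph idea, connecting words whose embeddings are semantically close, can be sketched as follows. This is a minimal plain-Python illustration, not the paper's actual implementation: the toy 2-d vectors, the similarity threshold, and the function names are all assumptions made for the example.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_similarity_graph(embeddings, threshold=0.9):
    """Connect every pair of words whose embedding cosine
    similarity reaches the threshold; edge weight = similarity."""
    words = sorted(embeddings)
    edges = {}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            s = cosine(embeddings[w1], embeddings[w2])
            if s >= threshold:
                edges[(w1, w2)] = s
    return edges

# Toy 2-d "embeddings": cat/dog point one way, car/bus another.
emb = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.2],
    "car": [0.1, 0.9], "bus": [0.2, 0.8],
}
graph = build_similarity_graph(emb, threshold=0.9)
# Only the semantically close pairs survive: (bus, car) and (cat, dog).
```

Because edges come from embedding similarity learned on a large external corpus rather than from within-document co-occurrence, even a one-sentence document contributes words to a densely connected graph.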
Availability of Data and Material
Links to the datasets appear as footnotes in the main body of the paper. Two datasets (Courses and CSM) can be made available upon request, as they may include sensitive information.
Code Availability
The Vec2Graph code is available at https://github.com/marcelopita/vec2graph_paper. The VGTM code is available at https://github.com/marcelopita/vgtm.
Notes
There are other approaches, such as visual qualitative analysis, indirect evaluation (e.g., document classification) and topic diversity [23].
Code available at: https://github.com/marcelopita/vec2graph_paper (2021/01/01).
Dataset available at: https://github.com/marcelopita/datasets/blob/master/sanders.csv (2021/01/01)
An interactive version of this graph is available at: https://homepages.dcc.ufmg.br/~marcelo.pita/vec2graph/corpus_graph.html (2021/01/01)
Code available at: https://github.com/marcelopita/vgtm (2021/01/01).
It is important to note that the use of NMF as an overlapping community detector is essentially different from the traditional use of NMF as a topic discovery method. In our case, we have a word-word adjacency matrix derived from \(\mathcal {G}_{C}\), while in the traditional case we usually have a document-term TF-IDF matrix.
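The note above can be illustrated with a sketch of NMF applied to a word-word adjacency matrix, with overlapping communities read off the membership factor. This is a minimal plain-Python illustration under stated assumptions (Frobenius-norm multiplicative updates, a toy adjacency matrix, and an illustrative membership threshold), not the paper's implementation.

```python
import random

def matmul(A, B):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Factor the nonnegative n x n adjacency A into W (n x k)
    and H (k x n) with standard multiplicative updates."""
    rng = random.Random(seed)
    n = len(A)
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        WT = transpose(W)
        num, den = matmul(WT, A), matmul(matmul(WT, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(n)] for i in range(k)]
        HT = transpose(H)
        num, den = matmul(A, HT), matmul(W, matmul(H, HT))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(k)] for i in range(n)]
    return W, H

def communities(W, words, threshold=0.3):
    """A word joins every community whose row-normalized membership
    weight reaches the threshold -- so communities may overlap."""
    out = [set() for _ in range(len(W[0]))]
    for row, word in zip(W, words):
        total = sum(row) or 1.0
        for c, v in enumerate(row):
            if v / total >= threshold:
                out[c].add(word)
    return out

# Toy word graph: {cat, dog, pet} cluster, {car, bus} cluster,
# with "pet" also linked to "car" as a bridge word.
words = ["cat", "dog", "pet", "car", "bus"]
A = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]
W, H = nmf(A, k=2)
comms = communities(W, words)
```

Because membership is soft and thresholded rather than a hard partition, a bridge word can belong to more than one community, which is what lets a word participate in more than one topic.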
An interactive version of this graph with topic information is available at: https://homepages.dcc.ufmg.br/~marcelo.pita/vgtm/sanders.html (2021/09/15)
Benchmark datasets available at: https://github.com/marcelopita/datasets/ (2021/01/01)
In this case, a sample of 15 million documents from the WMT11 news corpus, available at http://www.statmt.org/wmt11/training-monolingual.tgz (2021/09/10)
The NPMI score was calculated using the Palmetto tool [57]. Code available at: https://github.com/dice-group/Palmetto (2021/09/10).
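The NPMI coherence computed by Palmetto can be sketched as follows. This is a simplified illustration: it estimates probabilities from document-level co-occurrence on a tiny toy corpus, whereas the tool's actual probability estimation over a reference corpus differs, and all names here are illustrative assumptions.

```python
from math import log
from itertools import combinations

def npmi(w1, w2, doc_count, word_docs):
    """NPMI of a word pair, with probabilities estimated as
    document frequencies: NPMI = log(p12 / (p1*p2)) / -log(p12)."""
    p1 = len(word_docs[w1]) / doc_count
    p2 = len(word_docs[w2]) / doc_count
    p12 = len(word_docs[w1] & word_docs[w2]) / doc_count
    if p12 == 0.0:
        return -1.0  # never co-occur: NPMI minimum
    if p12 == 1.0:
        return 1.0   # co-occur in every document: NPMI maximum
    return log(p12 / (p1 * p2)) / -log(p12)

def topic_npmi(top_words, docs):
    """Topic coherence: average NPMI over all pairs of top words."""
    word_docs = {w: {i for i, d in enumerate(docs) if w in d}
                 for w in top_words}
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, len(docs), word_docs) for a, b in pairs) / len(pairs)

# Toy reference corpus of four "documents" (as word sets).
docs = [{"cat", "dog"}, {"cat", "dog", "pet"}, {"car", "bus"}, {"car"}]
score_good = topic_npmi(["cat", "dog"], docs)  # co-occur whenever either appears
score_bad = topic_npmi(["cat", "car"], docs)   # never co-occur
```

Words that always appear together score near +1, while words that never co-occur score -1, which is why NPMI rewards topics whose top words form a tight cluster, exactly the community structure discussed in the paper.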
Code available at: https://github.com/dice-group/Palmetto (2021/09/10).
Code available at: https://github.com/gabrielmip/LDAOpt (2021/01/01).
Code available at: https://github.com/xiaohuiyan/BTM (2021/01/01).
Code available at: https://github.com/marcelopita/drex_published (2021/01/01).
GPU-DMM implementation of the STTM tool [13, 61]. Code available at: https://github.com/qiang2100/STTM (2021/01/01).
Code available at: https://github.com/tshi04/SeaNMF (2021/01/01).
Code available at: https://github.com/feliperviegas/cluwords (2021/01/01).
Implemented according to the original paper [10].
Data extracted from an XML file containing 8,102,107 articles and 2,120,659 words.
Data extracted from 17 Brazilian and European Portuguese corpora in a total of 1,395,926,282 words. Available at: http://www.nilc.icmc.usp.br/embeddings (2021/01/01).
References
Boyd-Graber JL, Hu Y, Mimno D (2017) Applications of topic models. Found Trends Inf Retr 11(2–3)
Rosso P, Errecalde M, Pinto D (2013) Analysis of short texts on the web: introduction to special issue. Lang Resour Eval 47(1):123–126
Zhang H, Zhong G (2016) Improving short text classification by learning vector representations of both words and hidden topics. Knowl-Based Syst 102:76–86
Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. JMLR 3:993–1022
Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: SIGIR, ACM, pp 267–273
Tang J, Meng Z, Nguyen X, Mei Q, Zhang M (2014) Understanding the limiting factors of topic modeling via posterior contraction analysis. In: ICML, pp 190–198
Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: WWW, ACM, pp 1445–1456
Yin J, Wang J (2014) A dirichlet multinomial mixture model-based approach for short text clustering. In: KDD, ACM, pp 233–242
Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: A pseudo-document view. In: KDD, ACM, pp 2105–2114
Zuo Y, Zhao J, Xu K (2016) Word network topic model: a simple but general solution for short and imbalanced texts. Knowl Inf Syst 48(2):379–398
Bicalho P, Pita M, Pedrosa G, Lacerda A, Pappa GL (2017) A general framework to expand short text for topic modeling. Inf Sci 393:66–81
Nguyen HT, Duong PH, Cambria E (2019) Learning short-text semantic similarity with word embeddings and external knowledge sources. Knowl-Based Syst 182:104842
Qiang J, Qian Z, Li Y, Yuan Y, Wu X (2020) Short text topic modeling techniques, applications, and performance: A survey. TKDE, pp 1–1
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: ICLR, pp 1–12
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: EMNLP, pp 1532–1543
Nguyen DQ, Billingsley R, Du L, Johnson M (2015) Improving topic models with latent feature word representations. TACL 3:299–313
Li C, Wang H, Zhang Z, Sun A, Ma Z (2016) Topic modeling for short texts with auxiliary word embeddings. In: SIGIR, ACM, pp 165–174
Shi T, Kang K, Choo J, Reddy CK (2018) Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations. In: WWW, pp 1105–1114
Viegas F, Canuto S, Gomes C, Luiz W, Rosa T, Ribas S, Rocha L, Gonçalves MA (2019) Cluwords: exploiting semantic word clustering representation for enhanced topic modeling. In: WSDM, pp 753–761
Hong L, Davison BD (2010) Empirical study of topic modeling in twitter. In: KDD Workshops, ACM, pp 80–88
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NeurIPS, pp 1–9
Xie J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Computing Surveys 45(4):43
Dieng AB, Ruiz FJR, Blei DM (2020) Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics 8:439–453
Srivastava A, Sutton C (2017) Autoencoding variational inference for topic models. In: ICLR, pp 1–12
Zhang H, Chen B, Cong Y, Guo D, Liu H, Zhou M (2020) Deep autoencoding topic model with scalable hybrid bayesian inference. IEEE TPAMI
Zhang H, Chen B, Guo D, Zhou M (2018) WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In: ICLR
Gupta P, Chaudhary Y, Buettner F, Schütze H (2019) Document informed neural autoregressive topic models with distributional prior. In: AAAI, vol 33, pp 6505–6512
Quan X, Kit C, Ge Y, Pan SJ (2015) Short and sparse text topic modeling via self-aggregation. In: AAAI, pp 2270–2276
Li X, Li C, Chi J, Ouyang J (2018) Short text topic modeling by exploring original documents. Knowl Inf Syst 56(2):443–462
Mahmoud H (2008) Pólya urn models. Chapman and Hall/CRC
Das R, Zaheer M, Dyer C (2015) Gaussian LDA for topic models with word embeddings. In: ACL-IJCNLP, pp 795–804
Shi B, Lam W, Jameel S, Schockaert S, Lai KP (2017) Jointly learning word embeddings and latent topics. In: SIGIR, pp 375–384
Li X, Zhang A, Li C, Guo L, Wang W, Ouyang J (2019) Relational biterm topic model: Short-text topic modeling using word embeddings. The Computer Journal 62(3):359–372
Tuan AP, Bach TX, Nguyen TH, Linh NV, Than K (2020) Bag of biterms modeling for short texts. Knowl. Inf. Syst. 62(10):4055–4090
Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: SIGIR, pp 889–892
Qiang J, Chen P, Wang T, Wu X (2017) Topic modeling over short texts by incorporating word embeddings. In: PAKDD, Springer, pp 363–374
Xie P, Yang D, Xing E (2015) Incorporating word correlation knowledge into topic modeling. In: NAACL, pp 725–734
Gao W, Peng M, Wang H, Zhang Y, Xie Q, Tian G (2019) Incorporating word embeddings into topic modeling of short text. Knowl Inf Syst 61(2):1123–1145
Rashid J, Shah SMA, Irtaza A (2019) Fuzzy topic modeling approach for text mining over short text. Information Processing & Management 56(6):102060
Osman AH, Barukub OM (2020) Graph-based text representation and matching: A review of the state of the art and future challenges. IEEE Access 8:87562–87583
Rousseau F, Kiagias E, Vazirgiannis M (2015) Text categorization as a graph classification problem. In: ACL-IJCNLP, pp 1702–1712
Meladianos P, Tixier A, Nikolentzos I, Vazirgiannis M (2017) Real-time keyword extraction from conversations. In: EACL, pp 462–467
Blanco R, Lioma C (2012) Graph-based term weighting for information retrieval. Information retrieval 15(1):54–92
Rousseau F, Vazirgiannis M (2013) Graph-of-word and TW-IDF: new approach to ad hoc IR. In: CIKM, pp 59–68
Malliaros FD, Vazirgiannis M (2017) Graph-based text representations: boosting text mining, NLP and information retrieval with graphs. In: EMNLP
David E, Jon K (2010) Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, New York, NY, USA
Skianis K, Rousseau F, Vazirgiannis M (2016) Regularizing text categorization with clusters of words. In: EMNLP, pp 1827–1837
Yang L, Cao X, Jin D, Wang X, Meng D (2014) A unified semi-supervised community detection framework using latent space graph regularization. Transactions on Cybernetics 45(11):2585–2598
Amelio A, Pizzuti C (2014) Overlapping community discovery methods: a survey. In: Social Networks: Analysis and Case Studies. Springer, pp 105–125
Wang F, Li T, Wang X, Zhu S, Ding C (2011) Community discovery using nonnegative matrix factorization. DMKD 22(3):493–521
Zhang Y, Yeung D-Y (2012) Overlapping community detection via bounded nonnegative matrix tri-factorization. In: KDD, ACM, pp 606–614
Févotte C, Idier J (2011) Algorithms for nonnegative matrix factorization with the β-divergence. Neural computation 23(9):2421–2456
Sanders NJ (2011) Sanders-twitter sentiment corpus. Sanders Analytics LLC 242:1–4
Phan X-H, Nguyen L-M, Horiguchi S (2008) Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: WWW, ACM, pp 91–100
Vitale D, Ferragina P, Scaiella U (2012) Classification of short texts by deploying topical annotations. In: ECIR, Springer, pp 376–387
The Writing Center, University of North Carolina at Chapel Hill: Paragraphs (2019)
Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: WSDM, ACM, pp 399–408
Doogan C, Buntine W (2021) Topic model or topic twaddle? Re-evaluating semantic interpretability measures. In: NAACL-HLT, pp 3824–3848
Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. GSCL, pp 31–40
Newman D, Noh Y, Talley E, Karimi S, Baldwin T (2010) Evaluating topic models for digital libraries. In: JCDL, pp 215–224
Qiang J, Li Y, Yuan Y, Liu W, Wu X (2018) STTM: A tool for short text topic modeling. CoRR abs/1808.02215
Minka T (2000) Estimating a dirichlet distribution. Technical report, MIT
Hartmann NS, Fonseca ER, Shulby CD, Treviso MV, Rodrigues JS, Aluísio SM (2017) Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In: STIL. SBC, pp 122–131
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. TACL 5:135–146
Newman MEJ (2003) Mixing patterns in networks. Phys Rev E 67(2):026126
Newman M (2018) Networks. Oxford university press
Acknowledgements
The authors would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) and Serviço Federal de Processamento de Dados (SERPRO), for their financial support.
Funding
Gisele L. Pappa was supported by FAPEMIG (grant no. CEX-PPM-00098-17), MPMG (through the project Analytical Capabilities), and CNPq (grant no. 310833/2019-1). Marcelo Pita was supported by SERPRO (through the Graduate Incentive Program). Matheus Nunes was supported by CNPq (studentship grant).
Ethics declarations
Conflicts of Interest
None declared.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Pita, M., Nunes, M. & Pappa, G.L. Probabilistic topic modeling for short text based on word embedding networks. Appl Intell 52, 17829–17844 (2022). https://doi.org/10.1007/s10489-022-03388-5