Abstract
Uncovering topics in short texts can be an arduous task. General-purpose topic models handle short documents poorly because they struggle with scarce context information. A variety of strategies have been proposed to address this problem, such as using application-specific information, generating larger pseudo-documents, or modeling a single topic per document. This paper introduces a novel strategy named Vec2Graph Topic Model (VGTM). It creates a graph-based representation of the analyzed corpus using word embeddings, named Vec2Graph, and infers topics from overlapping community patterns on this graph. Vec2Graph leverages the semantics of word embeddings to create a dense word similarity graph, mitigating the lack of context in short text documents. Experiments evaluating topic coherence on four benchmarks and two real-world datasets show that VGTM achieves the best overall results (statistically better in 10 out of 18 experiments) compared with standard and state-of-the-art short text topic models. We also analyze the relationship between one of our evaluated metrics, NPMI, and structural patterns in the Vec2Graph representation. Networks with a strong community structure tend to exhibit higher NPMI values, suggesting that topic coherence may be directly measured, and potentially controlled, through these network features.
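The central Vec2Graph idea, connecting words whose embeddings are semantically close, can be sketched as follows. This is a minimal plain-Python illustration, not the paper's actual implementation: the toy 2-d vectors, the similarity threshold, and the function names are all assumptions made for the example.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_similarity_graph(embeddings, threshold=0.9):
    """Connect every pair of words whose embedding cosine
    similarity reaches the threshold; edge weight = similarity."""
    words = sorted(embeddings)
    edges = {}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            s = cosine(embeddings[w1], embeddings[w2])
            if s >= threshold:
                edges[(w1, w2)] = s
    return edges

# Toy 2-d "embeddings": cat/dog point one way, car/bus another.
emb = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.2],
    "car": [0.1, 0.9], "bus": [0.2, 0.8],
}
graph = build_similarity_graph(emb, threshold=0.9)
# Only the semantically close pairs survive: (bus, car) and (cat, dog).
```

Because edges come from embedding similarity learned on a large external corpus rather than from within-document co-occurrence, even a one-sentence document contributes words to a densely connected graph.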
Availability of Data and Material
Links to the datasets appear as footnotes in the main body of the paper. Two datasets (Courses and CSM) can be made available upon request, as they may include sensitive information.
Code Availability
The Vec2Graph code is available at https://github.com/marcelopita/vec2graph_paper. The VGTM code is available at https://github.com/marcelopita/vgtm.
Notes
There are other approaches, such as visual qualitative analysis, indirect evaluation (e.g., document classification) and topic diversity [23].
Code available at: https://github.com/marcelopita/vec2graph_paper (2021/01/01).
Dataset available at: https://github.com/marcelopita/datasets/blob/master/sanders.csv (2021/01/01)
An interactive version of this graph is available at: https://homepages.dcc.ufmg.br/~marcelo.pita/vec2graph/corpus_graph.html (2021/01/01)
Code available at: https://github.com/marcelopita/vgtm (2021/01/01).
It is important to note that the use of NMF as an overlapping community detector is essentially different from the traditional use of NMF as a topic discovery method. In our case, we have a word-word adjacency matrix derived from \(\mathcal {G}_{C}\), while in the traditional case we usually have a document-term TF-IDF matrix.
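The note above can be illustrated with a sketch of NMF applied to a word-word adjacency matrix, with overlapping communities read off the membership factor. This is a minimal plain-Python illustration under stated assumptions (Frobenius-norm multiplicative updates, a toy adjacency matrix, and an illustrative membership threshold), not the paper's implementation.

```python
import random

def matmul(A, B):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Factor the nonnegative n x n adjacency A into W (n x k)
    and H (k x n) with standard multiplicative updates."""
    rng = random.Random(seed)
    n = len(A)
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        WT = transpose(W)
        num, den = matmul(WT, A), matmul(matmul(WT, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(n)] for i in range(k)]
        HT = transpose(H)
        num, den = matmul(A, HT), matmul(W, matmul(H, HT))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(k)] for i in range(n)]
    return W, H

def communities(W, words, threshold=0.3):
    """A word joins every community whose row-normalized membership
    weight reaches the threshold -- so communities may overlap."""
    out = [set() for _ in range(len(W[0]))]
    for row, word in zip(W, words):
        total = sum(row) or 1.0
        for c, v in enumerate(row):
            if v / total >= threshold:
                out[c].add(word)
    return out

# Toy word graph: {cat, dog, pet} cluster, {car, bus} cluster,
# with "pet" also linked to "car" as a bridge word.
words = ["cat", "dog", "pet", "car", "bus"]
A = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]
W, H = nmf(A, k=2)
comms = communities(W, words)
```

Because membership is soft and thresholded rather than a hard partition, a bridge word can belong to more than one community, which is what lets a word participate in more than one topic.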
An interactive version of this graph with topic information is available at: https://homepages.dcc.ufmg.br/~marcelo.pita/vgtm/sanders.html (2021/09/15)
Benchmark datasets available at: https://github.com/marcelopita/datasets/ (2021/01/01)
In this case, a sample of 15 million documents from the WMT11 news corpus, available at http://www.statmt.org/wmt11/training-monolingual.tgz (2021/09/10)
The NPMI score was calculated using the Palmetto tool [57]. Code available at: https://github.com/dice-group/Palmetto (2021/09/10).
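The NPMI coherence computed by Palmetto can be sketched as follows. This is a simplified illustration: it estimates probabilities from document-level co-occurrence on a tiny toy corpus, whereas the tool's actual probability estimation over a reference corpus differs, and all names here are illustrative assumptions.

```python
from math import log
from itertools import combinations

def npmi(w1, w2, doc_count, word_docs):
    """NPMI of a word pair, with probabilities estimated as
    document frequencies: NPMI = log(p12 / (p1*p2)) / -log(p12)."""
    p1 = len(word_docs[w1]) / doc_count
    p2 = len(word_docs[w2]) / doc_count
    p12 = len(word_docs[w1] & word_docs[w2]) / doc_count
    if p12 == 0.0:
        return -1.0  # never co-occur: NPMI minimum
    if p12 == 1.0:
        return 1.0   # co-occur in every document: NPMI maximum
    return log(p12 / (p1 * p2)) / -log(p12)

def topic_npmi(top_words, docs):
    """Topic coherence: average NPMI over all pairs of top words."""
    word_docs = {w: {i for i, d in enumerate(docs) if w in d}
                 for w in top_words}
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, len(docs), word_docs) for a, b in pairs) / len(pairs)

# Toy reference corpus of four "documents" (as word sets).
docs = [{"cat", "dog"}, {"cat", "dog", "pet"}, {"car", "bus"}, {"car"}]
score_good = topic_npmi(["cat", "dog"], docs)  # co-occur whenever either appears
score_bad = topic_npmi(["cat", "car"], docs)   # never co-occur
```

Words that always appear together score near +1, while words that never co-occur score -1, which is why NPMI rewards topics whose top words form a tight cluster, exactly the community structure discussed in the paper.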
Code available at: https://github.com/dice-group/Palmetto (2021/09/10).
Code available at: https://github.com/gabrielmip/LDAOpt (2021/01/01).
Code available at: https://github.com/xiaohuiyan/BTM (2021/01/01).
Code available at: https://github.com/marcelopita/drex_published (2021/01/01).
GPU-DMM implementation of the STTM tool [13, 61]. Code available at: https://github.com/qiang2100/STTM (2021/01/01).
Code available at: https://github.com/tshi04/SeaNMF (2021/01/01).
Code available at: https://github.com/feliperviegas/cluwords (2021/01/01).
Implemented according to the original paper [10].
Data extracted from an XML file containing 8,102,107 articles and 2,120,659 words.
Data extracted from 17 Brazilian and European Portuguese corpora in a total of 1,395,926,282 words. Available at: http://www.nilc.icmc.usp.br/embeddings (2021/01/01).
References
Boyd-Graber JL, Hu Y, Mimno D (2017) Applications of topic models. Found Trends Inf Retr 11(2–3)
Rosso P, Errecalde M, Pinto D (2013) Analysis of short texts on the web: introduction to special issue. Lang Resour Eval 47(1):123–126
Zhang H, Zhong G (2016) Improving short text classification by learning vector representations of both words and hidden topics. Knowl-Based Syst 102:76–86
Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. JMLR 3:993–1022
Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: SIGIR, ACM, pp 267–273
Tang J, Meng Z, Nguyen X, Mei Q, Zhang M (2014) Understanding the limiting factors of topic modeling via posterior contraction analysis. In: ICML, pp 190–198
Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: WWW, ACM, pp 1445–1456
Yin J, Wang J (2014) A dirichlet multinomial mixture model-based approach for short text clustering. In: KDD, ACM, pp 233–242
Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: A pseudo-document view. In: KDD, ACM, pp 2105–2114
Zuo Y, Zhao J, Xu K (2016) Word network topic model: a simple but general solution for short and imbalanced texts. Knowl Inf Syst 48(2):379–398
Bicalho P, Pita M, Pedrosa G, Lacerda A, Pappa GL (2017) A general framework to expand short text for topic modeling. Inf Sci 393:66–81
Nguyen HT, Duong PH, Cambria E (2019) Learning short-text semantic similarity with word embeddings and external knowledge sources. Knowl-Based Syst 182:104842
Qiang J, Qian Z, Li Y, Yuan Y, Wu X (2020) Short text topic modeling techniques, applications, and performance: A survey. TKDE, pp 1–1
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: ICLR, pp 1–12
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: EMNLP, pp 1532–1543
Nguyen DQ, Billingsley R, Du L, Johnson M (2015) Improving topic models with latent feature word representations. TACL 3:299–313
Li C, Wang H, Zhang Z, Sun A, Ma Z (2016) Topic modeling for short texts with auxiliary word embeddings. In: SIGIR, ACM, pp 165–174
Shi T, Kang K, Choo J, Reddy CK (2018) Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations. In: WWW, pp 1105–1114
Viegas F, Canuto S, Gomes C, Luiz W, Rosa T, Ribas S, Rocha L, Gonçalves MA (2019) Cluwords: exploiting semantic word clustering representation for enhanced topic modeling. In: WSDM, pp 753–761
Hong L, Davison BD (2010) Empirical study of topic modeling in twitter. In: KDD Workshops, ACM, pp 80–88
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NeurIPS, pp 1–9
Xie J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Computing Surveys 45(4):43
Dieng AB, Ruiz FJR, Blei DM (2020) Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics 8:439–453
Srivastava A, Sutton C (2017) Autoencoding variational inference for topic models. In: ICLR, pp 1–12
Zhang H, Chen B, Cong Y, Guo D, Liu H, Zhou M (2020) Deep autoencoding topic model with scalable hybrid bayesian inference. IEEE TPAMI
Zhang H, Chen B, Guo D, Zhou M (2018) WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In: ICLR
Gupta P, Chaudhary Y, Buettner F, Schütze H (2019) Document informed neural autoregressive topic models with distributional prior. In: AAAI, vol 33, pp 6505–6512
Quan X, Kit C, Ge Y, Pan SJ (2015) Short and sparse text topic modeling via self-aggregation. In: AAAI, pp 2270–2276
Li X, Li C, Chi J, Ouyang J (2018) Short text topic modeling by exploring original documents. Knowl Inf Syst 56(2):443–462
Mahmoud H (2008) Pólya urn models. Chapman and Hall/CRC
Das R, Zaheer M, Dyer C (2015) Gaussian LDA for topic models with word embeddings. In: ACL-IJCNLP, pp 795–804
Shi B, Lam W, Jameel S, Schockaert S, Lai KP (2017) Jointly learning word embeddings and latent topics. In: SIGIR, pp 375–384
Li X, Zhang A, Li C, Guo L, Wang W, Ouyang J (2019) Relational biterm topic model: Short-text topic modeling using word embeddings. The Computer Journal 62(3):359–372
Tuan AP, Bach TX, Nguyen TH, Linh NV, Than K (2020) Bag of biterms modeling for short texts. Knowl. Inf. Syst. 62(10):4055–4090
Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: SIGIR, pp 889–892
Qiang J, Chen P, Wang T, Wu X (2017) Topic modeling over short texts by incorporating word embeddings. In: PAKDD, Springer, pp 363–374
Xie P, Yang D, Xing E (2015) Incorporating word correlation knowledge into topic modeling. In: NAACL, pp 725–734
Gao W, Peng M, Wang H, Zhang Y, Xie Q, Tian G (2019) Incorporating word embeddings into topic modeling of short text. Knowl Inf Syst 61(2):1123–1145
Rashid J, Shah SMA, Irtaza A (2019) Fuzzy topic modeling approach for text mining over short text. Information Processing & Management 56(6):102060
Osman AH, Barukub OM (2020) Graph-based text representation and matching: A review of the state of the art and future challenges. IEEE Access 8:87562–87583
Rousseau F, Kiagias E, Vazirgiannis M (2015) Text categorization as a graph classification problem. In: ACL-IJCNLP, pp 1702–1712
Meladianos P, Tixier A, Nikolentzos I, Vazirgiannis M (2017) Real-time keyword extraction from conversations. In: EACL, pp 462–467
Blanco R, Lioma C (2012) Graph-based term weighting for information retrieval. Information retrieval 15(1):54–92
Rousseau F, Vazirgiannis M (2013) Graph-of-word and TW-IDF: new approach to ad hoc IR. In: CIKM, pp 59–68
Malliaros FD, Vazirgiannis M (2017) Graph-based text representations: boosting text mining, NLP and information retrieval with graphs. In: EMNLP
David E, Jon K (2010) Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, New York, NY, USA
Skianis K, Rousseau F, Vazirgiannis M (2016) Regularizing text categorization with clusters of words. In: EMNLP, pp 1827–1837
Yang L, Cao X, Jin D, Wang X, Meng D (2014) A unified semi-supervised community detection framework using latent space graph regularization. Transactions on Cybernetics 45(11):2585–2598
Amelio A, Pizzuti C (2014) Overlapping community discovery methods: a survey. In: Social Networks: Analysis and Case Studies. Springer, pp 105–125
Wang F, Li T, Wang X, Zhu S, Ding C (2011) Community discovery using nonnegative matrix factorization. DMKD 22(3):493–521
Zhang Y, Yeung D-Y (2012) Overlapping community detection via bounded nonnegative matrix tri-factorization. In: KDD, ACM, pp 606–614
Févotte C, Idier J (2011) Algorithms for nonnegative matrix factorization with the β-divergence. Neural computation 23(9):2421–2456
Sanders NJ (2011) Sanders-twitter sentiment corpus. Sanders Analytics LLC 242:1–4
Phan X-H, Nguyen L-M, Horiguchi S (2008) Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: WWW, ACM, pp 91–100
Vitale D, Ferragina P, Scaiella U (2012) Classification of short texts by deploying topical annotations. In: ECIR, Springer, pp 376–387
The Writing Center, University of North Carolina at Chapel Hill: Paragraphs (2019)
Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: WSDM, ACM, pp 399–408
Doogan C, Buntine W (2021) Topic model or topic twaddle? Re-evaluating semantic interpretability measures. In: NAACL-HLT, pp 3824–3848
Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. GSCL, pp 31–40
Newman D, Noh Y, Talley E, Karimi S, Baldwin T (2010) Evaluating topic models for digital libraries. In: JCDL, pp 215–224
Qiang J, Li Y, Yuan Y, Liu W, Wu X (2018) STTM: A tool for short text topic modeling. CoRR abs/1808.02215
Minka T (2000) Estimating a dirichlet distribution. Technical report, MIT
Hartmann NS, Fonseca ER, Shulby CD, Treviso MV, Rodrigues JS, Aluísio SM (2017) Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In: STIL. SBC, pp 122–131
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. TACL 5:135–146
Newman MEJ (2003) Mixing patterns in networks. Phys Rev E 67(2):026126
Newman M (2018) Networks. Oxford university press
Acknowledgements
The authors would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) and Serviço Federal de Processamento de Dados (SERPRO), for their financial support.
Funding
Gisele L. Pappa was supported by FAPEMIG (grant no. CEX-PPM-00098-17), MPMG (through the project Analytical Capabilities), and CNPq (grant no. 310833/2019-1). Marcelo Pita was supported by SERPRO (through the Graduate Incentive Program). Matheus Nunes was supported by CNPq (studentship grant).
Ethics declarations
Conflicts of Interest
None declared.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Pita, M., Nunes, M. & Pappa, G.L. Probabilistic topic modeling for short text based on word embedding networks. Appl Intell 52, 17829–17844 (2022). https://doi.org/10.1007/s10489-022-03388-5