Topic Modeling of Short Texts: A Pseudo-Document View With Word Embedding Enhancement


Abstract:

Recent years have witnessed the unprecedented growth of online social media, making short texts the prevalent format of information on the Internet. Given the sparsity of the data, however, short-text topic modeling remains a critical and closely watched challenge in both academia and industry. Research has been devoted to building different types of probabilistic topic models for short texts, among which self-aggregation methods have recently emerged to provide informative cross-text word co-occurrences. However, models along this line are still in their infancy; they typically overfit and incur high computational costs. In this paper, we propose a novel model called the Pseudo-document-based Topic Model (PTM), which introduces the concept of a pseudo-document to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo-documents rather than of short texts, PTM yields excellent performance in accuracy and efficiency. A word embedding-enhanced PTM (WE-PTM) is also proposed to leverage pre-trained word embeddings, which is essential to further alleviating data sparsity. Extensive experiments against self-aggregation and word embedding-based baselines on four real-world datasets, including two collections of online media short texts, demonstrate the high quality of the topics learned by our models. Robustness to limited training samples and the explainable semantics of the topics are also investigated.
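The abstract does not spell out the generative process, so the following is only an illustrative sketch of the central idea: topic distributions are attached to latent pseudo-documents rather than to individual short texts. The function names, the hyperparameter values, the symmetric Dirichlet priors, and the choice that each short text belongs to exactly one pseudo-document are all assumptions made for illustration, not details confirmed by this page.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_texts, num_pseudo_docs, num_topics, vocab_size,
                    text_len=5, alpha=0.5, beta=0.5, seed=0):
    """Simulate short texts whose topics come from pseudo-documents.

    All parameter names and defaults are hypothetical.
    """
    rng = random.Random(seed)
    # psi: how short texts are spread over pseudo-documents (assumed prior).
    psi = sample_dirichlet(1.0, num_pseudo_docs, rng)
    # theta[l]: topic distribution of pseudo-document l, shared by every
    # short text assigned to it, which is what counters sparsity.
    theta = [sample_dirichlet(alpha, num_topics, rng)
             for _ in range(num_pseudo_docs)]
    # phi[z]: word distribution of topic z.
    phi = [sample_dirichlet(beta, vocab_size, rng)
           for _ in range(num_topics)]

    corpus = []
    for _ in range(num_texts):
        # Assumption: each short text belongs to one pseudo-document.
        l = rng.choices(range(num_pseudo_docs), weights=psi)[0]
        words = []
        for _ in range(text_len):
            z = rng.choices(range(num_topics), weights=theta[l])[0]
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            words.append(w)
        corpus.append((l, words))
    return corpus


corpus = generate_corpus(num_texts=20, num_pseudo_docs=3,
                         num_topics=4, vocab_size=30)
```

In this sketch only `num_pseudo_docs` topic distributions are drawn, however many short texts there are, so the number of topic-mixture parameters does not grow with the corpus. That is one plausible reading of how aggregation could curb overfitting.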
Published in: IEEE Transactions on Knowledge and Data Engineering (Volume: 35, Issue: 1, 01 January 2023)
Page(s): 972 - 985
Date of Publication: 14 April 2021
