Jointly modeling and simultaneously discovering topics and clusters in text corpora using word vectors
Introduction
Topic modeling and text clustering are key tasks in text mining [2], which can be unified so as to benefit from each other [43], [46]. In particular, topic modeling exposes the inherent semantics of a whole corpus of (not necessarily homogeneous) text documents. The uncovered semantics represents (the meaning of) text documents as mixtures of topics, each topic being a suitable ranking over words. Performing topic modeling on a text corpus, while simultaneously clustering its documents, enables the summarization of clusters through corresponding distributions over the topics treated by the respective documents. Such cluster-specific topic distributions capture the underlying semantics of disjoint groups of homogeneous documents, thus providing a more detailed and coherent understanding of the text. Symmetrically, text clustering uncovers patterns of homogeneity in text corpora. However, the semantic coherence of the discovered clusters suffers if homogeneity only reflects lexical regularities across text documents. Relying on the bag-of-words model for raw text processing is likely to worsen cluster quality further, because of the resulting huge text representation (whose dimensionality equals the vocabulary size) and its sparsity (which is especially challenging in the case of short texts). Clustering a text corpus, while simultaneously modeling the topics of its documents, avoids the foregoing limitations: topic modeling provides a concise semantic representation of the text documents within a low-dimensional space of understandable topics, which permits a more effective partitioning of the text corpus into semantically coherent and intelligible clusters.
Combining text clustering with topic modeling is challenging for the following reasons. Foremost, the two tasks have to be suitably interwoven: in principle, both should operate in an interdependent manner, with each task acting as an enhancement of the other. In addition, a synergic interaction between the two tasks should ideally be devised, so as to capture and suitably exploit the syntactic as well as semantic relationships between words.
In this article, a new approach to the seamless integration of text clustering with topic modeling is discussed. The proposed approach is grounded in a principled combination of solid foundations from several disciplines. These encompass probabilistic graphical modeling [27], Bayesian statistics [10], [20], [45], generative latent-factor modeling [6], [34], text mining [2] and word vectors [4], [32].
The intuition behind this approach is to infer the topics and cluster memberships of text documents from their contents. To this end, DISCOVER (Document topIcS and Clusters from wOrd VEctoRs) is developed: an innovative generative model of topics, text and clusters in document collections. Under DISCOVER, the input text documents are conceived of as the observed outcome of an imaginary generative process, governed by clusters as well as topics, which operate as interacting latent factors. According to the generative semantics of DISCOVER, document clusters are endowed with respective multinomial distributions over the underlying topics; such topic distributions enforce intra-cluster coherence. A multinomial distribution over clusters governs document membership. Thus, a generic text document is generated in two steps: first, the distribution over clusters is sampled to establish the document's membership; then, the topic distribution of the chosen cluster is repeatedly sampled to generate the document's content. Overall, the generative process modeled by DISCOVER has three appealing features. First, each textual document comprises word vectors, rather than discrete text units; this choice permits the syntactic as well as semantic regularities across words to be suitably taken into account. Second, the individual document clusters are explicitly assigned descriptive topic distributions as their respective semantics. Third, uncertainty is handled probabilistically according to the consolidated Bayesian treatment [6].
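The two-step generative process described above can be sketched as follows. This is a simplified toy rendering, not the paper's exact specification: all sizes and hyperparameters are hypothetical, topic means are fixed rather than drawn from their priors, and topic covariances are kept at identity for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T, E = 3, 5, 4          # clusters, topics, word-vector dimensionality (toy values)
doc_len = 6

# Cluster-membership distribution and per-cluster topic distributions
# (both multinomial, here drawn from symmetric Dirichlet priors).
pi = rng.dirichlet(np.ones(K))             # distribution over clusters
theta = rng.dirichlet(np.ones(T), size=K)  # one topic distribution per cluster

# Topics are Gaussians over the word-vector space (their means and precisions
# would come from Gaussian-Wishart priors; fixed here for brevity).
topic_means = rng.normal(size=(T, E))

def generate_document():
    c = rng.choice(K, p=pi)                 # step 1: sample the cluster membership
    words = []
    for _ in range(doc_len):
        z = rng.choice(T, p=theta[c])       # step 2a: sample a topic from the cluster
        v = rng.normal(loc=topic_means[z])  # step 2b: emit a word vector from the topic
        words.append(v)
    return c, np.stack(words)

cluster, doc = generate_document()
print(cluster, doc.shape)
```

The sketch makes the interaction of the latent factors explicit: the cluster choice conditions every topic draw, which in turn conditions every emitted word vector.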
All observations and latent factors in the generative process of DISCOVER are characterized as random variables. In compliance with latent-factor modeling, random variables are distinguished into observed and unobserved. Specifically, the individual documents are characterized by means of observed random variables, which take on word vectors (drawn from the representation of topics below), rather than discrete words. Unobserved random variables are employed to characterize the latent factors, since these are neither directly observable nor explicitly measurable. In more detail, topics are characterized as multivariate Gaussian distributions over the space of word vectors [4], [32]; their precision and mean are unobserved random variables, drawn from respective conjugate Gaussian-Wishart priors. Also, the cluster membership of each document is an unobserved random variable, sampled from the multinomial distribution over clusters. All multinomial distributions are drawn from corresponding conjugate Dirichlet priors. Finally, the conditional (in)dependencies among the aforementioned random variables are defined through the elegant formalism of probabilistic graphical modeling. In the context of the generative process of DISCOVER, such conditional (in)dependencies specify the interaction of text clustering with topic modeling, in addition to their influence on document wording.
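Drawing one topic's parameters from a Gaussian-Wishart (Normal-Wishart) prior, and then a word vector from the resulting Gaussian topic, can be sketched with SciPy. The hyperparameter values below are hypothetical placeholders, not those used in the paper:

```python
import numpy as np
from scipy.stats import wishart, multivariate_normal

E = 4                                 # word-vector dimensionality (toy value)
nu0, W0 = E + 2, np.eye(E)            # Wishart degrees of freedom and scale
mu0, kappa0 = np.zeros(E), 1.0        # prior mean and scaling of the mean's precision

# Topic precision matrix: Lambda ~ Wishart(nu0, W0)
Lambda = wishart.rvs(df=nu0, scale=W0, random_state=0)

# Topic mean: mu ~ Gaussian(mu0, (kappa0 * Lambda)^-1), conditioned on Lambda
mu = multivariate_normal.rvs(mean=mu0, cov=np.linalg.inv(kappa0 * Lambda),
                             random_state=1)

# An observed word vector emitted by this topic: v ~ Gaussian(mu, Lambda^-1)
v = multivariate_normal.rvs(mean=mu, cov=np.linalg.inv(Lambda), random_state=2)
print(v.shape)
```

Conjugacy of the Gaussian-Wishart prior with the Gaussian likelihood is what later makes the collapsed posterior computations tractable.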
Under DISCOVER, topic modeling and text clustering are performed simultaneously through Bayesian reasoning. The latter consists in learning the values of the latent random variables of DISCOVER via posterior inference [20], [23], [45] with parameter estimation. More precisely, collapsed Gibbs sampling is used for the posterior inference of the assignments of documents to clusters, while parameter estimation is utilized for the distribution over cluster memberships as well as the topic distributions of clusters and documents.
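A generic collapsed Gibbs sweep over document-cluster assignments, of the kind outlined above, might look as follows. The predictive-likelihood term (`log_pred`) is a placeholder standing in for the DISCOVER-specific collapsed density derived in the paper; the counts and hyperparameter are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(assignments, counts, alpha, log_pred):
    """One sweep: resample each document's cluster from its full conditional.

    assignments -- current cluster id per document
    counts      -- number of documents per cluster
    log_pred(d, k) -- log predictive likelihood of document d under cluster k
    """
    D, K = len(assignments), len(counts)
    for d in range(D):
        counts[assignments[d]] -= 1           # remove d from its current cluster
        logp = np.array([np.log(counts[k] + alpha) + log_pred(d, k)
                         for k in range(K)])
        p = np.exp(logp - logp.max())         # normalize in a numerically safe way
        p /= p.sum()
        assignments[d] = rng.choice(K, p=p)   # resample d's membership
        counts[assignments[d]] += 1           # add d back under its new cluster
    return assignments

# Toy run with a flat likelihood, so sampling follows cluster sizes only.
assignments = np.array([0, 0, 1, 2, 1])
counts = np.bincount(assignments, minlength=3)
assignments = gibbs_sweep(assignments, counts, alpha=1.0,
                          log_pred=lambda d, k: 0.0)
```

In the collapsed scheme, the multinomial parameters are integrated out, so only the discrete assignments are sampled; the remaining distributions are then recovered by parameter estimation.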
An extensive comparative experimentation over benchmark corpora demonstrates the effectiveness of DISCOVER in clustering texts and coherently uncovering their semantic topics. Notably, the evaluation of topic modeling performance is articulated into quantitative and qualitative assessments. The quantitative assessment accounts for intrinsic and extrinsic criteria, corresponding to, respectively, the semantic coherence of the inferred topics and the classification effectiveness enabled by such topics. The qualitative assessment is a case study that elucidates the output of DISCOVER on real-world document corpora, looking into the results from one of the chosen benchmark collections. Our experimentation also investigates the time efficiency and scalability of DISCOVER with respect to the size of both the underlying text corpus and the word vectors.
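For concreteness, one widely used intrinsic criterion of the kind mentioned above is UMass topic coherence, sketched below on a toy corpus; the paper's exact coherence measures may differ:

```python
import math

# Toy corpus: each document is represented as its set of distinct words.
corpus = [{"game", "team", "score"},
          {"game", "score", "win"},
          {"market", "stock"}]

def umass_coherence(topic_words, docs, eps=1.0):
    """Sum of log smoothed co-occurrence ratios over the topic's word pairs."""
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            d_j = sum(1 for d in docs if wj in d)               # docs containing w_j
            d_ij = sum(1 for d in docs if wi in d and wj in d)  # co-occurrence count
            score += math.log((d_ij + eps) / d_j)
    return score

print(umass_coherence(["game", "score", "win"], corpus))  # higher is more coherent
```

Topics whose top words frequently co-occur in documents obtain higher (less negative) coherence, matching human judgments of topic interpretability.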
The originality of DISCOVER lies in the Bayesian probabilistic formalization of an unprecedented and effective interplay between topic modeling in the space of word vectors and a particular instance of text clustering, which also involves the summarization/explanation of cluster semantics. The innovative contributions of this article are summarized below:
- The synergic pairing of text clustering with topic modeling is explored through an innovative approach.
- A Bayesian probabilistic generative model of text corpora, i.e., DISCOVER (Document topIcS and Clusters from wOrd VEctoRs), is developed.
- Under DISCOVER, text clustering is integrated with topic modeling as interacting latent factors, which influence content generation.
- Word vectors are suitably utilized under DISCOVER, in order to account for the syntactic as well as semantic regularities across words.
- The inherent semantics of the uncovered clusters is clearly explained through intelligible topic distributions.
- The mathematical and algorithmic details of collapsed Gibbs sampling with parameter estimation are derived to perform text clustering jointly with topic modeling.
- An empirical comparative evaluation of DISCOVER is conducted over benchmark corpora.
- A new class of competitors is specifically introduced to contrast DISCOVER against pipelines of established approaches to text clustering as well as topic modeling.
- The results of this approach on real-world document corpora are elucidated through an explicative case study.
The rest of this article is structured as follows: notation and preliminary concepts are introduced in Section 2; the DISCOVER model is developed in Section 3; collapsed Gibbs sampling with parameter estimation is derived in Section 4; the empirical assessment of DISCOVER is presented in Section 5; a review of seminal related works is provided in Section 6; finally, conclusions are drawn in Section 7, where future research is also highlighted.
Section snippets
Preliminaries
Let 𝒟 be a text corpus¹ on a vocabulary 𝒱. 𝒱 is a set comprising V words, i.e., 𝒱 = {w₁, …, w_V}. 𝒟 is a collection of D documents, i.e., 𝒟 = {d₁, …, d_D}. Additionally, any document of 𝒟 actually comprises lexical elements from 𝒱. For the purpose of capturing the syntactic and also semantic regularities across words in 𝒟, any document of 𝒟 is
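A minimal sketch of the word-vector representation of documents alluded to above, using a hypothetical pretrained embedding table (the paper's actual embeddings and dimensionalities are not assumed here):

```python
import numpy as np

# Hypothetical pretrained embedding table: word -> word vector.
embeddings = {"cat": np.array([0.1, 0.3]),
              "dog": np.array([0.2, 0.1]),
              "car": np.array([0.9, 0.8])}

def embed_document(tokens, table):
    # Keep only in-vocabulary words: out-of-vocabulary tokens carry no vector.
    return np.stack([table[w] for w in tokens if w in table])

doc = embed_document(["cat", "dog", "plane", "car"], embeddings)
print(doc.shape)  # (3, 2): three in-vocabulary words, two dimensions
```

Each document thus becomes a sequence of points in the embedding space, which is the form of input the Gaussian topics operate on.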
The DISCOVER Model
DISCOVER (Document topIcS and Clusters from wOrd VEctoRs) is a generative latent-factor model of document collections with their respective clusters and topics. Under DISCOVER, any corpus is conceived as the only observed result of a Bayesian probabilistic generative process. In this process, the constituting elements of and are random variables. The random variables in , and are regarded as latent factors. These govern the generation of , although their values are
Posterior inference
DISCOVER integrates text clustering with topic modeling by means of corresponding latent random variables, which interact in the Bayesian probabilistic generation of document corpora. Thus, given a corpus , both tasks are performed under DISCOVER through Bayesian reasoning. The latter involves the inference of a posterior distribution , with which to trace back to the latent random variables in , and . This amounts implicitly to carrying out text clustering
Empirical evaluation
A thorough experimentation of our approach was carried out on real-world benchmark text corpora. The pursued purposes are manifold, i.e.:
- evaluating its effectiveness in clustering corpora of texts and uncovering their topics;
- assessing whether the integration of the two tasks is actually more effective than each task in isolation;
- assessing whether the integration of the two tasks is actually more effective than suitably pipelining both tasks through a trivial sequential arrangement;
- studying its
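Clustering effectiveness in this kind of assessment is commonly scored against gold labels with external criteria such as normalized mutual information (NMI) and the adjusted Rand index (ARI); a toy illustration follows (the paper's exact metric suite may differ):

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

gold = [0, 0, 1, 1, 2, 2]   # ground-truth classes
pred = [1, 1, 0, 0, 2, 2]   # predicted clusters: same partition, permuted ids

# Both measures are invariant to relabeling, so a perfect partition scores 1.
print(normalized_mutual_info_score(gold, pred))
print(adjusted_rand_score(gold, pred))
```

Label-permutation invariance is essential here, since cluster ids produced by a generative model carry no intrinsic correspondence to class names.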
Related works
Topic models are meant to represent and uncover the themes of a text corpus [7], [41]. The spectrum of topic model applications is very wide, encompassing information retrieval, natural language processing, computer vision, relevance judgments, social media analysis, sentiment analysis as well as geographic topic modeling [9], [22], [19], [15], [26], [31], [11], [17]. There are two broad families of topic models: traditional and enhanced. Traditional topic models, such as [25], [9], [8], [42],
Conclusions
A new model-based approach to combining text clustering with topic modeling is presented. Both tasks are seamlessly and synergically integrated under DISCOVER, a Bayesian probabilistic generative model of text collections. Under DISCOVER, documents comprise word vectors, so that the syntactic as well as semantic regularities across words are captured. Document clusters and their topics are intended as interacting latent factors, which rule content generation. These latent factors are unveiled
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Gianni Costa: Conceptualization, Methodology, Software, Writing - original draft, Visualization, Investigation, Supervision, Validation, Writing - review & editing. Riccardo Ortale: Conceptualization, Methodology, Software, Writing - original draft, Visualization, Investigation, Supervision, Validation, Writing - review & editing.
References (50)
- et al., A tutorial on Bayesian nonparametric models, J. Math. Psychol. (2012)
- et al., A survey of text clustering algorithms
- M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E.D. Trippe, J.B. Gutierrez, K. Kochut, A brief survey of text mining: ...
- et al., An introduction to MCMC for machine learning, Mach. Learn. (2003)
- et al., A neural probabilistic language model, J. Mach. Learn. Res. (2003)
- Grouping Multidimensional Data, chapter: A Survey of Clustering Data Mining Techniques (2006)
- Pattern Recognition and Machine Learning (2006)
- et al., Text Mining: Classification, Clustering, and Applications, chapter: Topic Models, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series (2009)
- D.M. Blei, J.D. Lafferty, Correlated topic models, in: Proc. of Advances in Neural Information Processing Systems, ...
- et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003)
- Bayesian Inference in Statistical Analysis
- Applications of topic models, Found. Trends Inf. Retrieval
- Document clustering using locality preserving indexing, IEEE Trans. Knowl. Data Eng.
- Locally consistent concept factorization for document clustering, IEEE Trans. Knowl. Data Eng.
- (editor) Partitional Clustering Algorithms
- Reading tea leaves: how humans interpret topic models
- Marrying community discovery and role analysis in social media via topic modeling, Proc. of Pacific-Asia Conference on Knowledge Discovery and Data Mining
- Gaussian LDA for topic models with word embeddings
- Bayesian Data Analysis
- The Elements of Statistical Learning