
Information Sciences

Volume 563, July 2021, Pages 226-240

Jointly modeling and simultaneously discovering topics and clusters in text corpora using word vectors

https://doi.org/10.1016/j.ins.2021.01.019

Abstract

An innovative model-based approach to coupling text clustering and topic modeling is introduced, in which the two tasks take advantage of each other. Specifically, the integration is enabled by a new generative model of text corpora, which explains topics, clusters and document content via a Bayesian generative process. In this process, documents comprise word vectors, so as to capture the syntactic and semantic regularities among words. Topics are multivariate Gaussian distributions on word vectors. Clusters are assigned corresponding topic distributions as their semantics. Content generation is ruled by text clusters and topics, which act as interacting latent factors: documents are first placed into respective clusters, and the semantics of these clusters is then repeatedly sampled to draw document topics, which are in turn sampled for word-vector generation.

Under the proposed model, collapsed Gibbs sampling is derived mathematically and implemented algorithmically with parameter estimation for the simultaneous inference of text clusters and topics.

A comparative assessment on real-world benchmark corpora demonstrates the effectiveness of this approach in clustering texts and uncovering their semantics. Intrinsic and extrinsic criteria are adopted to investigate its topic modeling performance, and the results are further illustrated through a case study. Time efficiency and scalability are also studied.

Introduction

Topic modeling and text clustering are key tasks in text mining [2], which can be unified so as to benefit mutually from each other [43], [46]. In particular, topic modeling exposes the inherent semantics of a whole corpus of (not necessarily homogeneous) text documents. The uncovered semantics is a representation of the (meaning of) text documents as mixtures of topics, with topics being suitable word rankings. Performing topic modeling on a text corpus, while clustering its documents simultaneously, enables the summarization of clusters through corresponding distributions over the topics treated by the respective documents. Such cluster-specific topic distributions capture the underlying semantics of disjoint groups of homogeneous documents, thus providing a more detailed and coherent understanding of text. Symmetrically, text clustering uncovers patterns of homogeneity in text corpora. However, the semantic coherence of the discovered clusters can suffer if homogeneity only involves lexical regularities across text documents. The exploitation of the bag-of-words model for raw text processing is likely to worsen cluster quality further, because of the resulting huge text representation (whose dimensionality amounts to vocabulary size) and its sparsity (which is especially challenging in the case of short texts). Clustering a text corpus, while simultaneously modeling the topics of its documents, avoids the foregoing limitations: topic modeling provides a concise semantic representation of the text documents within a low-dimensional space of understandable topics, which permits a more effective partitioning of the text corpus into semantically-coherent and intelligible clusters.

Combining text clustering with topic modeling is challenging for the following reasons. Foremost, the two tasks have to be suitably intertwined: in principle, both should operate in an interdependent manner, with each task acting as an enhancement of the other. In addition, a synergic interaction between the two tasks should ideally be devised, so as to capture and suitably exploit the syntactic as well as semantic relationships between words.

In this article, a new approach to the seamless integration of text clustering with topic modeling is discussed. The proposed approach is grounded in a principled combination of solid foundations from several disciplines. These encompass probabilistic graphical modeling [27], Bayesian statistics [10], [20], [45], generative latent-factor modeling [6], [34], text mining [2] and word vectors [4], [32].

The intuition behind this approach consists in inferring the topics and cluster memberships of text documents from their contents. To this end, DISCOVER (Document topIcS and Clusters from wOrd VEctoRs) is developed, an innovative generative model of topics, text and clusters in document collections. Under DISCOVER, the input text documents are conceived as the observed outcome of an imaginary generative process. The latter is governed by clusters as well as topics, which operate as interacting latent factors. According to the generative semantics of DISCOVER, document clusters are endowed with respective multinomial distributions on the underlying topics. Basically, such topic distributions enforce intra-cluster coherence. A multinomial distribution is placed over clusters to pick document membership. Thus, the generic text document is generated in two steps: at first, the distribution over clusters is sampled to establish its membership; then, the topic distribution of the chosen cluster is repeatedly sampled to word the document content. Overall, the generative process modeled by DISCOVER has three appealing features. Firstly, each text document comprises word vectors, rather than discrete text units; this choice permits the syntactic as well as semantic regularities across words to be suitably taken into account. Secondly, the individual document clusters are explicitly assigned descriptive topic distributions as their respective semantics. Thirdly, uncertainty is handled probabilistically according to the consolidated Bayesian treatment [6].
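To make the two-step generation concrete, the following minimal sketch mimics the process just described using NumPy. All names and hyperparameter values here are illustrative assumptions (the paper's exact parameterization is developed in Section 3), and topic covariances are fixed to the identity for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, E = 4, 10, 50  # clusters, topics, word-vector dimensionality (all assumed)

# Latent factors: a distribution over clusters, cluster-specific topic
# distributions, and one Gaussian topic (mean, covariance) per topic.
pi = rng.dirichlet(np.full(C, 1.0))             # distribution over clusters
theta = rng.dirichlet(np.full(K, 0.1), size=C)  # one topic distribution per cluster
mu = rng.normal(size=(K, E))                    # topic means
cov = np.stack([np.eye(E)] * K)                 # topic covariances (identity for brevity)

def generate_document(n_words):
    """Generate one document: pick a cluster, then repeatedly pick a topic
    from that cluster's semantics and draw a word vector from the topic."""
    c = rng.choice(C, p=pi)                      # step 1: cluster membership
    vectors = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta[c])            # step 2a: topic from the cluster
        vectors.append(rng.multivariate_normal(mu[z], cov[z]))  # step 2b: word vector
    return c, np.array(vectors)

cluster, doc = generate_document(20)             # a 20-word document, shape (20, 50)
```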

All observations and latent factors in the generative process of DISCOVER are characterized as random variables. In compliance with latent-factor modeling, random variables are distinguished into observed and unobserved. Specifically, the individual documents are characterized by means of observed random variables, which take on word vectors (drawn from the topic representations described below), rather than discrete words. The unobserved random variables are employed to characterize the latent factors, since these are neither directly observable nor explicitly measurable. In more detail, topics are characterized as multivariate Gaussian distributions on the space of word vectors [4], [32]; their mean and precision are unobserved random variables, drawn from respective conjugate Gaussian-Wishart priors. Also, the cluster membership of each document is an unobserved random variable, sampled from the multinomial distribution over clusters. All multinomial distributions are drawn from corresponding conjugate Dirichlet priors. Finally, the conditional (in)dependencies among the aforementioned random variables are defined through the elegant formalism of probabilistic graphical modeling. In the context of the generative process of DISCOVER, such conditional (in)dependencies specify the interaction of text clustering with topic modeling, in addition to their influence on document wording.
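The conditional structure just described can be summarized as follows. The symbols used for the hyperparameters (γ, α, and the Gaussian-Wishart parameters μ₀, κ₀, W₀, ν₀) are not taken from the paper and serve only to fix ideas:

```latex
\begin{align*}
\pi &\sim \mathrm{Dirichlet}(\gamma), & c_d \mid \pi &\sim \mathrm{Multinomial}(\pi),\\
\theta_c &\sim \mathrm{Dirichlet}(\alpha), & z_{d,n} \mid c_d, \Theta &\sim \mathrm{Multinomial}(\theta_{c_d}),\\
(\mu_k, \Lambda_k) &\sim \mathcal{NW}(\mu_0, \kappa_0, W_0, \nu_0), & w_{d,n} \mid z_{d,n}, \beta &\sim \mathcal{N}\big(\mu_{z_{d,n}}, \Lambda_{z_{d,n}}^{-1}\big),
\end{align*}
```

where w_{d,n} is the word vector at position n of document d, z_{d,n} its topic, c_d the cluster membership of d, and each topic β_k = (μ_k, Λ_k) is a Gaussian with mean μ_k and precision Λ_k.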

Under DISCOVER, topic modeling and text clustering are performed simultaneously by Bayesian reasoning. The latter consists in learning the values of the latent random variables of DISCOVER via posterior inference [20], [23], [45] with parameter estimation. More precisely, collapsed Gibbs sampling is used for the a-posteriori inference of the assignments of documents to clusters, while parameter estimation is utilized for the distribution over cluster memberships as well as the topic distributions of clusters and documents.
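The following schematic skeleton illustrates the collapsed Gibbs step over document-cluster assignments. The function log_score stands in for the collapsed conditional derived in Section 4 (which marginalizes out the distribution over clusters, the cluster-level topic distributions and the Gaussian topic parameters); it is a placeholder for illustration, not the paper's actual derivation.

```python
import numpy as np

def collapsed_gibbs(n_docs, n_clusters, n_iters, log_score, rng=None):
    """log_score(d, c, assignments, counts): unnormalized log-probability of
    assigning document d to cluster c, with d's own statistics removed."""
    rng = rng or np.random.default_rng()
    assignments = rng.integers(n_clusters, size=n_docs)      # random initialization
    counts = np.bincount(assignments, minlength=n_clusters)  # cluster sizes

    for _ in range(n_iters):
        for d in range(n_docs):
            counts[assignments[d]] -= 1                      # remove d's contribution
            logp = np.array([log_score(d, c, assignments, counts)
                             for c in range(n_clusters)])
            p = np.exp(logp - logp.max())                    # numerically stable softmax
            assignments[d] = rng.choice(n_clusters, p=p / p.sum())
            counts[assignments[d]] += 1                      # restore statistics
    return assignments
```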

An extensive comparative experimentation over benchmark corpora demonstrates the effectiveness of DISCOVER in clustering texts and coherently uncovering their semantic topics. Notably, the evaluation of topic modeling performance is articulated into a quantitative and a qualitative assessment. The quantitative assessment accounts for intrinsic and extrinsic criteria, corresponding respectively to the semantic coherence of the inferred topics and the classification effectiveness enabled by such topics. The qualitative assessment is a case study that elucidates the output of DISCOVER on one of the chosen benchmark collections. The experimentation also investigates the time efficiency and scalability of DISCOVER with the size of both the underlying text corpus and the word vectors.
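As an illustration of an intrinsic criterion, the sketch below computes UMass coherence, a common measure of topic coherence based on document co-occurrence of a topic's top-ranked words. Whether this is the exact measure adopted in Section 5 is not stated in this excerpt, so treat it as an assumption:

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence of one topic. top_words: the topic's top-ranked words;
    docs: list of documents, each a set of words. Assumes every top word
    occurs in at least one document (otherwise the denominator is zero)."""
    def doc_freq(*words):
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            co_occur = doc_freq(top_words[i], top_words[j]) + 1  # +1 smoothing
            score += math.log(co_occur / doc_freq(top_words[j]))
    return score  # higher (closer to zero) indicates a more coherent topic
```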

The originality of DISCOVER lies in the Bayesian probabilistic formalization of an unprecedented and effective interplay between topic modeling in the space of word vectors and a particular instance of text clustering, which also involves the summarization/explanation of cluster semantics. The innovative contributions of this article are summarized below:

  • The synergic pairing of text clustering with topic modeling is explored through an innovative approach.

  • A Bayesian probabilistic generative model of text corpora, i.e., DISCOVER (Document topIcS and Clusters from wOrd VEctoRs), is developed.

  • Under DISCOVER, text clustering is integrated with topic modeling as interacting latent factors, which influence content generation.

  • Word vectors are suitably utilized under DISCOVER, in order to account for the syntactic as well as semantic regularities across words.

  • The inherent semantics of the uncovered clusters is clearly explained through intelligible topic distributions.

  • The mathematical and algorithmic details of collapsed Gibbs sampling with parameter estimation are derived to perform text clustering jointly with topic modeling.

  • An empirical comparative evaluation of DISCOVER is conducted over benchmark corpora.

  • A new class of competitors is specifically introduced to contrast DISCOVER against pipelines of established approaches to text clustering as well as topic modeling.

  • The results of this approach on real-world document corpora are elucidated through an explicative case study.

The rest of this article is structured as follows: notation and preliminary concepts are introduced in Section 2; the DISCOVER model is developed in Section 3; collapsed Gibbs sampling with parameter estimation is derived in Section 4; the empirical assessment of DISCOVER is presented in Section 5; a review of seminal related works is provided in Section 6; finally, conclusions are drawn in Section 7, where future research is also highlighted.

Section snippets

Preliminaries

Let D be a text corpus on a vocabulary V. V is a set comprising V words, i.e., V = {w_1, …, w_V}. D is a collection of D documents, i.e., D = {d_1, …, d_D}. Additionally, any document d of D actually comprises n_d lexical elements from V, i.e., d = {w_{d,1}, …, w_{d,n_d} | w_{d,n} ∈ V with n = 1, …, n_d}. For the purpose of capturing the syntactic as well as semantic regularities across words in D, any document d of D is also represented as a corresponding sequence of word vectors.
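As a minimal illustration of this representation, the sketch below maps a tokenized document to its sequence of word vectors via a lookup table of embeddings. The toy table and its three-dimensional vectors are assumptions; in practice the embeddings would come from a pretrained model such as word2vec.

```python
import numpy as np

# Toy embedding table (word -> R^3); real word vectors would be pretrained.
embeddings = {
    "topic":   np.array([0.2, 0.7, 0.1]),
    "cluster": np.array([0.3, 0.6, 0.2]),
    "model":   np.array([0.9, 0.1, 0.4]),
}

def to_word_vectors(document):
    """Map a tokenized document to its sequence of word vectors,
    skipping out-of-vocabulary tokens."""
    return np.array([embeddings[w] for w in document if w in embeddings])

doc = ["topic", "model", "cluster", "unknown"]
X = to_word_vectors(doc)  # shape (3, 3): one row per in-vocabulary word
```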

The DISCOVER Model

DISCOVER (Document topIcS and Clusters from wOrd VEctoRs) is a generative latent-factor model of document collections with their respective clusters and topics. Under DISCOVER, any corpus D is conceived as the only observed result of a Bayesian probabilistic generative process. In this process, the constituting elements of D, together with β, Θ, Θ^(D), Z, π and C, are random variables. The random variables in β, Θ, Z, π and C are regarded as latent factors. These govern the generation of D, although their values are unobserved.

Posterior inference

DISCOVER integrates text clustering with topic modeling by means of corresponding latent random variables, which interact in the Bayesian probabilistic generation of document corpora. Thus, given a corpus D, both tasks are performed under DISCOVER through Bayesian reasoning. The latter involves the inference of a posterior distribution Pr(β, Θ, Z, π, C | D), with which to trace back to the latent random variables in β, Θ, Z, π and C. This amounts implicitly to carrying out text clustering jointly with topic modeling.
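In standard Bayesian terms (using the paper's symbols), the inference target and the reason exact computation is avoided can be stated as:

```latex
\Pr(\beta, \Theta, Z, \pi, C \mid D) \;=\; \frac{\Pr(D, \beta, \Theta, Z, \pi, C)}{\Pr(D)},
```

where the evidence Pr(D) requires marginalizing the joint distribution over all latent random variables and is intractable in general; this is the usual motivation for resorting to approximate schemes such as the collapsed Gibbs sampler derived in Section 4.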

Empirical evaluation

A thorough experimentation of our approach was carried out on real-world benchmark text corpora, with manifold purposes:

  • evaluating its effectiveness in clustering corpora of texts and uncovering their topics;

  • assessing whether the integration of the two tasks is actually more effective than each task in isolation;

  • assessing whether the integration of the two tasks is actually more effective than pipelining both tasks through a trivial sequential arrangement;

  • studying its time efficiency and scalability.

Related works

Topic models are meant to represent and uncover the themes of a text corpus [7], [41]. The spectrum of topic model applications is very wide, encompassing information retrieval, natural language processing, computer vision, relevance judgments, social media analysis, sentiment analysis as well as geographic topic modeling [9], [22], [19], [15], [26], [31], [11], [17]. There are two broad families of topic models: traditional and enhanced. Traditional topic models, such as [25], [9], [8], [42],

Conclusions

A new model-based approach to combining text clustering with topic modeling has been presented. Both tasks are seamlessly and synergically integrated under DISCOVER, a Bayesian probabilistic generative model of text collections. Under DISCOVER, documents comprise word vectors, in order for the syntactic as well as semantic regularities across words to be captured. Document clusters and their topics are intended as interacting latent factors, which rule content generation. These latent factors are unveiled through collapsed Gibbs sampling with parameter estimation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Gianni Costa: Conceptualization, Methodology, Software, Writing - original draft, Visualization, Investigation, Supervision, Validation, Writing - review & editing. Riccardo Ortale: Conceptualization, Methodology, Software, Writing - original draft, Visualization, Investigation, Supervision, Validation, Writing - review & editing.

References (50)

  • S.J. Gershman et al., A tutorial on Bayesian nonparametric models, J. Math. Psychol. (2012)
  • C. Aggarwal et al., A survey of text clustering algorithms
  • M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E.D. Trippe, J.B. Gutierrez, K. Kochut, A brief survey of text mining: ...
  • C. Andrieu et al., An introduction to MCMC for machine learning, Mach. Learn. (2003)
  • Y. Bengio et al., A neural probabilistic language model, J. Mach. Learn. Res. (2003)
  • P. Berkhin, A survey of clustering data mining techniques, in: Grouping Multidimensional Data (2006)
  • C.M. Bishop, Pattern Recognition and Machine Learning (2006)
  • D. Blei et al., Topic models, in: Text Mining: Classification, Clustering, and Applications, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series (2009)
  • D.M. Blei, J.D. Lafferty, Correlated topic models, in: Proc. of Advances in Neural Information Processing Systems, ...
  • D.M. Blei et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003)
  • G.E.P. Box et al., Bayesian Inference in Statistical Analysis (1992)
  • J. Boyd-Graber et al., Applications of topic models, Found. Trends Inf. Retrieval (2017)
  • D. Cai et al., Document clustering using locality preserving indexing, IEEE Trans. Knowl. Data Eng. (2005)
  • D. Cai et al., Locally consistent concept factorization for document clustering, IEEE Trans. Knowl. Data Eng. (2011)
  • M.E. Celebi (ed.), Partitional Clustering Algorithms (2015)
  • Y. Cha, J. Cho, Social-network analysis using topic models, in: Proc. of Int. ACM SIGIR Conf. on Research and ...
  • J. Chang et al., Reading tea leaves: how humans interpret topic models
  • G. Costa et al., Marrying community discovery and role analysis in social media via topic modeling, in: Proc. of Pacific-Asia Conference on Knowledge Discovery and Data Mining (2018)
  • R. Das et al., Gaussian LDA for topic models with word embeddings
  • L. Dietz, S. Bickel, T. Scheffer, Unsupervised prediction of citation influences, in: Proc. of Int. Conf. on Machine ...
  • A. Gelman et al., Bayesian Data Analysis (2013)
  • T.L. Griffiths, M. Steyvers, Finding scientific topics, in: Proc. of the National Academy of Sciences of the United ...
  • T. Hastie et al., The Elements of Statistical Learning (2009)
  • G. Heinrich, Parameter estimation for text analysis, Technical report, University of Leipzig, 2008. Available at ...
  • T. Hofmann, Probabilistic latent semantic indexing, in: Proc. of Int. ACM SIGIR Conf. on Research and Development in ...