1 Introduction

Large text document collections have recently become readily available online, and systematic analyses of these collections are valuable to many domains. Consider, for example, scientific article archives. (1) We want to organize the articles by subject and help users explore the archive; this is typically a large multi-label classification task. (2) We want to analyze the article browsing histories of researchers and build a recommendation system that lists relevant and interesting articles. (3) For submitted journal/conference manuscripts, we want to design a system that recommends the most suitable reviewers.

Although these problems have been well studied, they remain challenging for machine learning research because: (1) text document collections commonly have high dimensionality, and (2) it is difficult to learn a document’s semantics and the correlations between documents. Statistical modeling research has addressed these challenges and developed various approaches for analyzing text documents (Koller and Friedman 2009). In particular, topic modeling approaches (Blei 2012) provide a practical way to express the latent semantics and hidden structures of documents. As a result, these approaches have been widely used in different application domains.

Latent Dirichlet allocation (LDA) (Blei et al. 2003) is acknowledged as one of the most successful topic modeling approaches. In LDA, each document is represented by a distribution over latent topics, and each topic is described by a distribution over words. LDA places a Dirichlet prior over all the document-topic distributions, so it does not suffer from the parameter explosion and over-fitting problems of the probabilistic latent semantic indexing (PLSI) approach (Hofmann 1999). Researchers have since proposed many extensions to LDA, e.g., relaxing the LDA assumptions (Blei et al. 2010; Blei and Lafferty 2006, 2007; Doyle and Elkan 2009; Wallach 2006; Wang et al. 2009), incorporating meta data (Blei and McAuliffe 2007; Chang and Blei 2010), and applying it to other kinds of data (Li and Perona 2005; Sivic et al. 2008).

In this paper, we focus on topic modeling and investigate relaxing the assumptions of LDA. Intuitively, larger document collections may contain more latent topics. To capture the latent semantics, all documents in LDA are represented by the same K topics, which leads to a “forced topic” problem. For example, consider a large academic paper archive that covers many latent topics such as “artificial intelligence”, “data mining”, “network”, “inorganic chemistry”, “organic chemistry”, and “high-polymer chemistry”. Computer science articles may only involve the three computer-related topics, whereas chemistry articles tend to cover the three chemistry-related topics. In LDA, however, all articles must cover all six topics; that is, chemistry articles do not involve the “network” topic, but they are forced to cover it. A reasonable way to tackle the “forced topic” problem is to organize documents into different groups (e.g., computer science and chemistry) and then assign related topics to those groups (e.g., “network” to computer science and “organic chemistry” to chemistry).

Based on the discussion above, we developed an extension of the LDA model, namely group latent Dirichlet allocation (GLDA). In GLDA, there are two kinds of topics: local topics and global topics. A local topic occurs only in a subset of the corpus, whereas a global topic is ubiquitous across the whole corpus. Closely related local topics are clustered together into latent groups. Each document first selects a group, then generates a topic distribution over both the local topics of the selected group and the global topics, and finally samples words from the corresponding topic-word distributions. Based on the latent groups, GLDA models each document using its most related topics, rather than constraining every document to all the topics. We derived a variational inference algorithm and a parameter estimation procedure for GLDA, and additionally developed an online inference algorithm to model large-scale data. We conducted a number of experiments on topic modeling and document clustering to evaluate the proposed model. Our experimental results demonstrate that GLDA achieves a competitive performance when compared with state-of-the-art approaches.

The rest of this paper is organized as follows. In Sect. 2, we review topic modeling approaches. In Sect. 3, we describe the proposed GLDA model. Our evaluation results are presented in Sect. 4, and our conclusions and some potential future work are discussed in Sect. 5.

2 Topic modeling approach

In this section, we review the history of topic modeling approaches. Table 1 summarizes several important notations used in this paper.

Table 1 Notation description

To the best of our knowledge, PLSI (Hofmann 1999) was the first well known topic model for latent semantic analysis (Deerwester et al. 1990). However, it suffers from two intractable problems: parameter explosion and over-fitting. Blei et al. (2003) proposed LDA to tackle these two problems by introducing a Dirichlet prior to the latent topics. They also developed an effective variational inference algorithm to infer the model. As a result, LDA is in widespread use. As shown in Fig. 1a, the generative process of LDA is summarized as follows:

  1. For each topic \(k\)

     (a) Sample a distribution over words: \({\phi _k} \sim Dirichlet\,\left( \beta \right) \)

  2. For each document d in the corpus W

     (a) Sample a distribution over topics: \({\theta _d} \sim Dirichlet\,\left( \alpha \right) \)

     (b) For each of the \({N_d}\) words \({w_{d,n}}\)

         i. Sample a topic \({z_{d,n}} \sim Multinomial\,\left( {{\theta _d}} \right) \)

         ii. Sample a word \({w_{d,n}} \sim Multinomial\,\left( {{\phi _{{z_{d,n}}}}} \right) \)
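
To make this generative process concrete, here is a minimal illustrative sketch in Python (not the authors' code; the topic count, vocabulary size, document length, and hyper-parameter values are assumed purely for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N_d = 6, 1000, 120          # assumed: topics, vocabulary size, words in document d
alpha, beta = 0.1, 0.01           # assumed symmetric Dirichlet hyper-parameters

# 1. For each topic k, sample a distribution over words
phi = rng.dirichlet(np.full(V, beta), size=K)            # (K, V)

# 2. For document d: sample a topic distribution, then the words
theta_d = rng.dirichlet(np.full(K, alpha))               # (K,)
z_d = rng.choice(K, size=N_d, p=theta_d)                 # topic assignment per word
w_d = np.array([rng.choice(V, p=phi[z]) for z in z_d])   # observed words
```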

In topic modeling research, one active direction is to relax the LDA assumptions to further uncover more sophisticated structures in the documents. Traditionally, the extensions of LDA focus on four fundamental assumptions (Blei 2012): “bag of words”, “bag of documents”, “fixed topics”, and “independent topics”.

  1. “Bag of words” is an exchangeability assumption stating that the order of words in a document does not matter. Although this assumption is reasonable for uncovering coarse semantic structure and has computational benefits, it is unrealistic from the perspective of human cognition. Wallach (2006) proposed the bigram topic model, in which word generation depends on both the topic and the context, i.e., the previous word. Wang et al. (2007) developed the topical N-gram model, which discovers phrases using word order and adjacent topics. Boyd-Graber and Blei (2008) considered syntactic structure and proposed the syntactic topic model. These approaches model words non-exchangeably and improve the language modeling performance.

  2. “Bag of documents” is likewise an exchangeability assumption, stating that the order of documents in a collection does not matter. This assumption is unreasonable for collections that span years. Blei and Lafferty (2006) proposed the dynamic topic model, in which the topics change over time. Each topic is a sequence of distributions over words, so the model can capture dynamic latent semantics.

  3. The “fixed topics” assumption means that the number of topics in LDA is fixed and known in advance. Typically, we must determine the number of topics experimentally. To address this problem, Blei et al. (2010) developed Bayesian nonparametric topic models using the Dirichlet process (Teh et al. 2006). In such topic models, the number of topics is determined by the data itself. Furthermore, these models can explore hierarchies of topics, such as a tree of topics.

  4. “Independent topics” is the assumption that, in LDA, topics are independent of each other. The pachinko allocation model (Li and McCallum 2006) uses a directed acyclic graph of topics to describe the correlations among topics. With the same goal, the correlated topic model (Blei and Lafferty 2007) uses a logistic normal prior for the per-document topic proportions, instead of the Dirichlet prior in LDA.

There are other modified topic models that relax various LDA assumptions, for example, the spherical topic model (Reisinger et al. 2010) and the sparse topic model (sparseTM) (Reisinger et al. 2009). Wallach (2008) argued that it is unrealistic to force each document to associate with the same K topics, and proposed a cluster-based topic model (CTM), which organizes topics into different groups and individualizes each group with a group-specific Dirichlet prior over the document-topic distribution. For each document, CTM first generates a group indicator and then samples the local topic distribution from the Dirichlet prior specific to the selected group. Based on CTM, Xie and Xing (2013) further introduced global topics to capture global semantics, and proposed the multi-grain cluster topic model (MGCTM). As shown in Fig. 1b, for each word, MGCTM must choose between local and global topics, and then generates the word from a local or global topic according to this choice. In our work, we investigate how to relax the assumptions of LDA and propose the GLDA model. GLDA defines local topics specific to a group as a solution to the “forced topic” problem, and defines global topics to capture the background semantics. It samples each document-topic distribution from a Dirichlet prior that combines the selected group’s local prior with the global prior. The GLDA representation is less ambiguous than that of MGCTM. More importantly, GLDA further considers the relationships between local topics and global topics with respect to the different groups. A more detailed discussion is given in Sect. 3.5.

Fig. 1 The three topic models: a LDA, b MGCTM, and c GLDA

3 Proposed approach

In this section, we first introduce the GLDA model, and then propose the procedures for inference, parameter estimation and online learning. Finally, we compare MGCTM and GLDA in detail.

3.1 GLDA

In LDA, all documents are represented by the same K topics. This results in the “forced topic” problem, which has two aspects: (1) in practice, documents belonging to different groups might only involve some of the topics, but they are forced to cover all of them (an example is given in Sect. 1); and (2) LDA has no mechanism to capture the background semantics of a corpus, so these semantics leak into every specific topic. For instance, the words that carry the background semantics are typically ubiquitous and occur frequently in the corpus. As a result, these uninformative words might dominate topics, e.g., “introduction” in “network” and “organic chemistry”. This behavior reduces the expressiveness of the topics.

To address the “forced topic” problem described above, we extended LDA to the GLDA model. In GLDA, we assume that (1) there is a corpus-level multinomial distribution \(\pi \), which generates the group indicator for each document; (2) each group c corresponds to Kl local topics with a Dirichlet prior \(\alpha _c^{(l)}\); and (3) to capture the background semantics, all documents share Kg global topics with a Dirichlet prior \(\alpha ^{(g)}\). To generate document d, we first choose a group indicator \(\eta _d\) from the distribution \(\pi \). We combine the local Dirichlet prior of group \(\eta _d\) with the global Dirichlet prior to obtain a merged Kd-dimensional Dirichlet prior, that is, \({\alpha _d} = \left[ {\alpha _{{\eta _d}}^{(l)},{\alpha ^{(g)}}} \right] \). We then sample the document-topic distribution \(\theta _d\) over the local topics of the selected group \(\eta _d\) and the global topics (the “selected topics”) from the Dirichlet prior \(\alpha _d\). The words are then generated as in LDA.

As shown in Fig. 1c, the generative process of GLDA is as follows:

  1. For each topic \(k\)

     (a) Sample a distribution over words: \({\phi _k} \sim Dirichlet\,\left( \beta \right) \)

  2. For each document d in the corpus W

     (a) Sample a group: \({\eta _d} \sim Multinomial\,(\pi )\)

     (b) Sample a distribution over the “selected topics”: \({\theta _d} \sim Dirichlet\,({\alpha _d})\)

     (c) For each of the \({N_d}\) words \({w_{d,n}}\)

         i. Sample a topic \({z_{d,n}} \sim Multinomial\,\left( {{\theta _d}} \right) \)

         ii. Sample a word \({w_{d,n}} \sim Multinomial\,\left( {{\phi _{{z_{d,n}}}}} \right) \)
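
The step that differs from LDA is the group draw and the merged prior \({\alpha _d}\). The following minimal Python sketch illustrates it (again not the authors' code; all sizes, hyper-parameters, and the layout of the topic-word matrix are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
C, Kl, Kg, V, N_d = 4, 5, 10, 1000, 120       # assumed: groups, local/global topics, vocab, doc length
pi = rng.dirichlet(np.ones(C))                 # corpus-level group proportions
alpha_local = np.full((C, Kl), 0.5)            # assumed local Dirichlet priors, one row per group
alpha_global = np.full(Kg, 0.5)                # assumed global Dirichlet prior
phi = rng.dirichlet(np.full(V, 0.01), size=C * Kl + Kg)   # all topic-word distributions

# Document d: choose a group, merge its local prior with the global prior,
# then sample the topic distribution over the "selected topics"
eta_d = rng.choice(C, p=pi)
alpha_d = np.concatenate([alpha_local[eta_d], alpha_global])     # K_d = Kl + Kg
theta_d = rng.dirichlet(alpha_d)

# Map the K_d selected topics onto rows of phi: group eta_d's local topics, then the global ones
topic_ids = np.concatenate([eta_d * Kl + np.arange(Kl), C * Kl + np.arange(Kg)])
z_d = rng.choice(len(topic_ids), size=N_d, p=theta_d)
w_d = np.array([rng.choice(V, p=phi[topic_ids[z]]) for z in z_d])
```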

We can summarize model parameters as \(U = \left\{ {\pi ,\;\left\{ {\alpha _c^{(l)}} \right\} _{c = 1}^C,{\alpha ^{(g)}},\beta } \right\} \) and the latent variables as \(H = \left\{ {\left\{ {{z_{d,n}}} \right\} _{d = 1,n = 1}^{d = D,n = {N_d}},\;\left\{ {{\eta _d}} \right\} _{d = 1}^D,\;\left\{ {{\theta _d}} \right\} _{d = 1}^D,\;\left\{ {{\phi _k}} \right\} _{k = 1}^{KK}} \right\} \).

Reviewing this generative process, we argue that GLDA solves the “forced topic” problem to some extent. On the one hand, in contrast to LDA, GLDA uses a two-stage procedure to generate topics: for each document, once a group is chosen, only the topics of this group (together with the global topics) can be used to describe the document. On the other hand, GLDA introduces global topics to gather the words that describe the background semantics, which helps to purify the specific topics.

3.2 Inference

Given a corpus W, the key inference problem with respect to GLDA is to compute the posterior distribution of the latent variables \(p\left( {H|W,U} \right) \). Because this posterior distribution is intractable to estimate, we use the variational inference (Blei et al. 2003) algorithm for approximate estimation.

The basic idea behind variational inference is to use Jensen’s inequality to obtain the tightest possible lower bound on the log likelihood. To achieve this, we introduce a variational distribution \(q\left( {H|\varOmega } \right) \) (see Fig. 2), with free variational parameters \(\varOmega = \left\{ {\left\{ {{{\tilde{\pi }}_d}} \right\} _{d = 1}^D,\;\left\{ {{{\tilde{\alpha }}_d}} \right\} _{d = 1}^D,\;\left\{ {{{\tilde{\beta }}_k}} \right\} _{k = 1}^{KK},\;\left\{ {{{\tilde{\theta }}_{d,n}}} \right\} _{d = 1,n = 1}^{d = D,n = {N_d}}} \right\} \), obtained by removing the coupling edges and nodes in GLDA. That is,

$$\begin{aligned} q(H|\varOmega ) = \prod \limits _{k = 1}^{KK} {q({\phi _k}|{{\tilde{\beta }}_k})} \prod \limits _{d = 1}^D {\left( {q({\eta _d}|{{\tilde{\pi }}_d})q({\theta _d}|{{\tilde{\alpha }}_d})\prod \limits _{n = 1}^{{N_d}} {q(z|{{\tilde{\theta }}_{d,n}})} } \right) } \end{aligned}$$
(1)

where \(\left\{ {{{\tilde{\alpha }}_d}} \right\} _{d = 1}^D\) and \(\left\{ {{{\tilde{\beta }}_k}} \right\} _{k = 1}^{KK}\) are Dirichlet parameters; and \(\left\{ {{{\tilde{\pi }}_d}} \right\} _{d = 1}^D\) and \(\left\{ {{{\tilde{\theta }}_{d,n}}} \right\} _{d = 1,n = 1}^{d = D,n = {N_d}}\) are multinomial distribution parameters.

Fig. 2 The graphical model representation of the variational distribution

We transformed the task of finding the tightest lower bound on the log likelihood into the problem of maximizing the following lower bound:

$$\begin{aligned} \fancyscript{L}\left( {\varOmega |U} \right) = {E_q}\left[ {\log p\left( {H,W|U} \right) } \right] - {E_q}\left[ {\log q\left( {H|\varOmega } \right) } \right] \end{aligned}$$
(2)

which is described in the Appendix.

We use the fixed point method to maximize this lower bound with respect to the free variational parameters \(\varOmega \). The derivation of this process is also shown in the Appendix. The updating rules are:

$$\begin{aligned}&{{\tilde{\pi }}_{d,c}} \propto {\pi _c}\nonumber \\&\quad \times \exp \left( \begin{array}{l} \log \varGamma \left( {\sum \limits _{k = 1}^{Kd} {\alpha _k^{(c)}} } \right) - \sum \limits _{k = 1}^{Kd} {\log \varGamma \left( {\alpha _k^{(c)}} \right) } \\ + \sum \limits _{k = 1}^{Kd} {\left( {\alpha _k^{(c)} - 1} \right) \left( {\varPsi \left( {{{\tilde{\alpha }}_{d,k}}} \right) - \varPsi \left( {\sum \limits _{j = 1}^{Kd} {{{\tilde{\alpha }}_{d,j}}} } \right) } \right) } \\ + \sum \limits _{n = 1}^{{N_d}} {\sum \limits _{k = 1}^{Kl} {{{\tilde{\theta }}_{d,n,k}}\left( {\varPsi \left( {{{\tilde{\beta }}_{c \cdot k,{w_{dn}}}}} \right) - \varPsi \left( {\sum \limits _{j = 1}^V {{{\tilde{\beta }}_{c \cdot k,j}}} } \right) } \right) } } \\ \end{array} \right) \end{aligned}$$
(3)
$$\begin{aligned} {\tilde{\theta }_{d,n,k}} \propto \exp \left( \begin{array}{l} \left( {\varPsi \left( {{{\tilde{\alpha }}_{d,k}}} \right) - \varPsi \left( {\sum \limits _{j = 1}^{Kd} {{{\tilde{\alpha }}_{d,j}}} } \right) } \right) \\ + \sum \limits _{c = 1}^C {{{\tilde{\pi }}_{d,c}}\left( {\varPsi \left( {{{\tilde{\beta }}_{c \cdot k,{w_{dn}}}}} \right) - \varPsi \left( {\sum \limits _{j = 1}^V {{{\tilde{\beta }}_{k,j}}} } \right) } \right) }\\ \end{array} \right) \end{aligned}$$
(4)
$$\begin{aligned} {\tilde{\alpha }_{d,k}} = \sum \limits _{c = 1}^C {{{\tilde{\pi }}_{d,c}}\alpha _k^{(c)}} + \sum \limits _{n = 1}^{{N_d}} {{{\tilde{\theta }}_{d,n,k}}} \end{aligned}$$
(5)

where \(\alpha ^{(c)}\) is the Dirichlet prior that combines the local topics specific to group c with the global topics (i.e., \({\alpha ^{(c)}} = \left[ {\alpha _c^{(l)},{\alpha ^{(g)}}} \right] \)); \({\tilde{\beta }_{c \cdot k}}\) corresponds to the kth \(\tilde{\beta }\) of group c; \(\varGamma \left( \cdot \right) \) is the gamma function; and \(\varPsi \left( \cdot \right) \) is the digamma function. Then,

$$\left\{ \begin{array}{ll} {{\tilde{\beta }}_{k,v}} = {\beta _v} + \sum \limits _{d = 1}^D {\sum \limits _{n = 1}^{{N_d}} {{{\tilde{\theta }}_{d,n,k}}w_{d,n}^v} }&{}\quad if\;k\;is\;global \\ {{\tilde{\beta }}_{c \cdot k,v}} = {\beta _v} + \sum \limits _{d = 1}^D {\sum \limits _{n = 1}^{{N_d}} {{{\tilde{\pi }}_{d,c}}{{\tilde{\theta }}_{d,n,k}}w_{d,n}^v} }&{} otherwise \\ \end{array} \right.$$
(6)

where

$$\begin{aligned} w_{d,n}^v = \left\{ \begin{array}{ll} 1&{} if\;{w_{d,n}} = v \\ 0&{} otherwise \\ \end{array} \right. \end{aligned}$$

The full variational inference procedure is summarized in Algorithm 1.

Algorithm 1 The variational inference procedure for GLDA
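
To show how Eqs. (3)–(5) fit together in the per-document coordinate ascent, the following Python sketch implements one plausible version of these updates with numpy/scipy. The data layout (a per-group array for local topics and a shared array for global topics), the initialization, and all variable names are our own assumptions rather than a released implementation:

```python
import numpy as np
from scipy.special import digamma, gammaln

def e_log_dirichlet(x):
    """E_q[log p] under Dirichlet(x), taken along the last axis."""
    return digamma(x) - digamma(np.sum(x, axis=-1, keepdims=True))

def update_document(word_ids, pi, alpha_local, alpha_global,
                    beta_local, beta_global, n_iter=50):
    """
    word_ids     : (N_d,) token word indices of document d
    pi           : (C,)  corpus-level group proportions
    alpha_local  : (C, Kl) local topic Dirichlet priors, one row per group
    alpha_global : (Kg,)   global topic Dirichlet prior
    beta_local   : (C, Kl, V) variational word Dirichlets of the local topics
    beta_global  : (Kg, V)    variational word Dirichlets of the global topics
    Returns (pi_d, alpha_d, theta_d), the variational parameters of Eqs. (3)-(5).
    """
    C, Kl = alpha_local.shape
    Kg = alpha_global.shape[0]
    Nd, Kd = len(word_ids), Kl + Kg

    # Expected log word probabilities for the observed tokens
    elb_local = e_log_dirichlet(beta_local)[:, :, word_ids]   # (C, Kl, Nd)
    elb_global = e_log_dirichlet(beta_global)[:, word_ids]    # (Kg, Nd)

    # Per-group merged priors [alpha_c^(l), alpha^(g)], cf. alpha^(c) in the text
    alpha_c = np.hstack([alpha_local, np.tile(alpha_global, (C, 1))])  # (C, Kd)

    pi_d = np.full(C, 1.0 / C)                 # \tilde{pi}_d
    alpha_d = np.ones(Kd)                      # \tilde{alpha}_d
    theta_d = np.full((Nd, Kd), 1.0 / Kd)      # \tilde{theta}_{d,n}

    for _ in range(n_iter):
        e_theta = e_log_dirichlet(alpha_d)     # Psi(alpha_dk) - Psi(sum_j alpha_dj)

        # Eq. (4): per-token topic responsibilities
        log_theta = np.tile(e_theta, (Nd, 1))
        log_theta[:, :Kl] += np.einsum('c,ckn->nk', pi_d, elb_local)
        log_theta[:, Kl:] += elb_global.T
        log_theta -= log_theta.max(axis=1, keepdims=True)
        theta_d = np.exp(log_theta)
        theta_d /= theta_d.sum(axis=1, keepdims=True)

        # Eq. (5): merged Dirichlet parameter of the document
        alpha_d = pi_d @ alpha_c + theta_d.sum(axis=0)

        # Eq. (3): group responsibilities
        e_theta = e_log_dirichlet(alpha_d)
        log_pi = (np.log(pi)
                  + gammaln(alpha_c.sum(axis=1)) - gammaln(alpha_c).sum(axis=1)
                  + (alpha_c - 1.0) @ e_theta
                  + np.einsum('nk,ckn->c', theta_d[:, :Kl], elb_local))
        log_pi -= log_pi.max()
        pi_d = np.exp(log_pi)
        pi_d /= pi_d.sum()

    return pi_d, alpha_d, theta_d
```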

3.3 Parameter estimation

In this section, we consider parameter estimation for GLDA. Given a corpus, we wish to optimize the model parameters (U) by maximum likelihood estimation. Again, the likelihood function \(p\left( {W|U} \right) \) is intractable to compute, so we use the variational expectation maximization (variational EM) algorithm, which alternately updates the free variational parameters (\(\varOmega \)) and the model parameters (U).

Similar to Algorithm 1, the variational EM algorithm is summarized in Algorithm 2. In the E-step, we infer \({\tilde{\pi }}_d,{\tilde{\alpha }}_d,{\tilde{\theta }}_d\) using Eqs. (3), (4) and (5) from Sect. 3.2. In the M-step, we estimate \({\tilde{\beta }}_k\) and U. \({\tilde{\beta }}_k\) is also updated using Eq. (6). The Dirichlet parameters (\(\alpha ^{(l)},{\alpha ^{(g)}},\beta \)) are optimized by the Newton–Raphson method described in (Blei et al. 2003), and the multinomial parameter \(\pi \) is updated using:

$$\begin{aligned} {\pi _c} = \frac{{\sum \nolimits _{d = 1}^D {{{\tilde{\pi }}_{d,c}}} }}{D} \end{aligned}$$
(7)
Algorithm 2 The variational EM procedure for GLDA
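
A simplified sketch of the EM loop is given below. It keeps the Dirichlet priors fixed for brevity (the full Algorithm 2 re-estimates them by Newton–Raphson) and reuses the hypothetical update_document routine sketched in Sect. 3.2:

```python
import numpy as np

def variational_em(docs, pi, alpha_local, alpha_global, beta_prior,
                   beta_local, beta_global, n_iters=100):
    """docs is a list of word-id arrays; priors are held fixed in this sketch."""
    D = len(docs)
    C, Kl, V = beta_local.shape
    for _ in range(n_iters):
        # E-step: per-document variational parameters, Eqs. (3)-(5)
        stats = [update_document(w, pi, alpha_local, alpha_global,
                                 beta_local, beta_global) for w in docs]
        # M-step, Eq. (6): accumulate expected word counts for every topic
        new_local = np.full_like(beta_local, beta_prior)
        new_global = np.full_like(beta_global, beta_prior)
        for w, (pi_d, _, theta_d) in zip(docs, stats):
            np.add.at(new_global.T, w, theta_d[:, Kl:])
            for c in range(C):
                np.add.at(new_local[c].T, w, pi_d[c] * theta_d[:, :Kl])
        beta_local, beta_global = new_local, new_global
        # M-step, Eq. (7): group proportions
        pi = np.array([pi_d for pi_d, _, _ in stats]).sum(axis=0) / D
    return pi, beta_local, beta_global
```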

Comparison with asymmetric LDA GLDA organizes topics into groups. To specialize the different groups, we apply asymmetric Dirichlet priors to the local topics, so GLDA is by default an asymmetric model. It appears similar to the best-performing version of asymmetric LDA, i.e., the AS form (asymmetric topic Dirichlet prior and symmetric word Dirichlet prior) suggested in (Wallach et al. 2009a), so we clarify the relationship between the two models. When inferring a document d, GLDA may be equivalent to AS-form LDA for a particular value of \(\tilde{\pi }_d\). However, this is infrequent in practice; and more importantly, the values of \(\tilde{\pi }_d\) differ from document to document. In other words, we regard GLDA and asymmetric LDA as two distinct models.

3.4 Online learning

In this section, we extend Algorithm 2 to an online inference algorithm (Online GLDA) for modeling large-scale data. This work is based on the spirit of stochastic variational inference (SVI) (Hoffman and Wang 2013; Hoffman and Blei 2010), where each iteration uses only a mini-batch of the documents to generate a stochastic gradient, and a stochastic optimization algorithm is used to learn the global parameters of interest.

In the GLDA context, the local variational parameters are \({\tilde{\pi }}_d,{\tilde{\alpha }}_d,{\tilde{\theta }}_d\) and the global parameters are \({\tilde{\beta }}_k, \alpha ^{(l)},{\alpha ^{(g)}},\beta , \pi \). At each iteration t, we first randomly sample M documents and compute their optimal local variational parameters using Eqs. (3), (4) and (5). We then update the global parameters given a learning rate \(\rho _t\) as follows:

For \(\tilde{\beta }\), we compute the natural gradient \({\nabla _{{\tilde{\beta }}}}\fancyscript{L}\left( {\varOmega |U} \right) \) and obtain the updating rule:

$$\left\{ \begin{array}{ll} {{\tilde{\beta }}_{k,v}} \leftarrow {{\tilde{\beta }}_{k,v}} + {\rho _t}\left( { - {{\tilde{\beta }}_{k,v}} + {\beta _v} + \frac{D}{M}\sum \limits _{d = 1}^M {\sum \limits _{n = 1}^{{N_d}} {{{\tilde{\theta }}_{d,n,k}}w_{d,n}^v} } } \right) &{}\quad \;if\;k\;is\;global \\ {{\tilde{\beta }}_{c \bullet k,v}} \leftarrow {{\tilde{\beta }}_{c \bullet k,v}} + {\rho _t}\left( { - {{\tilde{\beta }}_{c \bullet k,v}} + {\beta _v} + \frac{D}{M}\sum \limits _{d = 1}^M {\sum \limits _{n = 1}^{{N_d}} {{{\tilde{\pi }}_{d,c}}{{\tilde{\theta }}_{d,n,k}}w_{d,n}^v} } } \right) &{}\quad \;\;otherwise \\ \end{array} \right.$$
(8)

In terms of \(\alpha ^{(l)},{\alpha ^{(g)}},\beta \), we extend the Newton–Raphson algorithm to the online case as in (Hoffman and Blei 2010):

$$\left\{ \begin{array}{l} \alpha _{c,k}^{(l)} \leftarrow \alpha _{c,k}^{(l)} - {\rho _t}{{\hat{\alpha }}^{(l)}} \\ \alpha _k^{(g)} \leftarrow \alpha _k^{(g)} - {\rho _t}{{\hat{\alpha }}^{(g)}} \\ {\beta _v} \leftarrow {\beta _v} - {\rho _t}\widehat{\beta } \\ \end{array} \right.$$
(9)

where \(\hat{\alpha }^{(l)}\) and \(\hat{\alpha }^{(g)}\) are the inverse of the Hessian times the gradients \({\nabla _{{\alpha ^{(l)}}}}\fancyscript{L}\left( {\varOmega |U} \right) \) and \({\nabla _{{\alpha ^{(g)}}}}\fancyscript{L}\left( {\varOmega |U} \right) \), respectively; and \(\hat{\beta }\) is the inverse of the Hessian times the gradient \({\nabla _{\beta }}\fancyscript{L}\left( {\varOmega |U} \right) \).

Unlike the global variational parameters above, the update of \(\pi \) is a constrained maximization because \(\sum \nolimits _{c = 1}^C {{\pi _c}} = 1\). We must therefore subtract a projection term \(Z\) (Zinkevich 2003) when updating \(\pi \):

$$\begin{aligned} {\pi _c} \leftarrow {\pi _c} + {\rho _t}{\pi _c}\left( {{{D\sum \limits _{d = 1}^M {{{\widetilde{\pi }}_{d,c}}} }\bigg / {\left( {M{\pi _c}} \right) }} - Z} \right) , \quad \quad where\;Z = \frac{\left\langle \pi ,{{D\sum \nolimits _{d = 1}^M {{{\widetilde{\pi }}_d}} }\bigg / {\left( {M\pi } \right) }} \right\rangle }{C} \end{aligned}$$
(10)

and \(\left\langle {\;,\;} \right\rangle \) is the inner product.

The Online GLDA is summarized in Algorithm 3.

Algorithm 3 The online inference procedure (Online GLDA)
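
The sketch below illustrates one stochastic update of the global parameters, combining the learning rate of Eq. (15) with the updates of Eqs. (8) and (10); the prior updates of Eq. (9) are omitted, and the mini-batch statistics are assumed to come from the hypothetical per-document routine of Sect. 3.2:

```python
import numpy as np

def learning_rate(t, tau=1024, kappa=1.0):
    """rho_t = (t + tau)^(-kappa), cf. Eq. (15)."""
    return (t + tau) ** (-kappa)

def online_global_update(t, D, docs, stats, beta_prior, beta_local, beta_global, pi):
    """
    docs  : list of M word-id arrays sampled at iteration t
    stats : matching list of (pi_d, alpha_d, theta_d) from the local variational step
    """
    rho = learning_rate(t)
    M = len(docs)
    C, Kl, V = beta_local.shape

    # Noisy sufficient statistics from the mini-batch
    ss_local = np.zeros_like(beta_local)
    ss_global = np.zeros_like(beta_global)
    pi_hat = np.zeros(C)
    for w, (pi_d, _, theta_d) in zip(docs, stats):
        np.add.at(ss_global.T, w, theta_d[:, Kl:])
        for c in range(C):
            np.add.at(ss_local[c].T, w, pi_d[c] * theta_d[:, :Kl])
        pi_hat += pi_d

    # Eq. (8): stochastic natural-gradient step on the topic-word Dirichlets
    beta_global = (1 - rho) * beta_global + rho * (beta_prior + (D / M) * ss_global)
    beta_local = (1 - rho) * beta_local + rho * (beta_prior + (D / M) * ss_local)

    # Eq. (10): constrained update of pi with the projection term Z
    grad = D * pi_hat / (M * pi)
    Z = np.dot(pi, grad) / C
    pi = pi + rho * pi * (grad - Z)
    return beta_local, beta_global, pi
```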

3.5 Comparison with MGCTM

MGCTM and GLDA have some similarities, so we investigated the relationship between the two topic models. In fact, both represent documents using the local topics of the groups and the global topics. However, the relationships between the two kinds of topics are different in MGCTM and GLDA.

A graphical model representation is shown in Fig. 1b. The generative process of MGCTM (Xie and Xing 2013) is as follows. For each document d, first select a group \(\eta _d\) from the distribution \(\pi \). Then sample a local topic distribution \(\theta _{{\eta _d}}^l\) from the Dirichlet prior \(\alpha _{{\eta _d}}^{(l)}\) of the selected group, and sample a global topic distribution \(\theta ^g\) from the Dirichlet prior \(\alpha ^{(g)}\). A Bernoulli distribution \({\omega _d}\), sampled from the Beta prior \(\gamma \), is used to choose between local and global topics. To generate a word \({w_{d,n}}\), we first draw a topic indicator \({\delta _{d,n}}\) from the distribution \({\omega _d}\). If \({\delta _{d,n}} = 1\), the word \({w_{d,n}}\) is assigned a local topic of group \(\eta _d\); if \({\delta _{d,n}} = 0\), the word \({w_{d,n}}\) is assigned a global topic. Finally, the word is generated as in LDA.

Let \(p\left( {{t^g} = k} \right) \) be the probability of generating the kth global topic. In MGCTM, \(p\left( {{t^g} = k} \right) = p\left( {{\delta _{d,n}} = 0|{\omega _d}} \right) \cdot p\left( {{t^g} = k|{\theta ^g}} \right) \), and its expectation is:

$$\begin{aligned} {E_p}\left[ k \right] = \frac{{{\gamma _1}}}{{{\gamma _1} + {\gamma _2}}} \times \frac{{\alpha _k^{(g)}}}{{\sum \nolimits _{i = 1}^{Kg} {\alpha _i^{(g)}} }} \end{aligned}$$
(11)

In contrast, GLDA samples the document-topic distribution from the combined Dirichlet prior \({\alpha _d} = \left[ {\alpha _{{\eta _d}}^{(l)},{\alpha ^{(g)}}} \right] \). Therefore, the corresponding expectation in GLDA equals:

$$\begin{aligned} {E_p}\left[ k \right] = \frac{{\alpha _k^{(g)}}}{{\sum \nolimits _{i = 1}^{Kd} {{\alpha _{d,i}}} }} \end{aligned}$$
(12)

We can transform Eq. (12) into:

$$\begin{aligned} {E_p}\left[ k \right] = \frac{{\sum \nolimits _{i = 1}^{Kg} {\alpha _i^{(g)}} }}{{\sum \nolimits _{i = 1}^{Kg} {\alpha _i^{(g)} + \sum \nolimits _{i = 1}^{Kl} {\alpha _{\eta _d,i}^{(l)}} } }} \times \frac{{\alpha _k^{(g)}}}{{\sum \nolimits _{i = 1}^{Kg} {\alpha _i^{(g)}} }} \end{aligned}$$
(13)

Comparing Eqs. (11) and (13), we find that the second terms are the same, and the first terms are the probabilities of choosing global topics. Because each group c has its own local topic prior \(\alpha _c^{(l)}\), the first term of Eq. (13) differs between groups. That is to say, in GLDA the relationship between local and global topics depends on the group, which is not the case in MGCTM. We argue that this assumption in GLDA is reasonable. For example, computer science articles may naturally be more likely to cover common knowledge topics (i.e., global topics) than chemistry articles. In particular, we argue that this consideration becomes more significant when modeling collections that contain many latent groups.
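
A small numeric illustration of this difference (all prior values below are made up for the example): with two groups whose local priors have different total mass, Eq. (13) gives the groups different probabilities of choosing a global topic, whereas the corresponding weight in Eq. (11) is shared by all groups:

```python
import numpy as np

alpha_g = np.array([0.5, 0.5, 0.5, 0.5])          # assumed global topic prior (Kg = 4)
alpha_l = {1: np.array([0.2, 0.2, 0.2]),          # assumed local priors of two groups (Kl = 3)
           2: np.array([2.0, 2.0, 2.0])}

# GLDA, Eq. (13): the weight on global topics depends on the selected group
for c, a in alpha_l.items():
    w_global = alpha_g.sum() / (alpha_g.sum() + a.sum())
    print(f"group {c}: P(choose a global topic) = {w_global:.2f}")   # 0.77 vs. 0.25

# MGCTM, Eq. (11): the weight gamma_1/(gamma_1 + gamma_2) is the same for all groups
gamma1, gamma2 = 1.0, 1.0                         # assumed Beta prior
print(f"MGCTM: P(choose a global topic) = {gamma1 / (gamma1 + gamma2):.2f}")  # 0.50
```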

4 Experiment

In this section, we present our results when evaluating GLDA on two problem domains, i.e., topic modeling and document clustering.

4.1 Dataset

We considered two widely used offline datasets: 20-NewsGroups (20-NG) and WebKB. 20-NG is a balanced dataset. It contains 18,821 documents, which are evenly divided into 20 related categories. We used 11,293 documents as the training data and the remaining 7,528 documents as the testing data. WebKB contains 4,199 documents divided into four categories. In contrast to 20-NG, it is an unbalanced dataset: the largest category contains 1,641 documents and the smallest only 504. We selected 2,803 documents for training and used the remaining 1,396 documents for testing.

We also chose an online collection. We randomly downloaded 3M documents from Wikipedia (Wiki) using the implementation in (Hoffman and Blei 2010). We then processed these documents using a standard vocabulary of 7,700 words, and used 2,000 randomly selected documents from the collection for testing.

4.2 Topic modeling

We evaluated the topic modeling performance of GLDA across the three selected corpora. For the offline datasets, we used three state-of-the-art topic models (LDA, Blei et al. 2003; CTM, Wallach 2008; and MGCTM, Xie and Xing 2013) as performance baselines. We downloaded the public version of LDA and implemented in-house code for CTM and MGCTM. For fair comparisons, we estimated all the hyper-parameters of these approaches using the variational EM method, and estimated GLDA using Algorithm 2. For the online collection (Wiki), we used Online LDA (Hoffman and Blei 2010) as the baseline, and GLDA was estimated using Algorithm 3. All of these models used the AS form (asymmetric topic Dirichlet prior and symmetric word Dirichlet prior) (Wallach et al. 2009a). The asymmetric topic Dirichlet priors, including the local and global topic Dirichlet priors, were all initialized as \(50/K\) and estimated using the Newton–Raphson algorithm (Blei et al. 2003). The symmetric word Dirichlet prior \(\beta \) was fixed at 0.01.

Naturally, we can view a topic model as a probability density function that generates a corpus, so the topic modeling performance can be evaluated using the likelihood of held-out test data (Wallach et al. 2009b). In our experiments, we trained all the baseline topic models and GLDA on the training data, and then compared their perplexity scores on the held-out test data. Perplexity, used by convention in language modeling, is equivalent to the inverse of the geometric mean per-word likelihood; a lower perplexity represents a better performance. Given corpora \(W\) and \(W_{test}\), the perplexity is defined as:

$$\begin{aligned} perplexity({W_{test}}) = \exp \left\{ { - \frac{{\log p\left( {{W_{test}}|W} \right) }}{{\sum \nolimits _{d = 1}^D {{N_d}}}}} \right\} \end{aligned}$$
(14)
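
For reference, Eq. (14) can be computed from per-document log likelihoods as in the short sketch below (the log-likelihood values themselves would come from the model's variational bound or another held-out estimator; the function name is ours):

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """
    log_likelihoods: log p(w_d | W) for each held-out document d
    doc_lengths:     N_d, the number of words in each held-out document
    Implements Eq. (14): exp(-sum_d log p(w_d) / sum_d N_d).
    """
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))

# e.g., perplexity([-5200.0, -4800.0], [1000, 950]) -> exp(10000/1950), roughly 169
```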

4.2.1 Qualitative evaluation

We fit GLDA to two versions of the 20-NG dataset: the original 20-NG with stop words, and a filtered 20-NG with stop words removed (378 in total). For both versions, we set \(C=20, Kl=5\) and \(Kg=20\).

Table 2 illustrates the 10 most popular words for three global topics learnt by GLDA. The global topics learnt from the original 20-NG are almost entirely filled with stop words. Because stop words are ubiquitous across all documents, they can be interpreted as background semantics; in other words, GLDA successfully captured the common semantics. The results are clearer for the filtered 20-NG: Global Topic 1 is about article writing, Global Topic 2 is about time, and Global Topic 3 is about both writing and time. These global topics clearly reflect the common semantics and can be generated in all documents.

Table 3 shows the local topics from two estimated groups learnt by GLDA. The local topics clearly cover the local semantics of each group. In Group 1, the local topics are associated with computers, where Topics 1, 2 and 3 correspond to hardware, operating systems, and networks, respectively. In Group 2, the local topics are about sports, including baseball, hockey, and games. Although the results for the original 20-NG were affected by stop words (e.g., “can”, “many” and “same”), they still effectively captured the local semantics of each group.

Overall, we found that the two-stage generation of GLDA had a positive influence when capturing the semantics. On the one hand, the local semantics were first organized at a coarse level (e.g., computer) and then further divided into a fine level (e.g., hardware and network). On the other hand, the common semantics were covered by the global topics. This framework effectively modeled the corpus, even with stop words.

Table 2 The 10 most popular words for several global topics in terms of 20-NG learnt by GLDA
Table 3 The 10 most popular words for several local topics in terms of 20-NG learnt by GLDA

4.2.2 Quantitative evaluation of offline collections

We tested the perplexity scores on the offline collections with different numbers of topics, using the filtered 20-NG and WebKB datasets. The settings for 20-NG were as follows: for MGCTM and GLDA, we fixed \(C=20\) and \(Kg=20\), and set \(Kl = 1,2, \ldots ,10\); for CTM, we set the number of local topics per group from 2 to 11 to match the total number of topics. For WebKB, in both MGCTM and GLDA, we fixed \(C=4\) and \(Kg=32\), and set \(Kl = 8,12, \ldots ,32\); for CTM, we set \(Kl = 16,20, \ldots ,40\) to match the total number of topics.

Fig. 3 The perplexity performance on the 20-NG dataset

The results for 20-NG are shown in Fig. 3. GLDA performed better than LDA and CTM. For LDA, there is a conflict: a larger K (more topics) is required to uncover the complex semantics of a large document collection, but many documents naturally involve only some of these topics, and the “forced topic” problem becomes more serious as K grows. Our experimental results confirmed this analysis: LDA performed best when \(K=40\), and its performance deteriorated for larger K. CTM organizes topics into different groups, and its performance increased with K. Unfortunately, CTM lacks a mechanism to distinguish local and global topics, so its peak performance was even worse than that of LDA on the 20-NG dataset.

Compared with MGCTM, GLDA performed better with respect to the perplexity metric. Both performed worse for \(Kl = 1,2\), because: (1) a small number of local topics is not enough to adequately cover the local semantics; and (2) relatively few local topics exaggerate the influence of the global topics (i.e., one man’s loss is another’s gain). GLDA outperformed MGCTM for \(Kl>2\), e.g., 3,190 for GLDA versus 3,620 for MGCTM at \(Kl=6\), and 3,048 versus 3,323 at \(Kl=8\). We argue that this is because GLDA considers the relationships between local and global topics with respect to the different groups (see the discussion in Sect. 3.5), and our experimental results further validate this point.

Fig. 4 The perplexity performance on the WebKB dataset

As shown in Fig. 4, GLDA also performed better on WebKB. It outperformed the two simpler models (i.e., LDA and CTM) and slightly outperformed the state-of-the-art MGCTM. Among the two simpler topic models, CTM outperformed LDA except when \(K=64\), because for the WebKB dataset we used a sufficient number of local topics to capture the local semantics. In particular, we found that the gap between MGCTM and GLDA was smaller than on the 20-NG dataset. The advantage of GLDA mainly comes from considering the relationships between local and global topics with respect to the different groups, in contrast to MGCTM; however, WebKB contains fewer groups (C = 4) than 20-NG (C = 20), so MGCTM approaches GLDA in this case.

Fig. 5 The perplexity performance with different Kl and Kg on the WebKB dataset

We also investigated how to set the numbers of local and global topics in GLDA, using fivefold cross validation to obtain reliable results. Figure 5 shows the average perplexity for different Kl and Kg on the WebKB dataset. The topic modeling performance was not very sensitive to the number of topics, and the variations were not abrupt. Larger Kl and Kg gave a better performance than small values, e.g., the best performance was achieved with Kl = 32 and Kg = 32, and the worst with Kl = 8 or 16 and Kg = 8. More importantly, we found that the performance degraded when \(Kl>Kg\). We argue that this trend is reasonable, because more global topics are intuitively required to describe the background semantics that are ubiquitous across all the documents.

4.2.3 Quantitative evaluation on Wiki

Comparison with MGCTM Because of the similarities between MGCTM and (non-online) GLDA, we further compared the two models using a larger collection. To this end, we randomly selected 50,000 documents from the entire 3M Wiki collection (mini-Wiki) for model training, and evaluated MGCTM and GLDA on the test data containing the 2,000 documents mentioned above.

Because the true number of groups in Wiki is unknown, we tested the perplexity scores using different numbers of groups. For both models, we fixed Kl = 10 and Kg = 20, and set \(C = 2,3, \ldots ,10\). The results are shown in Fig. 6. GLDA performed better than MGCTM in most cases. When the number of groups was small (e.g., C = 2, 3, 4), the gap between the two models was relatively small; as C increased, GLDA rapidly pulled ahead of MGCTM. As discussed in Sect. 3.5, the main difference between the two models is that GLDA considers the relationships between local and global topics with respect to the different groups, whereas MGCTM does not. In other words, we argue that GLDA is superior to MGCTM for relatively large values of \(C\), and these empirical results support this view.

Online learning We evaluated the performance of Online GLDA on the entire 3M Wiki collection. We set the mini-batch size M to 100 and 500. We fixed K = 100 for Online LDA (Hoffman and Blei 2010), and C = 8, Kl = 10 and Kg = 20 for Online GLDA. The learning rate in Eq. (15) was used, where the delay \(\tau \) and forgetting rate \(\kappa \) were set to 1,024 and 1, respectively.

$$\begin{aligned} {\rho _t} = {\left( {t + \tau } \right) ^{ - \kappa }} \end{aligned}$$
(15)

The results are shown in Fig. 7. Online GLDA clearly outperformed Online LDA, improving the perplexity by approximately 150 when M = 100 and by approximately 180 when M = 500. This is because a larger dataset tends to contain more topics, and by organizing the topics into groups we can cluster the relevant topics together. This experimental result shows that GLDA is useful for large-scale data.

Fig. 6 The perplexity performance across mini-Wiki

Fig. 7 The perplexity performance across Wiki: a M = 100, b M = 500

4.3 Document clustering

GLDA assumes that documents belong to groups, so it can naturally be used for clustering. We evaluated the document clustering performance of the proposed GLDA model on the filtered 20-NG and WebKB datasets. For both datasets, we removed words that occurred fewer than 10 times.

4.3.1 Metric

We evaluated the clustering performance by comparing the cluster indices produced by the clustering algorithm with the true labels. In our experiments, we used two common metrics (Cai et al. 2011; Zhang et al. 2011): clustering accuracy (AC) and normalized mutual information (NMI). For both metrics, a larger score represents a better performance.

The AC is used to evaluate the final clustering performance. Given a document d, let \(\widetilde{{y_d}}\) and \(y _d\) denote the cluster index and the true label, respectively. Then the AC can be computed by:

$$\begin{aligned} AC = \frac{{\sum \nolimits _{d = 1}^D {\delta \left( {{y_d},map\left( {\widetilde{{y_d}}} \right) } \right) } }}{D} \end{aligned}$$
(16)

where \(\delta \left( {x,y} \right) \) is a delta function that is 1 if \(x=y\) and 0 otherwise, and \(map\left( \cdot \right) \) is a function that maps each cluster to a label, as defined by the Kuhn–Munkres algorithm (Lovasz and Plummer 1986).

NMI was originally used to measure the statistical information shared between two distributions. Let \(\widetilde{Y}\) be the set of clusters obtained by the clustering algorithm and Y be the true set of labels. Their mutual information is defined as:

$$\begin{aligned} MI\left( {Y,\widetilde{Y}} \right) = \sum \limits _{{y_i} \in Y,\widetilde{{y_j}} \in \widetilde{Y}} {p\left( {{y_i},\widetilde{{y_j}}} \right) \log \left( {\frac{{p\left( {{y_i},\widetilde{{y_j}}} \right) }}{{p\left( {{y_i}} \right) p\left( {\widetilde{{y_j}}} \right) }}} \right) }. \end{aligned}$$

where \(p\left( {{y_i}} \right) \) and \(p\left( {\widetilde{{y_j}}} \right) \) denote the probabilities that a document belongs to label \(y _i\) and to cluster \(\widetilde{{y_j}}\), respectively, and \(p\left( {{y_i},\widetilde{{y_j}}} \right) \) is the joint probability that a document belongs to label \(y_i\) and cluster \(\widetilde{{y_j}}\) at the same time. We normalize \(MI\left( {Y,\widetilde{Y}} \right) \) using:

$$\begin{aligned} NMI\left( {Y,\widetilde{Y}} \right) = \frac{{MI\left( {Y,\widetilde{Y}} \right) }}{{\max \left( {H\left( Y \right) ,H\left( {\widetilde{Y}} \right) } \right) }} \end{aligned}$$
(17)

where \(H\left( Y \right) \) is the entropy of the true label set \(Y\), and \(H\left( {\widetilde{Y}} \right) \) is the entropy of the estimated cluster set \(\widetilde{Y}\).
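
For reference, both metrics can be computed as in the following sketch (integer label arrays are assumed, and using scipy's linear_sum_assignment for the Kuhn–Munkres mapping is our own choice):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Eq. (16): best one-to-one mapping of clusters to labels (Kuhn-Munkres)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                        # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)  # maximize matched documents
    return cost[rows, cols].sum() / len(y_true)

def nmi(y_true, y_pred):
    """Eq. (17): mutual information normalized by max(H(Y), H(Y_tilde))."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    joint = np.zeros((y_true.max() + 1, y_pred.max() + 1))
    for t, p in zip(y_true, y_pred):
        joint[t, p] += 1.0 / n
    p_t, p_p = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_t, p_p)[nz]))
    h_t = -np.sum(p_t[p_t > 0] * np.log(p_t[p_t > 0]))
    h_p = -np.sum(p_p[p_p > 0] * np.log(p_p[p_p > 0]))
    return mi / max(h_t, h_p)
```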

4.3.2 Performance

We selected several baseline algorithms: non-negative matrix factorization (NMF), entropy weighting K-Means (EWKM) (Jing et al. 2007), LDA (Blei et al. 2003), CTM (Wallach 2008), and MGCTM (Xie and Xing 2013). For LDA, we followed the experimental studies in (Lu et al. 2011): that is, (1) treat each topic as a cluster (assign a document to cluster x if \(x = \arg {\max _j}{\theta _j}\)); and (2) use symmetric Dirichlet priors \(\alpha \) and \(\beta \), setting \(\alpha =0.1\) and \(\beta =0.01\). For CTM, we set the number of topics to 120 for the 20-NG dataset and to 40 for the WebKB dataset. For MGCTM and the proposed GLDA model, we used 10 local topics per group and 20 global topics for the 20-NG dataset, and 32 local topics per group and 32 global topics for the WebKB dataset. Following the settings in (Xie and Xing 2013), MGCTM initialized the variational document-group distributions with the clustering results of LDA and randomly initialized the other parameters. For GLDA, we initialized the parameters in the same way as MGCTM, and we also ran another version that randomly initialized all the parameters (Ran-GLDA). For all the approaches, we averaged the results over 10 independent runs, and also performed pairwise t tests at the 5 % significance level between GLDA and the baselines.

Table 4 Performance (the average score ± standard deviation) on 20-NG

Table 4 shows the results for the 20-NG dataset. The proposed GLDA model achieved the highest scores on both the AC and NMI metrics. GLDA performed much better than the traditional approaches (i.e., NMF and EWKM), and performed competitively compared with the topic modeling approaches. It outperformed LDA by approximately 5 % in AC and 6 % in NMI, and outperformed CTM by approximately 8 % in AC and 11 % in NMI. Ran-GLDA was approximately 0.5 % better in AC and 0.3 % better in NMI than the state-of-the-art MGCTM. More importantly, GLDA outperformed MGCTM by approximately 3 % in both AC and NMI.

Table 5 Performance (the average score ± standard deviation) on WebKB

Table 5 shows the results for the WebKB dataset. As with the 20-NG dataset, GLDA outperformed all the other approaches on both metrics. For example, GLDA was approximately 4 % better than LDA in AC, and approximately 3.5 % better than CTM in NMI. GLDA also outperformed the state-of-the-art MGCTM (approximately 0.3 % better in AC and 0.6 % better in NMI). Ran-GLDA performed slightly worse than MGCTM (i.e., by 0.5 % in AC and NMI), because Ran-GLDA does not use optimized initial parameters.

Additionally, the p values obtained by the pairwise t tests are reported in Tables 4 and 5. We can clearly see that the proposed GLDA model was statistically superior to the compared algorithms in most cases [i.e., 20-NG (11/12) and WebKB (9/12)]. GLDA was clearly better than NMF, EWKM, LDA and CTM, and was slightly superior to MGCTM. We can also see that the standard deviations of the GLDA scores were smaller, which further validates the robustness of GLDA.

4.3.3 Study on the number of topics

We investigated the effect of the number of topics on document clustering. Figure 8 shows the AC and NMI performance on the WebKB dataset for different Kl and Kg. The results were very similar to those in the topic modeling evaluation: a better performance was achieved when Kl and Kg were both large. In particular, the performance deteriorated when \(Kl>Kg\) (e.g., the worst scores were obtained with Kl = 24, 28 or 32 and Kg = 20). This is because, in document clustering, an increase in the number of global topics reduces the discrimination of the local topics between groups. Therefore, in practice, we suggest the following settings for GLDA: (1) relatively large Kl and Kg; and (2) \(Kl \le Kg\).

Fig. 8 The clustering performance with different Kl and Kg on the WebKB dataset: a AC and b NMI

5 Conclusion

In this paper, we developed GLDA as an extension of the LDA model. The highlight of GLDA is that it organizes topics into groups to capture local semantics, and introduces global topics to cover the background semantics. In contrast to existing techniques, GLDA considers the relationships between local and global topics with respect to the different groups. We developed a variational inference algorithm to model offline corpora, and further derived an online learning algorithm for GLDA to handle large-scale collections and true online data.

We used extensive experiments to evaluate the proposed GLDA model. We compared the topic modeling performance to traditional topic models for both offline and online cases. We also evaluated GLDA for document clustering. Our experimental results demonstrated that GLDA can achieve a state-of-the-art topic modeling performance, and also has a competitive clustering performance when compared with state-of-the-art clustering approaches.

In the future, we hope to develop extensions of GLDA using nonparametric methods, which can adaptively determine the number of groups and topics. It may also be useful to apply GLDA to basic tasks such as classification and sentiment analysis.