
A topic model for co-occurring normal documents and short texts


Abstract

User comments, a large group of online short texts, are becoming increasingly prevalent with the development of online communications. These short texts are characterized by their co-occurrence with usually lengthier normal documents. For example, multiple user comments may follow one news article, or multiple reader reviews may follow one blog post. The co-occurring structure inherent in such text corpora is important for efficient learning of topics, but is rarely captured by conventional topic models. To capture such structure, we propose a topic model for co-occurring documents, referred to as COTM. In COTM, we assume there are two sets of topics: formal topics and informal topics, where formal topics can appear in both normal documents and short texts whereas informal topics can only appear in short texts. Each normal document has a probability distribution over the set of formal topics; each short text is composed of two topics, one from the set of formal topics, whose selection is governed by the topic probabilities of the corresponding normal document, and the other from the set of informal topics. We also develop an online algorithm for COTM to deal with large-scale corpora. Extensive experiments on real-world datasets demonstrate that COTM and its online algorithm outperform state-of-the-art methods by discovering more prominent, coherent and comprehensive topics.
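
To make the generative structure described above concrete, the following is a minimal, hypothetical sketch of such a generative process in Python. The hyperparameter names (alpha, beta, gamma, eps), the per-word switch between a comment's formal and informal topic, and all sizes are illustrative assumptions, not the paper's exact specification.

```python
# Illustrative sketch only: a toy generative process in the spirit of COTM.
import numpy as np

rng = np.random.default_rng(0)
K, J, V = 5, 3, 50            # formal topics, informal topics, vocabulary size (assumed)
alpha, beta, gamma, eps = 0.1, 0.01, 1.0, 0.1   # illustrative hyperparameters

phi = rng.dirichlet(beta * np.ones(V), size=K)  # formal topic-word distributions
psi = rng.dirichlet(beta * np.ones(V), size=J)  # informal topic-word distributions
xi = rng.dirichlet(eps * np.ones(J))            # corpus-level informal topic proportions

def generate_document(n_words=100, n_comments=4, comment_len=15):
    theta = rng.dirichlet(alpha * np.ones(K))   # formal topic mixture of the normal document
    doc = [rng.choice(V, p=phi[rng.choice(K, p=theta)]) for _ in range(n_words)]
    comments = []
    for _ in range(n_comments):
        z = rng.choice(K, p=theta)              # formal topic, inherited via theta
        x = rng.choice(J, p=xi)                 # informal topic of the comment
        p = rng.beta(gamma, gamma)              # per-comment weight between the two topics
        words = [rng.choice(V, p=phi[z]) if rng.random() < p else rng.choice(V, p=psi[x])
                 for _ in range(comment_len)]
        comments.append(words)
    return doc, comments

doc, comments = generate_document()
print(len(doc), [len(c) for c in comments])
```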



Acknowledgements

This work is funded by the State Key Development Program of Basic Research of China (973) under Grant No. 2013cb329600 and by the National Natural Science Foundation of China under Grant Nos. 61672050, 61372191, 61472433 and 61572492.

Author information


Corresponding author

Correspondence to Feifei Wang.

Appendix: Details of deriving the collapsed Gibbs sampling algorithm

Given the full posterior distribution in (1), we can easily get the full conditional posterior distributions for Θ, Φ, Ψ, ξ and P.

For \(\boldsymbol{\theta}_{d}\), \(d \in \{1, 2, \ldots, D\}\), its full conditional posterior distribution is:

$$ \begin{array}{ll} f(\boldsymbol{\theta_{d}}\mid \cdot) \propto \prod\limits_{k=1}^{K}(\theta_{dk})^{l_{dk}^{(1)}+g_{dk}^{(1)}+\alpha-1}. \end{array} $$
(17)

For \(\boldsymbol{\phi}_{k}\), \(k \in \{1, 2, \ldots, K\}\), its full conditional posterior distribution is:

$$ \begin{array}{ll} f(\boldsymbol{\phi_{k}}\mid \cdot) \propto \prod\limits_{v=1}^{V}(\phi_{kv})^{l_{kv}^{(2)}+g_{kv}^{(2)}+\beta-1}. \end{array} $$
(18)

For \(\boldsymbol{\psi}_{j}\), \(j \in \{1, 2, \ldots, J\}\), its full conditional posterior distribution is:

$$ \begin{array}{ll} f(\boldsymbol{\psi_{j}}\mid \cdot) \propto \prod\limits_{v=1}^{V}(\psi_{jv})^{g_{jv}^{(3)}+\beta-1}. \end{array} $$
(19)

For ξ, its full conditional posterior distribution is:

$$ \begin{array}{ll} f(\boldsymbol{\xi}\mid \cdot) \propto \prod\limits_{j=1}^{J}(\xi_{j})^{h_{j}+\epsilon-1}. \end{array} $$
(20)

For \(p_{dc}\), \(d \in \{1, 2, \ldots, D\}\), \(c \in \{1, 2, \ldots, C_{d}\}\), its full conditional posterior distribution is:

$$ \begin{array}{ll} f(p_{dc}\mid \cdot)\propto (p_{dc})^{s_{dc}^{(1)}+\gamma-1}(1-p_{dc})^{s_{dc}^{(2)}+\gamma-1}. \end{array} $$
(21)
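
Each of the conditionals (17)-(21) is a standard Dirichlet or Beta distribution, so in a non-collapsed Gibbs sweep the parameters could be drawn from them directly. The fragment below is a minimal sketch under that assumption; the count-array names (l1, g1, l2, g2, g3, h, s1, s2) are hypothetical stand-ins for the counts appearing in the exponents above.

```python
import numpy as np

def sample_parameters(rng, l1, g1, l2, g2, g3, h, s1, s2, alpha, beta, gamma, eps):
    """Draw Theta, Phi, Psi, xi and P from the conditionals (17)-(21), given
    hypothetical count arrays: l1, g1 are D x K; l2, g2 are K x V; g3 is J x V;
    h has length J; s1, s2 are ragged per-document lists of per-comment counts."""
    # (17): theta_d ~ Dirichlet(l1[d] + g1[d] + alpha)
    theta = np.array([rng.dirichlet(l1[d] + g1[d] + alpha) for d in range(l1.shape[0])])
    # (18): phi_k ~ Dirichlet(l2[k] + g2[k] + beta)
    phi = np.array([rng.dirichlet(l2[k] + g2[k] + beta) for k in range(l2.shape[0])])
    # (19): psi_j ~ Dirichlet(g3[j] + beta)
    psi = np.array([rng.dirichlet(g3[j] + beta) for j in range(g3.shape[0])])
    # (20): xi ~ Dirichlet(h + eps)
    xi = rng.dirichlet(h + eps)
    # (21): p_dc ~ Beta(s1[d][c] + gamma, s2[d][c] + gamma)
    p = [[rng.beta(s1[d][c] + gamma, s2[d][c] + gamma) for c in range(len(s1[d]))]
         for d in range(len(s1))]
    return theta, phi, psi, xi, p
```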

Noting that the conditional posterior distributions of Θ, Φ, Ψ and ξ are Dirichlet and those of P are Beta, all conjugate with their priors, we can develop a collapsed Gibbs sampling algorithm by integrating these parameters out of the posterior distribution.

To describe this procedure, we start by introducing the Dirichlet distribution. Suppose \(\boldsymbol{X} = (X_{1},\ldots,X_{K})^{T}\) follows a Dirichlet distribution with parameter \(\boldsymbol{\alpha} = (\alpha_{1},\ldots,\alpha_{K})^{T}\). The probability density function of \(\boldsymbol{X}\) is

$$ \begin{array}{ll} f(\boldsymbol{X}|\boldsymbol{\alpha})=f(X_{1},...,X_{K}|\alpha_{1},...,\alpha_{K})=\frac{\Gamma\left( {\sum}_{i=1}^{K} \alpha_{i}\right)}{{\prod}_{i=1}^{K} {\Gamma}(\alpha_{i})}{\prod}_{i=1}^{K} X_{i}^{\alpha_{i}-1}. \end{array} $$
(22)

Since the integral of \(f(\boldsymbol{X}\mid\boldsymbol{\alpha})\) over the simplex equals 1, we have

$$ \begin{array}{ll} &\int \left\{{\prod}_{i=1}^{K} X_{i}^{\alpha_{i}-1}\right\} dX_{1}...dX_{K} = \frac{{\prod}_{i=1}^{K} {\Gamma}(\alpha_{i})}{\Gamma\left( {\sum}_{i=1}^{K} \alpha_{i}\right)}. \end{array} $$
(23)
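
Identity (23) can also be checked numerically. The sketch below (assuming numpy and scipy are available) compares a Monte Carlo estimate of the left-hand side, obtained by sampling uniformly on the simplex, with the closed form on the right-hand side.

```python
# Numerical check of identity (23) for an example parameter vector.
import numpy as np
from math import factorial
from scipy.special import gammaln

rng = np.random.default_rng(1)
a = np.array([2.0, 3.0, 1.5])      # example parameters alpha_1, ..., alpha_K
K = len(a)

# Right-hand side of (23), evaluated in log space for numerical stability.
rhs = np.exp(gammaln(a).sum() - gammaln(a.sum()))

# Left-hand side: the uniform Dirichlet(1,...,1) has density (K-1)! on the simplex,
# so the integral equals E_uniform[prod_i X_i^(alpha_i - 1)] / (K-1)!.
x = rng.dirichlet(np.ones(K), size=2_000_000)
lhs = np.prod(x ** (a - 1.0), axis=1).mean() / factorial(K - 1)

print(lhs, rhs)   # the two values should agree closely
```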

Similarly, since the conditional posterior distribution of each \(\boldsymbol{\theta}_{d}\) is Dirichlet, as described in (17), we can integrate \(\boldsymbol{\Theta}\) out and obtain:

$$ \begin{array}{ll} \int f(\boldsymbol{\Theta}\mid \cdot) d\boldsymbol{\Theta} &= {\prod}_{d=1}^{D} \int f(\boldsymbol{\theta_{d}}\mid \cdot) d\boldsymbol{\theta_{d}}\\ &\propto {\prod}_{d=1}^{D}\int \left\{\prod\limits_{k=1}^{K}(\theta_{dk})^{l_{dk}^{(1)}+g_{dk}^{(1)}+\alpha-1}\right\} d\theta_{d1}...d\theta_{dK} ={\prod}_{d=1}^{D}\frac{{\prod}_{k=1}^{K} {\Gamma}\left( l_{dk}^{(1)}+g_{dk}^{(1)}+\alpha\right)}{\Gamma\left\{{\sum}_{k=1}^{K} \left( l_{dk}^{(1)}+g_{dk}^{(1)}+\alpha\right)\right\}}. \end{array} $$
(24)

Then we integrate out Φ, Ψ, ξ and P similarly and get the following results:

$$ \begin{array}{ll} \int f(\boldsymbol{\Phi}\mid \cdot) d\boldsymbol{\Phi} &= {\prod}_{k=1}^{K} \int f(\boldsymbol{\phi_{k}}\mid \cdot) d\boldsymbol{\phi_{k}} \propto {\prod}_{k=1}^{K}\frac{{\prod}_{v=1}^{V} {\Gamma}\left( l_{kv}^{(2)}+g_{kv}^{(2)}+\beta\right)}{\Gamma\left\{{\sum}_{v=1}^{V} \left( l_{kv}^{(2)}+g_{kv}^{(2)}+\beta\right)\right\}}\\ \int f(\boldsymbol{\Psi}\mid \cdot) d\boldsymbol{\Psi} &= {\prod}_{j=1}^{J} \int f(\boldsymbol{\psi_{j}}\mid \cdot) d\boldsymbol{\psi_{j}} \propto {\prod}_{j=1}^{J}\frac{{\prod}_{v=1}^{V} {\Gamma}\left( g_{jv}^{(3)}+\beta\right)}{\Gamma\left\{{\sum}_{v=1}^{V} \left( g_{jv}^{(3)}+\beta\right)\right\}}\\ \int f(\boldsymbol{\xi}\mid \cdot) d\boldsymbol{\xi} &\propto \frac{{\prod}_{j=1}^{J} {\Gamma}(h_{j}+\epsilon)}{\Gamma\left\{{\sum}_{j=1}^{J} (h_{j}+\epsilon)\right\}}\\ \int f(\boldsymbol{P}\mid \cdot) d\boldsymbol{P} &= {\prod}_{d=1}^{D}{\prod}_{c=1}^{C_{d}} \int f(p_{dc}\mid \cdot) dp_{dc} \propto {\prod}_{d=1}^{D}{\prod}_{c=1}^{C_{d}}\frac{\Gamma\left( s_{dc}^{(1)}+\gamma\right){\Gamma}\left( s_{dc}^{(2)}+\gamma\right)}{\Gamma\left( s_{dc}^{(1)}+s_{dc}^{(2)}+2\gamma\right)} \end{array} $$
(25)

By integrating out Θ, Φ, Ψ, ξ and P, the full posterior distribution in (1) can be simplified as:

$$ \begin{array}{ll} &f(\boldsymbol{z},\boldsymbol{b},\boldsymbol{x},\boldsymbol{y} \mid \boldsymbol{w},\alpha,\beta,\gamma,\epsilon)\\ =&\int f(\boldsymbol{z},\boldsymbol{b},\boldsymbol{P},\boldsymbol{x},\boldsymbol{y},\boldsymbol{\Theta},\boldsymbol{\Phi},\boldsymbol{\Psi},\boldsymbol{\xi} \mid \boldsymbol{w},\alpha,\beta,\gamma,\epsilon) d\boldsymbol{\Theta} d\boldsymbol{\Phi} d\boldsymbol{\Psi} d\boldsymbol{\xi} d\boldsymbol{P} \\ \propto &{\prod}_{d=1}^{D}\frac{{\prod}_{k=1}^{K} {\Gamma}\left( l_{dk}^{(1)}+g_{dk}^{(1)}+\alpha\right)}{\Gamma\left\{{\sum}_{k=1}^{K} \left( l_{dk}^{(1)}+g_{dk}^{(1)}+\alpha\right)\right\}}\times {\prod}_{k=1}^{K}\frac{{\prod}_{v=1}^{V} {\Gamma}\left( l_{kv}^{(2)}+g_{kv}^{(2)}+\beta\right)}{\Gamma\left\{{\sum}_{v=1}^{V} \left( l_{kv}^{(2)}+g_{kv}^{(2)}+\beta\right)\right\}}\\ \times &{\prod}_{j=1}^{J}\frac{{\prod}_{v=1}^{V} {\Gamma}\left( g_{jv}^{(3)}+\beta\right)}{\Gamma\left\{{\sum}_{v=1}^{V} \left( g_{jv}^{(3)}+\beta\right)\right\}} \times \frac{{\prod}_{j=1}^{J} {\Gamma}(h_{j}+\epsilon)}{\Gamma\left\{{\sum}_{j=1}^{J} (h_{j}+\epsilon)\right\}} \times {\prod}_{d=1}^{D}{\prod}_{c=1}^{C_{d}}\frac{\Gamma\left( s_{dc}^{(1)}+\gamma\right){\Gamma}\left( s_{dc}^{(2)}+\gamma\right)}{\Gamma\left( s_{dc}^{(1)}+s_{dc}^{(2)}+2\gamma\right)}. \end{array} $$
(26)
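
For numerical work it is convenient to evaluate (26) on the log scale. The sketch below computes the log of this collapsed joint, up to an additive constant, from hypothetical count arrays named as in the earlier sketch; it is an illustration, not the paper's implementation.

```python
from scipy.special import gammaln

def collapsed_log_joint(l1, g1, l2, g2, g3, h, s1, s2, alpha, beta, gamma, eps):
    """Log of the right-hand side of (26), up to an additive constant, from
    hypothetical numpy count arrays: l1, g1 are D x K; l2, g2 are K x V;
    g3 is J x V; h has length J; s1, s2 are ragged per-comment counts."""
    lp = 0.0
    # Document / formal-topic block: prod_d prod_k Gamma(.) / Gamma(sum_k .)
    lp += gammaln(l1 + g1 + alpha).sum() - gammaln((l1 + g1 + alpha).sum(axis=1)).sum()
    # Formal topic / word block
    lp += gammaln(l2 + g2 + beta).sum() - gammaln((l2 + g2 + beta).sum(axis=1)).sum()
    # Informal topic / word block
    lp += gammaln(g3 + beta).sum() - gammaln((g3 + beta).sum(axis=1)).sum()
    # Informal topic proportion block
    lp += gammaln(h + eps).sum() - gammaln((h + eps).sum())
    # Per-comment Beta block
    for d in range(len(s1)):
        for c in range(len(s1[d])):
            lp += (gammaln(s1[d][c] + gamma) + gammaln(s2[d][c] + gamma)
                   - gammaln(s1[d][c] + s2[d][c] + 2 * gamma))
    return lp
```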

Thus, we can use collapsed Gibbs sampling and only need to update z, x, y and b in each iteration. We then derive the conditional posterior distributions of z, x, y and b from (26).

Specifically, for the nth word in normal document d, \(z_{dn}=k\) only influences \(l_{dk}^{(1)}\) and \(l_{kw_{dn}}^{(2)}\) in (26). Let \(\boldsymbol{z}_{-dn}\) denote \(\boldsymbol{z}\) excluding \(z_{dn}\); then the full conditional distribution of \(z_{dn}\) can be derived as:

$$ \begin{array}{lll} &&f(z_{dn}=k \mid \cdot)=\frac{f(z_{dn}=k,\boldsymbol{z}_{-dn} \mid \cdot)}{f(\boldsymbol{z}_{-dn} \mid \cdot)}\\ &&\propto\frac{\Gamma\left( l_{dk}^{(1)}+g_{dk}^{(1)}+\alpha\right)} {\Gamma\left( l_{dk;-dn}^{(1)}+g_{dk}^{(1)}+\alpha\right)}/ \frac{\Gamma\left\{{\sum}_{k' \neq k} \left( l_{dk'}^{(1)}+g_{dk'}^{(1)}+\alpha\right)+\left( l_{dk}^{(1)}+g_{dk}^{(1)}+\alpha\right)\right\}} {\Gamma\left\{{\sum}_{k' \neq k} \left( l_{dk'}^{(1)}+g_{dk'}^{(1)}+\alpha\right)+\left( l_{dk;-dn}^{(1)}+g_{dk}^{(1)}+\alpha\right)\right\}}\\ &&\times\frac{\Gamma\left( l_{kw_{dn}}^{(2)}+g_{kw_{dn}}^{(2)}+\beta\right)}{\Gamma\left( l_{kw_{dn};-dn}^{(2)}+g_{kw_{dn}}^{(2)}+\beta\right)} / \frac{\Gamma\left\{{\sum}_{v\neq w_{dn}}\left( l_{kv}^{(2)}+g_{kv}^{(2)}+\beta\right)+\left( l_{kw_{dn}}^{(2)}+g_{kw_{dn}}^{(2)}+\beta\right)\right\}} {\Gamma\left\{{\sum}_{v\neq w_{dn}}\left( l_{kv}^{(2)}+g_{kv}^{(2)}+\beta\right)+\left( l_{kw_{dn};-dn}^{(2)}+g_{kw_{dn}}^{(2)}+\beta\right)\right\}}\\ &&\propto\frac{\Gamma\left( l_{dk;-dn}^{(1)}+1+g_{dk}^{(1)}+\alpha\right)} {\Gamma\left( l_{dk;-dn}^{(1)}+g_{dk}^{(1)}+\alpha\right)}/ \frac{\Gamma\left( l_{d\cdot;-dn}^{(1)}+1+g_{d\cdot}^{(1)}+K\alpha\right)} {\Gamma\left( l_{d\cdot;-dn}^{(1)}+g_{d\cdot}^{(1)}+K\alpha\right)}\\ &&\times\frac{\Gamma\left( l_{kw_{dn};-dn}^{(2)}+1+g_{kw_{dn}}^{(2)}+\beta\right)}{\Gamma\left( l_{kw_{dn};-dn}^{(2)}+g_{kw_{dn}}^{(2)}+\beta\right)} / \frac{\Gamma\left( l_{k\cdot;-dn}^{(2)}+1+g_{k\cdot}^{(2)}+V\beta\right)} {\Gamma\left( l_{k\cdot;-dn}^{(2)}+g_{k\cdot}^{(2)}+V\beta\right)},\\ \end{array} $$
(27)

where the subscript “−dn” indicates counts excluding the nth word in normal document d, \(l_{d\cdot }^{(1)}\) and \(g_{d\cdot }^{(1)}\) are the sums of \(l_{dk}^{(1)}\) and \(g_{dk}^{(1)}\) over all formal topics k, and \(l_{k\cdot }^{(2)}\) and \(g_{k\cdot }^{(2)}\) are the sums of \(l_{kv}^{(2)}\) and \(g_{kv}^{(2)}\) over all words v. Note that \(l_{d\cdot }^{(1)}\) equals the total number of words in normal document d and \(g_{d\cdot }^{(1)}\) equals the total number of words in all short texts associated with normal document d, so \(l_{d\cdot }^{(1)}\) and \(g_{d\cdot }^{(1)}\) are constants and the factor involving them can be dropped from the proportionality. Using the property of the Γ function that Γ(x + 1) = xΓ(x), (27) can then be simplified as

$$\begin{array}{lll} f(z_{dn}=k \mid \cdot) \propto \left( l_{dk;-dn}^{(1)}+g_{dk}^{(1)}+\alpha\right) \times \frac{l_{kw_{dn};-dn}^{(2)}+g_{kw_{dn}}^{(2)}+\beta}{l_{k\cdot;-dn}^{(2)}+g_{k\cdot}^{(2)}+V\beta}. \end{array} $$
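
In an implementation, this simplified conditional is what a sampler evaluates for every word of every normal document in each sweep. The following sketch of that single update assumes the counts are kept in incrementally maintained numpy arrays with hypothetical names (l1, l2, l2_rowsum and their short-text counterparts g1, g2, g2_rowsum); it illustrates the formula above rather than reproducing the authors' code.

```python
import numpy as np

def resample_z_dn(rng, d, v, k_old, l1, l2, l2_rowsum, g1, g2, g2_rowsum,
                  alpha, beta, V):
    """One collapsed Gibbs update for z_dn (a word with vocabulary index v in
    normal document d). Hypothetical arrays: l1 is D x K, l2 is K x V,
    l2_rowsum has length K; g1, g2, g2_rowsum hold the short-text counts."""
    # Remove the current assignment of this word to obtain the "-dn" counts.
    l1[d, k_old] -= 1
    l2[k_old, v] -= 1
    l2_rowsum[k_old] -= 1

    # Unnormalized probabilities for each formal topic k, as in the formula above.
    probs = (l1[d] + g1[d] + alpha) * (l2[:, v] + g2[:, v] + beta) \
            / (l2_rowsum + g2_rowsum + V * beta)
    probs = np.asarray(probs, dtype=float)
    probs /= probs.sum()

    # Draw the new topic and add the word back into the counts.
    k_new = rng.choice(len(probs), p=probs)
    l1[d, k_new] += 1
    l2[k_new, v] += 1
    l2_rowsum[k_new] += 1
    return int(k_new)
```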

For b, x and y, we can derive their conditional posterior distributions from (26) in a similar way.


About this article


Cite this article

Yang, Y., Wang, F., Zhang, J. et al. A topic model for co-occurring normal documents and short texts. World Wide Web 21, 487–513 (2018). https://doi.org/10.1007/s11280-017-0467-8

