
Short text topic modeling by exploring original documents

Regular Paper · Knowledge and Information Systems

Abstract

Topic modeling for short texts faces a tough challenge owing to the sparsity problem. An effective solution is to aggregate short texts into long pseudo-documents before training a standard topic model. The main concern with this solution is how the short texts are aggregated. A recently developed self-aggregation-based topic model (SATM) can adaptively aggregate short texts without using heuristic information. However, the model definition of SATM is somewhat rigid, and more importantly, it tends to overfit and is time-consuming on large-scale corpora. To improve on SATM, we propose a generalized topic model for short texts, namely the latent topic model (LTM). In LTM, we assume that the observable short texts are snippets of normal long texts (namely original documents) generated by a given standard topic model, but that their original document memberships are unknown. With Gibbs sampling, LTM drives an adaptive aggregation process for short texts and simultaneously estimates the other latent variables of interest. Additionally, we propose a mini-batch scheme for fast inference. Experimental results indicate that LTM is competitive with state-of-the-art baseline models on short text topic modeling.


Notes

  1. http://jwebpro.sourceforge.net/data-web-snippets.tar.gz.

  2. http://web.ist.utl.pt/~acardoso/datasets/.

  3. http://papers.nips.cc/.

  4. http://code.google.com/p/btm/.

  5. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.


Download references

Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Numbers 61602204 and 61472157.

Author information

Corresponding author

Correspondence to Jihong Ouyang.

Appendix

A. Derivation of Gibbs sampling for LTM

We derive the Gibbs sampling equations for LTM. In LTM, there are four latent variables of interest: the topic-word distributions \(\phi \), the original document-topic distributions \(\theta \), the original document assignments for short texts \({\hat{z}}\) and the topic assignments for word tokens z. Thanks to the conjugate Dirichlet-multinomial design for \(\phi \) and \(\theta \), we can marginalize out these two variables and work with the joint distribution of the observations W and the two assignment variables \({\hat{z}}\) and z as follows:

$$\begin{aligned}&p\left( {W,{\hat{z}},z|\beta ,\alpha } \right) = \int {\int {p\left( {W,\phi ,\theta ,{\hat{z}},z|\beta ,\alpha } \right) \hbox {d}\phi } \hbox {d}\theta } \nonumber \\&\quad = \int {\int {\prod \limits _{k = 1}^K {Dir(\phi _k|\beta )} \prod \limits _{d = 1}^D {Dir(\theta _d|\alpha )} \prod \limits _{k = 1}^K {\prod \limits _{v = 1}^V {\phi _{kv}^{{N_{kv}}}} } \prod \limits _{d = 1}^D {\prod \limits _{k = 1}^K {\theta _{dk}^{{N_{dk}}}} } \hbox {d}\phi } \hbox {d}\theta } \nonumber \\&\quad = \left( {\prod \limits _{k = 1}^K {\frac{{\prod \nolimits _{v = 1}^V {\varGamma \left( {{N_{kv}} + \beta } \right) } }}{{\varGamma \left( {{N_k} + V\beta } \right) }}\frac{{\varGamma \left( {V\beta } \right) }}{{\prod \nolimits _{v = 1}^V {\varGamma \left( \beta \right) } }}} } \right) \left( {\prod \limits _{d = 1}^D {\frac{{\prod \nolimits _{k = 1}^K {\varGamma \left( {{N_{dk}} + \alpha } \right) } }}{{\varGamma \left( {{N_d} + K\alpha } \right) }}\frac{{\varGamma \left( {K\alpha } \right) }}{{\prod \nolimits _{k = 1}^K {\varGamma \left( \alpha \right) } }}} } \right) \nonumber \\&\qquad \propto \left( {\prod \limits _{k = 1}^K {\frac{{\prod \nolimits _{v = 1}^V {\varGamma \left( {{N_{kv}} + \beta } \right) } }}{{\varGamma \left( {{N_k} + V\beta } \right) }}} } \right) \left( {\prod \limits _{d = 1}^D {\frac{{\prod \nolimits _{k = 1}^K {\varGamma \left( {{N_{dk}} + \alpha } \right) } }}{{\varGamma \left( {{N_d} + K\alpha } \right) }}} } \right) \nonumber \\&\quad \buildrel \varDelta \over = B\left( {{N_{kv}},{N_k},\beta } \right) B\left( {{N_{dk}},{N_d},\alpha } \right) \end{aligned}$$
(11)

where the fourth line in Eq. (11) follows from the fact that \(\beta \) and \(\alpha \) are constants; the notations \(B\left( {{N_{kv}},{N_k},\beta } \right) \) and \(B\left( {{N_{dk}},{N_d},\alpha } \right) \) are used to denote \(\prod \nolimits _{k = 1}^K {\frac{{\prod \nolimits _{v = 1}^V {\varGamma \left( {{N_{kv}} + \beta } \right) } }}{{\varGamma \left( {{N_k} + V\beta } \right) }}} \) and \({\prod \nolimits _{d = 1}^D {\frac{{\prod \nolimits _{k = 1}^K {\varGamma \left( {{N_{dk}} + \alpha } \right) } }}{{\varGamma \left( {{N_d} + K\alpha } \right) }}} }\), respectively, for convenience.
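For concreteness, the collapsed joint in Eq. (11) can be evaluated in log space directly from the count statistics. The following is a minimal Python sketch under the assumption that the topic-word and original document-topic counts are held in NumPy arrays N_kv and N_dk; the function name and array layout are illustrative, not part of the paper's implementation.

import numpy as np
from scipy.special import gammaln

def log_collapsed_joint(N_kv, N_dk, alpha, beta):
    # log B(N_kv, N_k, beta) + log B(N_dk, N_d, alpha), i.e. Eq. (11) up to an additive constant
    V = N_kv.shape[1]                 # vocabulary size
    K = N_dk.shape[1]                 # number of topics
    N_k = N_kv.sum(axis=1)            # tokens assigned to each topic
    N_d = N_dk.sum(axis=1)            # tokens assigned to each original document
    log_B_topic = (gammaln(N_kv + beta).sum(axis=1) - gammaln(N_k + V * beta)).sum()
    log_B_doc = (gammaln(N_dk + alpha).sum(axis=1) - gammaln(N_d + K * alpha)).sum()
    return log_B_topic + log_B_doc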

We employ a blocked Gibbs sampling framework, in which \({\hat{z}}\) and z are alternately sampled in each iteration, each conditioned on the other. We derive the Gibbs sampling equations for \({\hat{z}}\) and z in turn.
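The alternation can be pictured with the short Python skeleton below. This is a sketch only: the two callables are assumed to implement the per-variable updates of Eqs. (14) and (16) derived next (including the bookkeeping that removes and re-adds the affected counts around each draw), and all names are illustrative rather than the authors' code.

def blocked_gibbs(z_hat, z, n_iters, resample_doc_assignment, resample_topic_assignment):
    # z_hat[s] : original document assignment of short text s
    # z[s][n]  : topic assignment of the n-th token of short text s
    for _ in range(n_iters):
        for s in range(len(z_hat)):            # Step 1: original document assignments (Eq. 14)
            z_hat[s] = resample_doc_assignment(s, z_hat, z)
        for s in range(len(z)):                # Step 2: topic assignments, as in collapsed LDA (Eq. 16)
            for n in range(len(z[s])):
                z[s][n] = resample_topic_assignment(s, n, z_hat, z)
    return z_hat, z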

Sampling equation of \({\hat{z}}\) given the current z. To draw a sample of \({\hat{z}}\), following the Gibbs sampling idea, we consecutively draw each single original document assignment \({\hat{z}}_s\) from its posterior conditioned on all other original document assignments \({\hat{z}}^{-s}\):

$$\begin{aligned}&p\left( {{{{\hat{z}}}_s}|{{{\hat{z}}}^{ - s}},z,W, \beta ,\alpha } \right) = \frac{{p\left( {W,{\hat{z}}|z,\beta ,\alpha } \right) }}{{p\left( {W,{{{\hat{z}}}^{ - s}}|z,\beta ,\alpha } \right) }} \propto \frac{{p\left( {W,{\hat{z}}|z,\beta ,\alpha } \right) }}{{p\left( {{W^{ - s}},{{{\hat{z}}}^{ - s}}|z,\beta ,\alpha } \right) }} \nonumber \\&\quad = \frac{{p\left( {W,{\hat{z}},z|\beta ,\alpha } \right) }}{{p\left( {{W^{ - s}},{{{\hat{z}}}^{ - s}},z|\beta ,\alpha } \right) }} \propto \frac{{p\left( {W,{\hat{z}},z|\beta ,\alpha } \right) }}{{p\left( {{W^{ - s}},{{{\hat{z}}}^{ - s}},{z^{ - s}}|\beta ,\alpha } \right) }} \end{aligned}$$
(12)

By combining Eqs. (11) and (12), we have:

$$\begin{aligned}&p\left( {{{{\hat{z}}}_s} = d|{{{\hat{z}}}^{ - s}},z,W,\beta ,\alpha } \right) \propto \frac{{B\left( {{N_{kv}},{N_k},\beta } \right) B\left( {{N_{dk}},{N_d},\alpha } \right) }}{{B\left( {N_{kv}^{ - s},N_k^{ - s},\beta } \right) B\left( {N_{dk}^{ - s},N_d^{ - s},\alpha } \right) }} \nonumber \\&\quad \propto \frac{{B\left( {{N_{dk}},{N_d},\alpha } \right) }}{{B\left( {N_{dk}^{ - s},N_d^{ - s},\alpha } \right) }} \end{aligned}$$
(13)

The second line in Eq. (13) follows from the fact that the terms involving topic-word counts are independent of the current assignment \({\hat{z}}_s\). We then expand Eq. (13) and obtain the final Gibbs sampling equation for \({\hat{z}}\) (i.e., Eq. 1):

$$\begin{aligned}&p\left( {{{{\hat{z}}}_s} = d|{{{\hat{z}}}^{ - s}},z,W,\alpha } \right) \propto \frac{{\prod \nolimits _{j = 1}^D {\frac{{\prod \nolimits _{k = 1}^K {\varGamma \left( {{N_{jk}} + \alpha } \right) } }}{{\varGamma \left( {{N_j} + K\alpha } \right) }}} }}{{\frac{{\prod \nolimits _{k = 1}^K {\varGamma \left( {N_{dk}^{ - s} + \alpha } \right) } }}{{\varGamma \left( {N_d^{ - s} + K\alpha } \right) }}\prod \nolimits _{j \ne d} {\frac{{\prod \nolimits _{k = 1}^K {\varGamma \left( {{N_{jk}} + \alpha } \right) } }}{{\varGamma \left( {{N_j} + K\alpha } \right) }}} }} \nonumber \\&\quad = \frac{{\prod \nolimits _{k = 1}^K {\varGamma \left( {{N_{dk}} + \alpha } \right) } }}{{\prod \nolimits _{k = 1}^K {\varGamma \left( {N_{dk}^{ - s} + \alpha } \right) } }}\frac{{\varGamma \left( {N_d^{ - s} + K\alpha } \right) }}{{\varGamma \left( {{N_d} + K\alpha } \right) }} \nonumber \\&\quad = \frac{{\prod \nolimits _{k = 1}^K {\prod \nolimits _{n = 1}^{{N_{sk}}} {\left( {N_{dk}^{ - s} + n - 1 + \alpha } \right) } } }}{{\prod \nolimits _{n = 1}^{{N_s}} {\left( {N_d^{ - s} + n - 1 + K\alpha } \right) } }} \end{aligned}$$
(14)

The third line in Eq. (14) follows from the fact that (for \(m>n\)):

$$\begin{aligned} \frac{{\varGamma \left( n \right) }}{{\varGamma \left( m \right) }} = \frac{{\varGamma \left( n \right) }}{{\varGamma \left( {n + 1} \right) }}\frac{{\varGamma \left( {n + 1} \right) }}{{\varGamma \left( {n + 2} \right) }}\frac{{\varGamma \left( {n + 2} \right) }}{{\varGamma \left( {n + 3} \right) }} \cdots \frac{{\varGamma \left( {m - 1} \right) }}{{\varGamma \left( m \right) }} = \frac{1}{{\prod \nolimits _{i = 1}^{m - n} {\left( {n + i - 1} \right) } }} \end{aligned}$$
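A count-based Python sketch of drawing \({\hat{z}}_s\) according to Eq. (14) is given below. It is illustrative only: the count arrays are assumed to already exclude short text s (i.e., they are the \(^{-s}\) statistics), and the products of Eq. (14) are accumulated in log space to avoid numerical underflow for long pseudo-documents.

import numpy as np

def sample_doc_assignment_counts(N_sk, N_dk_minus_s, N_d_minus_s, alpha, rng):
    # N_sk         : (K,)   topic counts of the words in short text s
    # N_dk_minus_s : (D, K) topic counts of each original document, excluding short text s
    # N_d_minus_s  : (D,)   token counts of each original document, excluding short text s
    D, K = N_dk_minus_s.shape
    N_s = int(N_sk.sum())
    log_p = np.zeros(D)
    for d in range(D):
        for k in range(K):
            for n in range(1, int(N_sk[k]) + 1):       # numerator of Eq. (14)
                log_p[d] += np.log(N_dk_minus_s[d, k] + n - 1 + alpha)
        for n in range(1, N_s + 1):                    # denominator of Eq. (14)
            log_p[d] -= np.log(N_d_minus_s[d] + n - 1 + K * alpha)
    p = np.exp(log_p - log_p.max())                    # exponentiate stably and normalize
    return rng.choice(D, p=p / p.sum())

After drawing the new original document for short text s, its topic counts N_sk are added back to the counts of the chosen document before moving on to the next short text.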

Sampling equation of z given the current \({\hat{z}}\). We now turn to the sampling of z. Note that given \({\hat{z}}\), the situation is exactly equivalent to Gibbs sampling in standard LDA. Similar to the derivations of Eqs. (12) and (13), the posterior of a single topic assignment \(z_{dn}\) conditioned on all other topic assignments \(z^{-dn}\) is given by:

$$\begin{aligned}&p\left( {{z_{dn}}|{\hat{z}},{z^{ - dn}},W,\beta ,\alpha } \right) \propto \frac{{p\left( {W,{\hat{z}},z|\beta ,\alpha } \right) }}{{p\left( {{W^{ - dn}},{\hat{z}},{z^{ - dn}}|\beta ,\alpha } \right) }} \nonumber \\&\quad = \frac{{B\left( {{N_{kv}},{N_k},\beta } \right) B\left( {{N_{dk}},{N_d},\alpha } \right) }}{{B\left( {N_{kv}^{ - dn},N_k^{ - dn},\beta } \right) B\left( {N_{dk}^{ - dn},N_d^{ - dn},\alpha } \right) }} \end{aligned}$$
(15)

By expanding Eq. (15), we derive the final Gibbs sampling equation for z (i.e., Eq. 2):

$$\begin{aligned}&p\left( {{z_{dn}} = k|{\hat{z}},{z^{ - dn}},W,\beta ,\alpha } \right) \propto \frac{{\frac{{\varGamma \left( {{N_{k{w_{dn}}}} + \beta } \right) }}{{\varGamma \left( {{N_k} + V\beta } \right) }}\frac{{\varGamma \left( {{N_{dk}} + \alpha } \right) }}{{\varGamma \left( {{N_d} + K\alpha } \right) }}}}{{\frac{{\varGamma \left( {N_{k{w_{dn}}}^{ - dn} + \beta } \right) }}{{\varGamma \left( {N_k^{ - dn} + V\beta } \right) }}\frac{{\varGamma \left( {N_{dk}^{ - dn} + \alpha } \right) }}{{\varGamma \left( {N_d^{ - dn} + K\alpha } \right) }}}} \nonumber \\&\quad = \frac{{N_{k{w_{dn}}}^{ - dn} + \beta }}{{N_k^{ - dn} + V\beta }}\frac{{N_{dk}^{ - dn} + \alpha }}{{N_d^{ - dn} + K\alpha }} \nonumber \\&\qquad \propto \frac{{N_{k{w_{dn}}}^{ - dn} + \beta }}{{N_k^{ - dn} + V\beta }}\left( {N_{dk}^{ - dn} + \alpha } \right) \end{aligned}$$
(16)

The third line in Eq. (16) follows from the fact that the term \((N_d^{ - dn} + K\alpha )\) is independent of the current topic assignment \(z_{dn}\).
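This per-token step is the familiar collapsed LDA update. A minimal Python sketch follows; as before, the count arrays are assumed to already exclude the current token (i.e., they are the \(^{-dn}\) statistics), and the names are illustrative.

import numpy as np

def sample_topic_assignment_counts(w, d, N_kw, N_k, N_dk, alpha, beta, rng):
    # w    : word id of the current token       d   : its original document
    # N_kw : (K, V) topic-word counts           N_k : (K,) topic totals
    # N_dk : (D, K) original document-topic counts
    K, V = N_kw.shape
    p = (N_kw[:, w] + beta) / (N_k + V * beta) * (N_dk[d] + alpha)   # Eq. (16), unnormalized
    return rng.choice(K, p=p / p.sum())

Alternating these two per-variable updates, together with the bookkeeping that removes and re-adds the affected counts around each draw, yields the full blocked sampler sketched earlier.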

Cite this article

Li, X., Li, C., Chi, J. et al. Short text topic modeling by exploring original documents. Knowl Inf Syst 56, 443–462 (2018). https://doi.org/10.1007/s10115-017-1099-0
