Two time-efficient gibbs sampling inference algorithms for biterm topic model

Zhou, Xiaotang; Ouyang, Jihong; Li, Ximing

doi:10.1007/s10489-017-1004-2

Published: 31 July 2017

Volume 48, pages 730–754, (2018)

Applied Intelligence

Xiaotang Zhou^1,2,
Jihong Ouyang^1,2 &
Ximing Li^1,2

Abstract

The Biterm Topic Model (BTM) is an effective topic model designed for short texts. However, its standard Gibbs sampling inference method (StdBTM) takes much more time than the standard Gibbs sampler (StdLDA) for Latent Dirichlet Allocation (LDA). To address this problem, we propose two time-efficient Gibbs sampling inference methods for BTM, SparseBTM and ESparseBTM, by trading off space against time consumption. SparseBTM reduces the computation in StdBTM by both recycling intermediate results and exploiting the sparsity of the count matrix \(\mathbf{N}^{\mathbf{T}}_{\mathbf{W}}\). Theoretically, SparseBTM reduces the time complexity of StdBTM from \(O(|B|K)\) to \(O(|B|K_w)\), which scales linearly with the sparsity of the count matrix (\(K_w\)) rather than with the number of topics (\(K\)); here \(K_w < K\), and \(K_w\) is the average number of non-zero topics per word type in \(\mathbf{N}^{\mathbf{T}}_{\mathbf{W}}\). Experimental results show that, under favorable conditions, SparseBTM is approximately 18 times faster than StdBTM. ESparseBTM builds on SparseBTM and is more time-efficient still: it saves further computation by rearranging the biterm sequence so that more intermediate results can be recycled. In theory, ESparseBTM reduces the time complexity of SparseBTM from \(O(|B|K_w)\) to \(O(R|B|K_w)\), where \(0 < R < 1\) is the ratio of the number of biterm types to the number of biterms. Experimental results show that ESparseBTM improves on the time efficiency of SparseBTM by between 6.4% and 39.5%, depending on the dataset.
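
The two quantities that drive the complexity bounds in the abstract, the sparsity \(K_w\) and the biterm-type ratio \(R\), can be computed directly from a corpus and a topic-word count matrix. The following is a minimal sketch, not the authors' implementation. It assumes the usual BTM definition of a biterm as an unordered pair of words co-occurring in one short text; the function names and the nested-list layout `topic_word_counts[k][w]` are hypothetical choices made for illustration.

```python
from itertools import combinations


def extract_biterms(doc):
    """Return every unordered word pair (biterm) in one short document.

    Assumption: a biterm is any pair of word positions in the same text,
    stored in sorted order so that (a, b) and (b, a) are the same type.
    """
    return [tuple(sorted(pair)) for pair in combinations(doc, 2)]


def biterm_type_ratio(biterms):
    """R: number of distinct biterm types divided by the number of biterms."""
    return len(set(biterms)) / len(biterms)


def avg_nonzero_topics_per_word(topic_word_counts):
    """K_w: average number of topics with a non-zero count per word type.

    topic_word_counts[k][w] is the (hypothetical) count of word type w
    currently assigned to topic k, i.e. a K x W nested list.
    """
    num_word_types = len(topic_word_counts[0])
    nonzero_per_word = [
        sum(1 for topic_row in topic_word_counts if topic_row[w] > 0)
        for w in range(num_word_types)
    ]
    return sum(nonzero_per_word) / num_word_types


if __name__ == "__main__":
    # Toy corpus of three short texts (illustrative data only).
    docs = [["cat", "dog", "fish"], ["cat", "dog"], ["dog", "fish"]]
    biterms = [b for doc in docs for b in extract_biterms(doc)]
    print("R   =", biterm_type_ratio(biterms))

    # Toy 3-topic x 3-word count matrix (illustrative data only).
    counts = [[2, 0, 1], [0, 0, 3], [1, 4, 0]]
    print("K_w =", avg_nonzero_topics_per_word(counts))
```

On the toy corpus there are 5 biterms of 3 distinct types, so \(R = 0.6\). A smaller \(R\) means more repeated biterms, which is the situation in which the abstract's \(O(R|B|K_w)\) bound predicts the largest gain for ESparseBTM over SparseBTM.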

Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, No.2699, Qianjin Street, Chaoyang District, Changchun City, Jilin Province, China
Xiaotang Zhou, Jihong Ouyang & Ximing Li
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, No.2699, Qianjin Street, Chaoyang District, Changchun City, Jilin Province, China
Xiaotang Zhou, Jihong Ouyang & Ximing Li

Corresponding author

Correspondence to Jihong Ouyang.

Additional information

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61170092, 61133011, and 61103091.

About this article

Cite this article

Zhou, X., Ouyang, J. & Li, X. Two time-efficient gibbs sampling inference algorithms for biterm topic model. Appl Intell 48, 730–754 (2018). https://doi.org/10.1007/s10489-017-1004-2

Issue Date: March 2018
