
Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

Topic modeling is a mainstream and effective technology for dealing with text data, with wide applications in text analysis, natural language processing, personalized recommendation, computer vision, etc. Among existing topic models, supervised Latent Dirichlet Allocation (sLDA) is acknowledged as a popular and competitive supervised topic model. However, the growing scale of datasets makes sLDA increasingly inefficient and time-consuming, and limits its applicability. To address this problem, a parallel online sLDA, named PO-sLDA (Parallel and Online sLDA), is proposed in this study. It uses stochastic variational inference as the learning method to make the training procedure faster and more efficient, and a parallel computing mechanism implemented via the MapReduce framework is proposed to exploit the capacity of cloud computing for big-data processing. The online training capacity of PO-sLDA broadens the application scope of the approach, making it suitable for real-life applications with strict real-time requirements. Validation on two datasets of different sizes shows that the proposed approach achieves accuracy comparable to that of sLDA while significantly accelerating the training procedure. Moreover, its good convergence and online training capacity make it well suited to large-scale text data analysis and processing.



Author information

Corresponding author

Correspondence to Bo Yang.

Electronic supplementary material

ESM 1 (PDF 230 kb)


Cite this article

Li, Y., Song, WZ. & Yang, B. Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing. J. Comput. Sci. Technol. 33, 1007–1022 (2018). https://doi.org/10.1007/s11390-018-1871-y

