
Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

Topic modeling is a mainstream and effective technology for dealing with text data, with wide applications in text analysis, natural language processing, personalized recommendation, computer vision, etc. Among existing topic models, supervised Latent Dirichlet Allocation (sLDA) is acknowledged as a popular and competitive supervised topic model. However, the growing scale of datasets makes sLDA increasingly inefficient and time-consuming, and limits its applicability. To address this problem, a parallel online sLDA, named PO-sLDA (Parallel and Online sLDA), is proposed in this study. It uses stochastic variational inference as the learning method to make the training procedure faster and more efficient, and a parallel computing mechanism implemented via the MapReduce framework is proposed to exploit the capacity of cloud computing for big-data processing. The online training capacity of PO-sLDA broadens the application scope of the approach, making it suitable for real-life applications with strict real-time requirements. Validation on two datasets of different sizes shows that the proposed approach achieves accuracy comparable to that of sLDA while significantly accelerating the training procedure. Moreover, its good convergence and online training capacity make it well suited to large-scale text data analysis and processing.



Author information

Corresponding author

Correspondence to Bo Yang.

Electronic supplementary material

ESM 1 (PDF 230 kb)


Cite this article

Li, Y., Song, WZ. & Yang, B. Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing. J. Comput. Sci. Technol. 33, 1007–1022 (2018). https://doi.org/10.1007/s11390-018-1871-y

