A Comparative Study on Parallel LDA Algorithms in MapReduce Framework

Gao, Yang; Sun, Zhenlong; Wang, Yi; Liu, Xiaosheng; Yan, Jianfeng; Zeng, Jia

doi:10.1007/978-3-319-18032-8_53

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9078)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Although several parallel latent Dirichlet allocation (LDA) algorithms have been implemented to extract topic features from large-scale text data sets, very few studies compare their performance in real-world industrial applications. In this paper, we build a novel multi-channel MapReduce framework to fairly compare three representative parallel LDA algorithms: parallel variational Bayes (PVB), parallel Gibbs sampling (PGS) and parallel belief propagation (PBP). Experimental results confirm that PGS yields the best application performance in the search engine and online advertising system of Tencent, one of the biggest Internet companies in China, while PBP has the highest topic modeling accuracy. Moreover, PGS is more scalable in the MapReduce framework than PVB and PBP because of its low memory usage and efficient sampling technique.
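
As a rough illustration of what parallel Gibbs sampling in a map/reduce pattern can look like, the sketch below follows the common approximate-distributed scheme: each mapper runs one collapsed Gibbs sweep over its shard of documents against a local copy of the global topic-word counts, and the reducer sums the resulting count changes. This is not the multi-channel framework of the paper; the function names (`init_state`, `map_shard`, `reduce_deltas`, `run`), the hyperparameter values and the sequential execution of the mappers are all assumptions made for illustration.

```python
import random

def init_state(docs, K, V, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    doc_topic = [[0] * K for _ in docs]
    topic_word = [[0] * V for _ in range(K)]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
    return z, doc_topic, topic_word

def map_shard(doc_ids, docs, z, doc_topic, topic_word, K, V, alpha, beta, rng):
    """Mapper (hypothetical): one collapsed Gibbs sweep over a shard.

    Works on a local copy of the global topic-word counts and returns
    only the change (delta) in those counts.
    """
    local = [row[:] for row in topic_word]
    totals = [sum(row) for row in local]
    for d in doc_ids:
        for i, w in enumerate(docs[d]):
            k = z[d][i]
            # Remove the token's current assignment from the counts.
            doc_topic[d][k] -= 1
            local[k][w] -= 1
            totals[k] -= 1
            # Collapsed Gibbs conditional, up to a normalising constant.
            weights = [
                (doc_topic[d][t] + alpha) * (local[t][w] + beta) / (totals[t] + V * beta)
                for t in range(K)
            ]
            k = rng.choices(range(K), weights=weights)[0]
            z[d][i] = k
            doc_topic[d][k] += 1
            local[k][w] += 1
            totals[k] += 1
    return [[local[k][w] - topic_word[k][w] for w in range(V)] for k in range(K)]

def reduce_deltas(topic_word, deltas):
    """Reducer (hypothetical): fold every shard's delta into the global counts."""
    for delta in deltas:
        for k, row in enumerate(delta):
            for w, c in enumerate(row):
                topic_word[k][w] += c

def run(docs, K, V, n_shards=2, n_iters=5, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    z, doc_topic, topic_word = init_state(docs, K, V, rng)
    # Documents are partitioned, so z and doc_topic rows are shard-local.
    shards = [list(range(s, len(docs), n_shards)) for s in range(n_shards)]
    for _ in range(n_iters):
        # Mappers run sequentially here; all see the same global counts.
        deltas = [
            map_shard(shard, docs, z, doc_topic, topic_word, K, V, alpha, beta, rng)
            for shard in shards
        ]
        reduce_deltas(topic_word, deltas)
    return z, doc_topic, topic_word

if __name__ == "__main__":
    toy_docs = [[0, 1, 2, 0], [3, 4, 3], [0, 2, 1], [4, 3, 4, 4]]  # word ids
    _, _, counts = run(toy_docs, K=2, V=5)
    print(counts)
```

Only the K × V topic-word counts cross shard boundaries in this sketch, while the per-token assignments and document-topic counts stay with the shard that owns the documents.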

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, 215006, China
Yang Gao, Xiaosheng Liu, Jianfeng Yan & Jia Zeng
Tencent, Peking, 100080, China
Zhenlong Sun & Yi Wang

Corresponding author

Correspondence to Jia Zeng.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tru Cao
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Nanjing University, Nanjing, China
Zhi-Hua Zhou
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu-Bao Ho
The University of Hong Kong, Hong Kong, Hong Kong SAR
David Cheung
Osaka University, Osaka, Japan
Hiroshi Motoda

About this paper

Cite this paper

Gao, Y., Sun, Z., Wang, Y., Liu, X., Yan, J., Zeng, J. (2015). A Comparative Study on Parallel LDA Algorithms in MapReduce Framework. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_53

DOI: https://doi.org/10.1007/978-3-319-18032-8_53
Published: 09 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18031-1
Online ISBN: 978-3-319-18032-8
eBook Packages: Computer Science, Computer Science (R0)