Abstract
Although several parallel latent Dirichlet allocation (LDA) algorithms have been implemented to extract topic features from large-scale text data sets, very few studies compare their performance in real-world industrial applications. In this paper, we build a novel multi-channel MapReduce framework to fairly compare three representative parallel LDA algorithms: parallel variational Bayes (PVB), parallel Gibbs sampling (PGS), and parallel belief propagation (PBP). Experimental results confirm that PGS yields the best application performance in the search engine and online advertising system of Tencent, one of the biggest Internet companies in China, while PBP has the highest topic modeling accuracy. Moreover, PGS is more scalable in the MapReduce framework than PVB and PBP because of its low memory usage and efficient sampling technique.
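To make the map/reduce structure of parallel Gibbs sampling concrete, the following is a minimal single-process sketch, not the implementation evaluated in the paper. It assumes a common approximate scheme in which each mapper runs one collapsed Gibbs sweep over its own shard of documents against a private copy of the topic-word counts, and a reducer sums the resulting count deltas back into the global table. All names (`init_state`, `map_shard`, `reduce_deltas`, `run`) and the hyperparameter defaults are hypothetical and chosen only for illustration.

```python
import random


def init_state(docs, num_topics, vocab_size, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
    return z, doc_topic, topic_word


def map_shard(shard, docs, z, doc_topic, topic_word, alpha, beta, rng):
    """Map step: one collapsed Gibbs sweep over the documents in `shard`.

    The sweep updates a private copy of the topic-word counts and returns
    only the difference (delta) relative to the shared global table.
    """
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    local = [row[:] for row in topic_word]      # private copy for this mapper
    totals = [sum(row) for row in local]        # tokens per topic
    for d in shard:
        for i, w in enumerate(docs[d]):
            old = z[d][i]
            # Remove the token's current assignment from the counts.
            doc_topic[d][old] -= 1
            local[old][w] -= 1
            totals[old] -= 1
            # Unnormalized conditional probability of each topic.
            weights = [
                (doc_topic[d][k] + alpha)
                * (local[k][w] + beta)
                / (totals[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights)[0]
            z[d][i] = new
            doc_topic[d][new] += 1
            local[new][w] += 1
            totals[new] += 1
    return [
        [local[k][w] - topic_word[k][w] for w in range(vocab_size)]
        for k in range(num_topics)
    ]


def reduce_deltas(topic_word, deltas):
    """Reduce step: add every mapper's delta into the global counts."""
    for delta in deltas:
        for k, row in enumerate(delta):
            for w, change in enumerate(row):
                topic_word[k][w] += change


def run(docs, num_topics, vocab_size, num_shards=2, sweeps=5,
        alpha=0.1, beta=0.01, seed=0):
    """Simulate the map and reduce phases sequentially in one process."""
    rng = random.Random(seed)
    z, doc_topic, topic_word = init_state(docs, num_topics, vocab_size, rng)
    shards = [list(range(s, len(docs), num_shards)) for s in range(num_shards)]
    for _ in range(sweeps):
        deltas = [
            map_shard(shard, docs, z, doc_topic, topic_word, alpha, beta, rng)
            for shard in shards
        ]
        reduce_deltas(topic_word, deltas)
    return z, doc_topic, topic_word
```

In this sketch each mapper keeps only integer count tables and one topic index per token, which is one plausible reading of the low memory usage that the abstract attributes to PGS. Because mappers sample against stale global counts within a sweep, the result is an approximation of sequential Gibbs sampling.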
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Gao, Y., Sun, Z., Wang, Y., Liu, X., Yan, J., Zeng, J. (2015). A Comparative Study on Parallel LDA Algorithms in MapReduce Framework. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_53
DOI: https://doi.org/10.1007/978-3-319-18032-8_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18031-1
Online ISBN: 978-3-319-18032-8
eBook Packages: Computer Science (R0)