A Comparative Study on Parallel LDA Algorithms in MapReduce Framework

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9078)

Abstract

Although several parallel latent Dirichlet allocation (LDA) algorithms have been implemented to extract topic features from large-scale text data sets, very few studies compare their performance in real-world industrial applications. In this paper, we build a novel multi-channel MapReduce framework to fairly compare three representative parallel LDA algorithms: parallel variational Bayes (PVB), parallel Gibbs sampling (PGS), and parallel belief propagation (PBP). Experimental results confirm that PGS yields the best application performance in the search engine and online advertising systems of Tencent, one of the biggest Internet companies in China, while PBP has the highest topic modeling accuracy. Moreover, PGS is more scalable in the MapReduce framework than PVB and PBP because of its low memory usage and efficient sampling technique.

Author information

Corresponding author

Correspondence to Jia Zeng.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Gao, Y., Sun, Z., Wang, Y., Liu, X., Yan, J., Zeng, J. (2015). A Comparative Study on Parallel LDA Algorithms in MapReduce Framework. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_53

  • DOI: https://doi.org/10.1007/978-3-319-18032-8_53

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18031-1

  • Online ISBN: 978-3-319-18032-8

  • eBook Packages: Computer Science, Computer Science (R0)
