Skip to main content
Log in

Online belief propagation algorithm for probabilistic latent semantic analysis

  • Research Article
  • Published:
Frontiers of Computer Science Aims and scope Submit manuscript

Abstract

Probabilistic latent semantic analysis (PLSA) is a topic model for text documents, which has been widely used in text mining, computer vision, computational biology and so on. For batch PLSA inference algorithms, the required memory size grows linearly with the data size, and handling massive data streams is very difficult. To process big data streams, we propose an online belief propagation (OBP) algorithm based on the improved factor graph representation for PLSA. The factor graph of PLSA facilitates the classic belief propagation (BP) algorithm. Furthermore, OBP splits the data stream into a set of small segments, and uses the estimated parameters of previous segments to calculate the gradient descent of the current segment. Because OBP removes each segment from memory after processing, it is memory-efficient for big data streams. We examine the performance of OBP on four document data sets, and demonstrate that OBP is competitive in both speed and accuracy for online expectation maximization (OEM) in PLSA, and can also give a more accurate topic evolution. Experiments on massive data streams from Baidu further confirm the effectiveness of the OBP algorithm.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Salton G, Wong A, Yang C S. A vector space model for automatic indexing. Communications of the ACM, 1975, 18(11): 613–620

    Article  MATH  Google Scholar 

  2. Thomas K, Landauer P W F, Laham A F. An introduction to latent semantic analysis. Communications of the ACM, 1998, 25: 259–284

    Google Scholar 

  3. Hoffman T. Probabilistic latent semantic analysis: uncertainty in artificial intelligence. 1999

    Google Scholar 

  4. Blei DM, Ng A Y, Jordan MI. Latent dirichlet allocation. The Journal of Machine Learning Research, 2003, 3: 993–1022

    MATH  Google Scholar 

  5. Canini K R, Shi L, Griffiths T L. Online inference of topics with latent dirichlet allocation. In: Proceedings of the 2009 International Conference on Artificial Intelligence and Statistics. 2009, 65–72

    Google Scholar 

  6. Zeng J, Cheung W K, Liu J. Learning topic models by belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 1

    Google Scholar 

  7. Zhuang L, She L, Jiang Y, Tang K, Yu N. Image classification via semi-supervised PLSA. In: Proceedings of the 5th International Conference on Image and Graphics. 2009, 205–208

    Google Scholar 

  8. Xu J, Ye G, Wang Y, Wang W, Yang J. Online learning for plsa-based visual recognition. Computer Vision-ACCV 2010, 2011, 95–108

    Chapter  Google Scholar 

  9. AlSumait L, Barbará D, Domeniconi C. On-line LDA: adaptive topic models for mining text streams with applications to topic detection and tracking. In: Proceedings of the 8th IEEE International Conference on Data Mining. 2008, 3–12

    Google Scholar 

  10. Yao L, Mimno D, McCallum A. Efficient methods for topic model inference on streaming document collections. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009, 937–946

    Chapter  Google Scholar 

  11. Hoffman M D, Blei D M, Bach F. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 2010, 23: 856–864

    Google Scholar 

  12. Wang C, Paisley J, Blei D M. Online variational inference for the hierarchical dirichlet process. In: Proceedings of the 14th Intenational Conference on Artificial Intelligence and Statistics. 2011, 752–760

    Google Scholar 

  13. Banerjee A, Basu S. Topic models over text streams: a study of batch and online unsupervised learning. In: Proceedings of the 2007 SIAM International Conference on Data Mining. 2007, 431–436

    Google Scholar 

  14. Nair V, Clark J J. An unsupervised, online learning framework for moving object detection. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2004, II-317–II-324

    Google Scholar 

  15. Pham M T, Cham T J. Online learning asymmetric boosted classifiers for object detection. In: Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition. 2007, 1–8

    Chapter  Google Scholar 

  16. Shalev-Shwartz S, Singer Y, Ng A Y. Online and batch learning of pseudo-metrics. In: Proceedings of the 21st International Conference on Machine Learning. 2004

    Google Scholar 

  17. Mairal J, Bach F, Ponce J, Sapiro G. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 2010, 11: 19–60

    MathSciNet  MATH  Google Scholar 

  18. Vijayakumar S, D’souza A, Schaal S. Incremental online learning in high dimensions. Neural Computation, 2005, 17(12): 2602–2634

    Article  MathSciNet  Google Scholar 

  19. Kivinen J, Smola A J, Williamson R C. Online learning with kernels. IEEE Transactions on Signal Processing, 2004, 52(8): 2165–2176

    Article  MathSciNet  Google Scholar 

  20. Xu J, Ye G, Wang Y, Herman G, Zhang B, Yang J. Incremental EM for probabilistic latent semantic analysis on human action recognition. In: Proceedings of the 6th IEEE International Conference on Advanced Video and Signal Based Surveillance. 2009, 55–60

    Google Scholar 

  21. Singh M, Khan F U. Effect of incremental EM on document summarization using probabilistic latent semantic analysis. Lecture Notes in Engineering and Computer Science, 2012, 2198

    Google Scholar 

  22. Bottou L. Online learning and stochastic approximations. On-line Learning in Neural Networks, 1998, 17: 9–42

    Google Scholar 

  23. Zhu S, Zeng J, Mamitsuka H. Enhancing medline document clustering by incorporating mesh semantic similarity. Bioinformatics, 2009, 25(15): 1944–1951

    Article  Google Scholar 

  24. Globerson A, Chechik G, Pereira F, Tishby N. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 2007, 8: 2047–2076

    MathSciNet  Google Scholar 

  25. Eisenstein J, Xing E. The CMU 2008 political blog corpus. Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2010

    Google Scholar 

  26. Porteous I, Newman D, Ihler A, Asuncion A, Smyth P, Welling M. Fast collapsed gibbs sampling for latent dirichlet allocation. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, 569–577

    Chapter  Google Scholar 

  27. Zeng J. A topic modeling toolbox using belief propagation. Journal of Machine Learning Research, 2012, 13: 2233–2236

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Shengrong Gong.

Additional information

Yun Ye received her BS in mathematical and applied mathematics from Nanjing University of Finance and Economics in 2010. She is currently an MS candidate at Soochow University, where she is studying topic modeling for dynamic network data.

Shengrong Gong received his MS from Harbin Institute of Technology in 1993, and his PhD from Beihang University in 2001. He is a professor and doctoral supervisor of the School of Computer Science and Technology, Soochow University. His research interests are image and video processing, pattern recognition, and computer vision.

Chunping Liu is an associate professor of the School of Computer Science and Technology, Soochow University. In 2002, she received her PhD in Pattern Recognition and Intelligent Systems Engineering from the Department of Computer Science, Nanjing University of Science and Technology. Her research interests are image and video processing, pattern recognition, and and computer vision.

Jia Zeng received his BE from Wuhan University of Technology, Wuhan, China, in 2002, and his PhD from the City University of Hong Kong, in 2007. He is a professor in the School of Computer Science and Technology, Soochow University. His research interests are machine learning and computational biology. He is a member of the CCF, the IEEE, and the ACM.

Ning Jia is a senior development engineer of Baidu, Inc. He received his PhD from the Institute of Acoustics, Chinese Academy of Sciences (IACAS) in 2008. His research interest is topic models.

Yi Zhang is a senior development engineer of Baidu, Inc. Currently he leads the key word recommender group of the Electronic Commerce department, and mainly focuses on sponsored search, machine learning, and data mining. He received his MS in Computer Science from Zhejiang University in 2008.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ye, Y., Gong, S., Liu, C. et al. Online belief propagation algorithm for probabilistic latent semantic analysis. Front. Comput. Sci. 7, 526–535 (2013). https://doi.org/10.1007/s11704-013-2360-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11704-013-2360-7

Keywords

Navigation