Abstract
Probabilistic latent semantic analysis (PLSA) is a topic model for text documents that has been widely used in text mining, computer vision, and computational biology. For batch PLSA inference algorithms, the required memory grows linearly with the data size, which makes massive data streams very difficult to handle. To process big data streams, we propose an online belief propagation (OBP) algorithm based on an improved factor graph representation of PLSA. This factor graph makes it possible to apply the classic belief propagation (BP) algorithm. OBP splits the data stream into a set of small segments and uses the parameters estimated from previous segments to compute the gradient descent update for the current segment. Because OBP removes each segment from memory after processing it, the algorithm is memory-efficient for big data streams. We examine the performance of OBP on four document data sets and demonstrate that it is competitive with online expectation maximization (OEM) for PLSA in both speed and accuracy, and that it also gives a more accurate topic evolution. Experiments on massive data streams from Baidu further confirm the effectiveness of the OBP algorithm.
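The segment-wise processing described above can be illustrated with a minimal sketch. It is not the OBP algorithm itself: it swaps in plain EM-style updates on each segment instead of belief propagation messages, and the names `process_segment`, `fit_stream`, `prev_stats`, and the random initialisation are assumptions made for illustration only. What it shows is the memory pattern: only the accumulated topic-word statistics survive from one segment to the next.

```python
import numpy as np

def process_segment(counts, prev_stats, n_iters=20, seed=0):
    """Fit one segment, then fold its statistics into the running totals.

    counts:     (docs, vocab) word counts for the current segment only.
    prev_stats: (topics, vocab) positive statistics from earlier segments.
    """
    n_topics = prev_stats.shape[0]
    rng = np.random.default_rng(seed)
    # Document-topic proportions are local to the segment (hypothetical init).
    theta = rng.dirichlet(np.ones(n_topics), size=counts.shape[0])
    local_stats = np.zeros_like(prev_stats)
    for _ in range(n_iters):
        # Topic-word distributions combine earlier segments with this one.
        phi = prev_stats + local_stats
        phi /= phi.sum(axis=1, keepdims=True)
        # resp[d, w, k] is proportional to theta[d, k] * phi[k, w].
        resp = theta[:, None, :] * phi.T[None, :, :]
        resp /= resp.sum(axis=2, keepdims=True) + 1e-12
        weighted = counts[:, :, None] * resp
        theta = weighted.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
        local_stats = weighted.sum(axis=0).T
    return prev_stats + local_stats

def fit_stream(segments, n_topics, vocab_size, seed=0):
    """Consume an iterable of segments, keeping only (topics, vocab) state."""
    rng = np.random.default_rng(seed)
    stats = rng.random((n_topics, vocab_size)) + 0.01
    for i, counts in enumerate(segments):
        counts = np.asarray(counts, dtype=float)
        stats = process_segment(counts, stats, seed=seed + i)
        # The segment is not stored anywhere after this point.
    return stats / stats.sum(axis=1, keepdims=True)
```

Passing a generator as `segments` keeps at most one segment in memory at a time, so the state carried across the stream is of size topics × vocabulary regardless of how many documents have been seen.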
Author information
Yun Ye received her BS in mathematics and applied mathematics from Nanjing University of Finance and Economics in 2010. She is currently an MS candidate at Soochow University, where she is studying topic modeling for dynamic network data.
Shengrong Gong received his MS from Harbin Institute of Technology in 1993, and his PhD from Beihang University in 2001. He is a professor and doctoral supervisor of the School of Computer Science and Technology, Soochow University. His research interests are image and video processing, pattern recognition, and computer vision.
Chunping Liu is an associate professor of the School of Computer Science and Technology, Soochow University. In 2002, she received her PhD in Pattern Recognition and Intelligent Systems Engineering from the Department of Computer Science, Nanjing University of Science and Technology. Her research interests are image and video processing, pattern recognition, and computer vision.
Jia Zeng received his BE from Wuhan University of Technology, Wuhan, China, in 2002, and his PhD from the City University of Hong Kong, in 2007. He is a professor in the School of Computer Science and Technology, Soochow University. His research interests are machine learning and computational biology. He is a member of the CCF, the IEEE, and the ACM.
Ning Jia is a senior development engineer of Baidu, Inc. He received his PhD from the Institute of Acoustics, Chinese Academy of Sciences (IACAS) in 2008. His research interest is topic models.
Yi Zhang is a senior development engineer of Baidu, Inc. Currently he leads the key word recommender group of the Electronic Commerce department, and mainly focuses on sponsored search, machine learning, and data mining. He received his MS in Computer Science from Zhejiang University in 2008.
Cite this article
Ye, Y., Gong, S., Liu, C. et al. Online belief propagation algorithm for probabilistic latent semantic analysis. Front. Comput. Sci. 7, 526–535 (2013). https://doi.org/10.1007/s11704-013-2360-7