Self-organizing weighted incremental probabilistic latent semantic analysis

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

PLSA (Probabilistic Latent Semantic Analysis) is a popular topic modeling technique that has been widely applied in text mining to discover the underlying topics embedded in a data corpus. However, because data keep growing and changing, it is necessary to discover dynamic topics and to process large datasets incrementally. Moreover, PLSA models have difficulty performing inference on new documents. To overcome these problems, we propose a novel Weighted Incremental PLSA algorithm, called WIPLSA, that dynamically discovers topics and incrementally learns topics from new documents. The experiments verify that the proposed WIPLSA can capture the dynamic topics hidden in a dynamically updated data corpus. Compared with PLSA, MAP PLSA and QB PLSA, WIPLSA achieves better perplexity on large datasets, which makes it applicable to big data mining. In addition, WIPLSA performs well in document categorization.
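
For readers unfamiliar with the metric, the following is a minimal sketch of how perplexity is commonly computed for a PLSA-style model, where P(w|d) = sum over z of P(z|d) P(w|z) and lower values indicate a better fit. It is not taken from the article: the function name `plsa_perplexity` and the nested-list layout of the inputs are assumptions made for illustration, and the article's own evaluation protocol (for example, how held-out documents are handled) may differ.

```python
import math


def plsa_perplexity(counts, p_w_given_z, p_z_given_d):
    """Perplexity of a PLSA-style model on a document-term count matrix.

    Hypothetical helper for illustration only (not from the article).
    counts[d][w]      : number of times word w occurs in document d
    p_w_given_z[z][w] : P(w | z), each row sums to 1
    p_z_given_d[d][z] : P(z | d), each row sums to 1
    """
    num_topics = len(p_w_given_z)
    log_likelihood = 0.0
    total_tokens = 0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n == 0:
                continue  # unseen words contribute nothing to the sum
            # Mixture probability of word w in document d.
            p_w_d = sum(p_z_given_d[d][z] * p_w_given_z[z][w]
                        for z in range(num_topics))
            log_likelihood += n * math.log(p_w_d)
            total_tokens += n
    # exp of the negative average per-token log-likelihood.
    return math.exp(-log_likelihood / total_tokens)


# Toy data (made up): two documents, four vocabulary words, two topics.
toy_counts = [[1, 2, 0, 1], [0, 3, 1, 0]]
uniform_topics = [[0.25] * 4, [0.25] * 4]
toy_mixtures = [[0.5, 0.5], [0.3, 0.7]]
toy_perplexity = plsa_perplexity(toy_counts, uniform_topics, toy_mixtures)
```

With uniform topics over a four-word vocabulary, every word has probability 0.25, so the perplexity equals the vocabulary size (4, up to floating-point rounding). A model that has learned useful topics should score below that baseline.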

Notes

  1. http://news.163.com/special/.

Acknowledgements

The work is supported by the National Natural Science Foundation of China (No. 91546122, 61602438, 61573335, 61473273, 61473274, 61363058), National High-tech R&D Program of China (863 Program) (No. 2014AA015105), National Science and Technology Support Program (No. 2014BAK02B07), National major R&D program of Beijing Municipal Science & Technology Commission (Z161100002616032), Guangdong provincial science and technology plan projects (No. 2015 B 010109005).

Author information

Corresponding author

Correspondence to Ning Li.

About this article

Cite this article

Li, N., Luo, W., Yang, K. et al. Self-organizing weighted incremental probabilistic latent semantic analysis. Int. J. Mach. Learn. & Cyber. 9, 1987–1998 (2018). https://doi.org/10.1007/s13042-017-0681-9

  • DOI: https://doi.org/10.1007/s13042-017-0681-9
