ABSTRACT
Short texts have become a common form of social communication on the internet, and the number produced each day has grown significantly. Extracting topics from large collections of short texts is one of the most challenging tasks in natural language processing, yet it has numerous real-world applications. This study compares the topic extraction performance of the Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) algorithms on Indonesian short texts. The data were gathered from news articles about electric vehicles published on the online news site Kompas.com. In terms of topic coherence scores, our results show that LDA outperforms NMF and GSDMM. However, human judgment indicates that the word clusters produced by NMF and GSDMM are easier to interpret.