DOI: 10.1145/3575882.3575905
research-article

Performance Comparison of Topic Modeling Algorithms on Indonesian Short Texts

Published: 27 February 2023

ABSTRACT

The number of short texts produced daily has increased significantly, as they are a common form of social communication on the internet. Extracting topics from large collections of short texts is one of the more challenging tasks in natural language processing, yet it has numerous real-world applications. The purpose of this study is to compare the topic-extraction performance of the Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) algorithms on Indonesian short texts. The data was gathered from news articles about electric vehicles published on the online news site Kompas.com. In terms of topic coherence scores, our results show that LDA outperforms NMF and GSDMM. However, human judgment indicates that the word clusters produced by NMF and GSDMM are easier to interpret.
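Of the three algorithms compared, GSDMM is the one designed specifically for short texts: it assigns each document to exactly one cluster, which fits the assumption that a short text covers a single topic. As a rough illustration only (not the authors' implementation, and with toy hyperparameters chosen for the example), a minimal pure-Python collapsed Gibbs sampler for the Dirichlet multinomial mixture might look like:

```python
import math
import random
from collections import Counter, defaultdict

def gsdmm(docs, K=3, alpha=0.1, beta=0.1, n_iters=20, seed=1):
    """Collapsed Gibbs sampling for the Dirichlet Multinomial Mixture
    (Yin & Wang, 2014). Each document gets one cluster label in [0, K)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})           # vocabulary size
    z = [rng.randrange(K) for _ in docs]            # cluster label per document
    m_k = [0] * K                                   # documents in cluster k
    n_k = [0] * K                                   # total words in cluster k
    n_kw = [defaultdict(int) for _ in range(K)]     # count of word w in cluster k
    for d, k in zip(docs, z):
        m_k[k] += 1
        n_k[k] += len(d)
        for w in d:
            n_kw[k][w] += 1
    for _ in range(n_iters):
        for i, d in enumerate(docs):
            k = z[i]                                # remove doc i from its cluster
            m_k[k] -= 1
            n_k[k] -= len(d)
            for w in d:
                n_kw[k][w] -= 1
            # log p(z_i = k | everything else), up to a constant in k
            log_p = []
            for kk in range(K):
                lp = math.log(m_k[kk] + alpha)
                seen = Counter()
                for w in d:                         # word-level numerator terms
                    lp += math.log(n_kw[kk][w] + beta + seen[w])
                    seen[w] += 1
                for t in range(len(d)):             # denominator terms
                    lp -= math.log(n_k[kk] + V * beta + t)
                log_p.append(lp)
            mx = max(log_p)
            weights = [math.exp(l - mx) for l in log_p]
            k = rng.choices(range(K), weights=weights)[0]
            z[i] = k                                # reinsert doc i
            m_k[k] += 1
            n_k[k] += len(d)
            for w in d:
                n_kw[k][w] += 1
    return z, n_kw

# Toy corpus (hypothetical tokens, two obvious themes) to show the interface.
docs = [["battery", "charge", "ev"], ["charge", "battery", "range"],
        ["soccer", "goal", "match"], ["goal", "match", "soccer"],
        ["battery", "range", "ev"], ["match", "soccer", "league"]]
labels, counts = gsdmm(docs)
```

In practice one starts with K well above the expected number of topics and lets clusters empty out during sampling; the per-cluster word counts `counts` then give the top words used for human judgment of the topics.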


  • Published in

    IC3INA '22: Proceedings of the 2022 International Conference on Computer, Control, Informatics and Its Applications
    November 2022
    415 pages
    ISBN: 9781450397902
    DOI: 10.1145/3575882

    Copyright © 2022 ACM

    © 2022 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited