Performance Comparison of Topic Modeling Algorithms on Indonesian Short Texts

Authors:
Nuraisa Novia Hidayati
National Research and Innovation Agency (BRIN), Indonesia
ORCID: 0000-0001-5606-8627

Anne Parlina
National Research and Innovation Agency (BRIN), Indonesia
ORCID: 0000-0001-9460-6895

IC3INA '22: Proceedings of the 2022 International Conference on Computer, Control, Informatics and Its Applications, November 2022, Pages 117–120. https://doi.org/10.1145/3575882.3575905

Published: 27 February 2023

ABSTRACT

The number of short texts produced daily has grown significantly as short-form messaging has become a common mode of social communication on the internet. Extracting topics from large collections of short texts is one of the more challenging tasks in natural language processing, yet it has numerous real-world applications. This study compares the topic extraction performance of three algorithms on Indonesian short texts: Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM). The data were gathered from news articles about electric vehicles published on the online news site Kompas.com. In terms of topic coherence scores, LDA outperforms NMF and GSDMM. However, human judgment indicates that the word clusters produced by NMF and GSDMM are easier to interpret.
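
To make the notion of a topic coherence score concrete, the sketch below computes the UMass coherence of a topic's top words against a reference corpus. The abstract does not say which coherence measure the study used, so UMass is an assumption chosen for illustration. The toy corpus, the word lists, and the function name `umass_coherence` are likewise hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over all pairs with j < i,
    where D(...) is the number of documents containing the given words.
    Values closer to zero indicate words that co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # guarantees j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # word absent from the reference corpus; skip the pair
        score += math.log((doc_freq(w_i, w_j) + 1) / d_j)
    return score


# Hypothetical tokenized short texts (illustrative only, not the study's data).
docs = [
    ["mobil", "listrik", "baterai"],
    ["mobil", "listrik", "harga"],
    ["baterai", "listrik", "pabrik"],
    ["harga", "subsidi", "pemerintah"],
    ["subsidi", "pemerintah", "motor"],
]

# Top words of two hypothetical topics, ordered from most to least probable.
coherent_topic = ["listrik", "mobil", "baterai"]
mixed_topic = ["listrik", "subsidi", "pabrik"]

coherent_score = umass_coherence(coherent_topic, docs)
mixed_score = umass_coherence(mixed_topic, docs)
print(coherent_score, mixed_score)
```

On this toy corpus the topic whose words frequently share documents scores higher than the mixed one. A score of this kind only measures co-occurrence, which is one reason a coherence ranking and a human interpretability ranking, as reported in the abstract, need not agree.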

Published in

IC3INA '22: Proceedings of the 2022 International Conference on Computer, Control, Informatics and Its Applications
November 2022
415 pages
ISBN: 9781450397902
DOI: 10.1145/3575882

Copyright © 2022 ACM
© 2022 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States