When are Latent Topics Useful for Text Mining?

Kanungsukkasem, Nont; Chuangkrud, Piyawat; Pitichotchokphokhin, Pimpitcha; Damrongrat, Chaianun; Leelanupab, Teerapong

doi:10.1007/978-3-031-42430-4_17

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1863)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

The Bag-of-Words (BOW) model is a simple yet successful representation of text documents. It suffers, however, from sparsity: most elements of the resulting document-term matrix are zero. Topic modeling is an unsupervised learning method that can represent text documents in a low-dimensional space. Latent Dirichlet Allocation (LDA) is a topic modeling technique used for topic extraction and data exploration, and its output is interpretable. This paper presents a thorough study of the potential benefits of applying LDA as a feature extraction method for topic discovery and document classification in Thai news articles, in comparison with TF–IDF and Word2Vec. We also studied how many of the top Thai terms extracted by LDA with different numbers of topics are interpretable and meaningful, and how well they represent the corpus. In addition, a set of Topic Coherence measures was included in our study to estimate the degree of semantic similarity of the extracted topics. To compare classification performance and optimization time across the features produced by the different feature extraction methods, we experimented with various types of classifiers, including Logistic Regression, Random Forest, and XGBoost.
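To make the idea of a topic coherence measure concrete, the following is a minimal sketch of one common variant: the average normalized pointwise mutual information (NPMI) over pairs of a topic's top terms, estimated from document-level co-occurrence. It is an illustration only. The abstract does not say which coherence measures the authors used or how they were computed, and the function name `npmi_coherence`, the toy English corpus, and the edge-case conventions are all assumptions made for this example.

```python
import math
from itertools import combinations


def npmi_coherence(top_terms, documents):
    """Average NPMI over all pairs of top terms (hypothetical helper).

    Probabilities are estimated as the fraction of documents that
    contain a term (or both terms of a pair).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*terms):
        # Fraction of documents containing every term in `terms`.
        hits = sum(1 for d in doc_sets if all(t in d for t in terms))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(top_terms, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            # Terms never co-occur: NPMI tends to -1 in the limit.
            scores.append(-1.0)
        elif p12 == 1.0:
            # Both terms appear in every document: 0/0, scored as 0
            # here (a convention assumed for this sketch).
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy tokenized corpus (assumed data, not from the paper).
docs = [
    ["stock", "market", "bank", "rate"],
    ["stock", "market", "investor"],
    ["bank", "rate", "loan"],
    ["football", "match", "goal"],
    ["football", "goal", "league"],
    ["match", "league", "stock"],
]

coherent_score = npmi_coherence(["stock", "market", "bank"], docs)
mixed_score = npmi_coherence(["stock", "goal", "loan"], docs)
print(f"coherent topic: {coherent_score:.3f}")
print(f"mixed topic:    {mixed_score:.3f}")
```

In this toy setting, a topic whose top terms tend to appear in the same documents scores higher than one whose terms never co-occur. For Thai text, the documents would first have to be segmented into words, which this sketch takes as given.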

Notes

1. https://www.bangkokbiznews.com.
2. The “Lifestyle” category includes content from other subcategories, e.g., health and sport.
3. As Accuracy is expressed as a percentage, we do not need any normalization like that for TL.
4. We provide a hyperlink for each Thai word leading to its meaning in English.
5. All 37 topics and 300 topics can be viewed via the provided link attached to this footnote.

Acknowledgements

This work was supported by the KMITL Research Fund under the Research Seed Grant for New Lecturer, grant number KREF186507.

Author information

Authors and Affiliations

Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang (KMITL), Bangkok, 10520, Thailand
Nont Kanungsukkasem, Piyawat Chuangkrud, Pimpitcha Pitichotchokphokhin & Teerapong Leelanupab
National Electronics and Computer Technology Center (NECTEC), Pathumthani, 12120, Thailand
Piyawat Chuangkrud & Chaianun Damrongrat
The University of Queensland, Brisbane, QLD, 4072, Australia
Teerapong Leelanupab

Corresponding author

Correspondence to Teerapong Leelanupab.

Editor information

Editors and Affiliations

Wrocław University of Technology, Wrocław, Poland
Ngoc Thanh Nguyen
King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
Siridech Boonsang
Iwate Prefectural University, Iwate, Japan
Hamido Fujita
Wrocław University of Science and Technology, Wrocław, Poland
Bogumiła Hnatkowska
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
King Mongkut's Institute of Technology, Ladkrabang, Thailand
Kitsuchart Pasupa
Malaysia Japan International Institute of Technology, Kuala Lumpur, Malaysia
Ali Selamat

Cite this paper

Kanungsukkasem, N., Chuangkrud, P., Pitichotchokphokhin, P., Damrongrat, C., Leelanupab, T. (2023). When are Latent Topics Useful for Text Mining? In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2023. Communications in Computer and Information Science, vol 1863. Springer, Cham. https://doi.org/10.1007/978-3-031-42430-4_17

DOI: https://doi.org/10.1007/978-3-031-42430-4_17
Published: 29 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42429-8
Online ISBN: 978-3-031-42430-4
eBook Packages: Computer Science, Computer Science (R0)
