Using Topic Models to Label Documents for Classification

Lam, Khang Nhut; Truong, Lam Thanh; Kalita, Jugal

doi:10.1007/978-981-33-4370-2_32

Khang Nhut Lam,
Lam Thanh Truong &
Jugal Kalita

Conference paper
First Online: 19 November 2020

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1306)

Abstract

Document classifiers are supervised learning models that assign categories to documents after being trained on annotated datasets. In this paper, we use topic models to automatically assign categories to documents; the labeled documents are then fed to document classification models. We perform experiments on several Vietnamese datasets collected from free online resources. Our method is promising and applicable to many datasets that have not been labeled.
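
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, uses latent Dirichlet allocation as the topic model and multinomial naive Bayes as the classifier, and runs on a tiny hypothetical English corpus rather than the Vietnamese datasets used in the paper. The helper names `label_with_topics` and `train_classifier` are invented for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (assumption: the paper's data is Vietnamese text).
docs = [
    "the team won the football match",
    "the striker scored a goal in the match",
    "the coach praised the football team",
    "the bank raised interest rates",
    "investors watch the stock market and interest rates",
    "the stock market fell after the bank report",
]


def label_with_topics(texts, n_topics=2, seed=0):
    """Return, for each document, the index of its most probable LDA topic."""
    counts = CountVectorizer(stop_words="english").fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(counts)  # shape: (n_docs, n_topics)
    return doc_topic.argmax(axis=1)


def train_classifier(texts, labels):
    """Train a supervised classifier on the automatically assigned labels."""
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(texts)
    classifier = MultinomialNB().fit(features, labels)
    return vectorizer, classifier


# Step 1: the topic model assigns a category to every unlabeled document.
pseudo_labels = label_with_topics(docs)

# Step 2: the labeled documents are fed to a document classification model.
vectorizer, classifier = train_classifier(docs, pseudo_labels)

# Step 3: the trained classifier predicts a category for an unseen document.
new_doc = ["the goalkeeper saved the match"]
predicted = classifier.predict(vectorizer.transform(new_doc))
print(pseudo_labels, predicted)
```

The topic indices carry no meaning by themselves; in practice someone would still inspect the top words of each topic to give it a readable category name, and the number of topics is a parameter that has to be chosen for each dataset.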


Author information

Authors and Affiliations

Can Tho University, Can Tho, Vietnam
Khang Nhut Lam & Lam Thanh Truong
University of Colorado, Colorado Springs, USA
Jugal Kalita

Corresponding author

Correspondence to Khang Nhut Lam.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Johannes Kepler University of Linz, Linz, Austria
Josef Küng
Hosei University, Tokyo, Japan
Makoto Takizawa
Sungkyunkwan University, Suwon, Korea (Republic of)
Tai M. Chung

About this paper

Cite this paper

Lam, K.N., Truong, L.T., Kalita, J. (2020). Using Topic Models to Label Documents for Classification. In: Dang, T.K., Küng, J., Takizawa, M., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2020. Communications in Computer and Information Science, vol 1306. Springer, Singapore. https://doi.org/10.1007/978-981-33-4370-2_32

DOI: https://doi.org/10.1007/978-981-33-4370-2_32
Published: 19 November 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-4369-6
Online ISBN: 978-981-33-4370-2
eBook Packages: Computer Science, Computer Science (R0)