Text Classification Models and Topic Models: An Overall Picture and a Case Study in Vietnamese

  • Conference paper

In: Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications (FDSE 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1688)

Abstract

Document classifiers are supervised learning models that assign labels to documents after being trained on labeled datasets. A classifier's accuracy depends on the size and quality of its training data, which are costly and time-consuming to construct. In addition, a suitable word representation method can improve the quality of a text classifier. In this paper, we study the effect of different word representation methods on 16 classification models trained on a labeled dataset. We then evaluate the ability of 6 topic models to discover latent topics. Based on experimental results from combining classification models and topic models, we propose a method that uses topic models together with classification models to label datasets for training classifiers. Although we experiment on a Vietnamese document dataset, our approach can be applied to other datasets and does not require any labeled data for bootstrapping.
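The labeling idea described in the abstract can be sketched as a two-stage pipeline: an unsupervised topic model assigns a pseudo-label (its dominant topic) to each unlabeled document, and those pseudo-labels then supervise an ordinary classifier. The following is a minimal illustration using scikit-learn, with NMF standing in as the topic model; it is not the authors' exact method, and the toy corpus and all variable names are invented for this sketch.

```python
# Sketch: label documents with a topic model, then train a classifier
# on the pseudo-labels (assumes scikit-learn; NMF is one of several
# topic models one could use here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression

docs = [
    "the team won the football match last night",
    "the striker scored two goals in the game",
    "the championship game went into overtime",
    "the stock market fell sharply on monday",
    "investors sold shares as prices dropped",
    "the central bank raised interest rates",
]

# Step 1: represent documents as TF-IDF vectors.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Step 2: discover latent topics and take each document's dominant
# topic as its pseudo-label (no hand labeling required).
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)          # document-topic weights, shape (6, 2)
pseudo_labels = W.argmax(axis=1)  # dominant topic per document

# Step 3: train a supervised classifier on the pseudo-labels.
clf = LogisticRegression(max_iter=1000).fit(X, pseudo_labels)
preds = clf.predict(X)
print(preds)
```

In a real setting one would inspect the top terms of each topic to give the pseudo-labels human-readable names before training the classifier.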


Notes

  1. https://www.sketchengine.eu/covid19/
  2. https://github.com/duyvuleo/VNTC
  3. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
  4. https://pypi.org/project/underthesea/
  5. https://github.com/stopwords/vietnamese-stopwords
  6. https://scikit-learn.org/
  7. https://fasttext.cc/
  8. https://github.com/google-research/bert
  9. https://huggingface.co/vinai/phobert-base
  10. https://maartengr.github.io/BERTopic/index.html


Author information

Corresponding author

Correspondence to Khang Nhut Lam.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Lam, K.N., Tran, VL.L., Kalita, J. (2022). Text Classification Models and Topic Models: An Overall Picture and a Case Study in Vietnamese. In: Dang, T.K., Küng, J., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2022. Communications in Computer and Information Science, vol 1688. Springer, Singapore. https://doi.org/10.1007/978-981-19-8069-5_25

  • DOI: https://doi.org/10.1007/978-981-19-8069-5_25

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8068-8

  • Online ISBN: 978-981-19-8069-5

  • eBook Packages: Computer Science, Computer Science (R0)
