Abstract
Text classification performance can suffer when features are represented with the bag-of-words model alone, because that model ignores word order and word senses, which can vary with context. Embeddings have recently emerged as a means to circumvent these limitations, allowing considerable performance gains. However, determining the best combination of classification technique and embedding for a particular corpus can be challenging. This survey provides a comprehensive review of text classification approaches that employ embeddings. First, it analyzes past and recent advances in feature representation for text classification. Then, it identifies the combinations of embedding-based feature representations and classification techniques that have achieved the best performance on distinct corpora, and provides links to the original articles, the source code (when available), and the data sets used in the performance evaluation. Finally, it discusses current challenges and promising directions for text classification research, such as cost-effectiveness, multi-label classification, and the potential of knowledge graphs and knowledge embeddings to enhance text classification.
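The word-order limitation of the bag-of-words model can be illustrated with a minimal sketch. The function name `bag_of_words` and the lowercase, whitespace-based tokenization are assumptions made for this illustration only; they are not taken from the survey.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated tokens (hypothetical helper).

    Token positions are discarded, which is the limitation discussed above.
    """
    return Counter(text.lower().split())


# Two sentences with opposite meanings but the same multiset of words.
first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# The two representations are indistinguishable to a downstream classifier.
identical = first == second
print(identical)
```

Because both sentences map to the same token counts, any classifier fed only these features cannot tell them apart. This is the kind of information that order-aware or contextual embeddings are meant to retain.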
Acknowledgements
This study was financed by the Fundação de Amparo à Pesquisa e Inovação do Estado de Santa Catarina—Brasil (FAPESC), by the Print CAPES-UFSC Automation 4.0 Project, and by the Brazilian National Laboratory for Scientific Computing (LNCC).
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
da Costa, L.S., Oliveira, I.L. & Fileto, R. Text classification using embeddings: a survey. Knowl Inf Syst 65, 2761–2803 (2023). https://doi.org/10.1007/s10115-023-01856-z
DOI: https://doi.org/10.1007/s10115-023-01856-z