Abstract
This research investigates the challenges and effectiveness of three families of text representation (standard vector, grammar-based, and distributed) when applied to clustering short texts. The study uses Bag-of-Words as the standard vector representation; Linguistic Inquiry and Word Count (LIWC), Part-of-Speech Tagging (POS-Tagging), and the Medical Research Council Psycholinguistic Database (MRC) as grammar-based representations; and Word2Vec, fastText, Doc2Vec, and SentenceBERT as distributed representations. Using the aiNet bio-inspired clustering algorithm, the experiments yield a surprising result: grammar-based representations perform competitively despite their simplicity, while standard vector representations exhibit known challenges such as high dimensionality. The study contributes insights into the properties of different text representations, providing a foundation for optimizing their application in clustering tasks with short and informal texts.
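The dimensionality contrast mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy texts, the function names `bag_of_words` and `surface_features`, and the three surface features are all assumptions made for illustration. In particular, `surface_features` is only a hypothetical fixed-length stand-in for a grammar-based profile; it does not use LIWC, a POS tagger, or the MRC database.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, vectors): one raw term-count vector per text."""
    tokenized = [text.lower().split() for text in texts]
    # The vector length equals the vocabulary size, so it grows with the corpus.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


def surface_features(text):
    """Hypothetical fixed-length profile (NOT LIWC, POS-Tagging, or MRC):
    [token count, mean token length, share of tokens longer than 3 characters]."""
    tokens = text.lower().split()
    if not tokens:
        return [0, 0.0, 0.0]
    n = len(tokens)
    mean_length = sum(len(token) for token in tokens) / n
    long_share = sum(len(token) > 3 for token in tokens) / n
    return [n, mean_length, long_share]


# Toy short texts, invented for this example.
texts = ["great phone love it", "battery died again", "love the battery life"]

vocabulary, bow = bag_of_words(texts)
profiles = [surface_features(text) for text in texts]

print(len(vocabulary), "Bag-of-Words dimensions for", len(texts), "texts")
print(len(profiles[0]), "dimensions in the fixed-length profile")
```

In this sketch every new distinct word adds a Bag-of-Words dimension, and most entries of each vector are zero for short texts, whereas the fixed-length profile keeps the same size regardless of corpus vocabulary.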
Supported by CNPq (research grant PQ 303356/2022-7); CAPES (projects STIC-AmSud (CAMA) No. 88881.694458/2022-01 and Mackenzie-PrInt No. 88887.310281/2018-00); and FAPESP (grant 2021/11905-0).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ferraria, M.A., Balbi, P.P., de Castro, L.N. (2025). An Analysis of Different Text Representation Schemes for an Immune Clustering Algorithm. In: Chinthaginjala, R., Sitek, P., Min-Allah, N., Matsui, K., Ossowski, S., Rodríguez, S. (eds) Distributed Computing and Artificial Intelligence, 21st International Conference. DCAI 2024. Lecture Notes in Networks and Systems, vol 1259. Springer, Cham. https://doi.org/10.1007/978-3-031-82073-1_25
DOI: https://doi.org/10.1007/978-3-031-82073-1_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-82072-4
Online ISBN: 978-3-031-82073-1
eBook Packages: Intelligent Technologies and Robotics (R0)