An Analysis of Different Text Representation Schemes for an Immune Clustering Algorithm

Ferraria, Matheus A.; Balbi, Pedro P.; de Castro, Leandro N.

doi:10.1007/978-3-031-82073-1_25

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 1259)

Included in the following conference series:

International Symposium on Distributed Computing and Artificial Intelligence

Abstract

This research investigates the challenges and effectiveness of three families of text representation (standard vector, grammar-based, and distributed) when applied to clustering short texts. The study explores Bag-of-Words as the standard vector representation; Linguistic Inquiry and Word Count (LIWC), Part-of-Speech Tagging (POS-Tagging), and the Medical Research Council Psycholinguistic Database (MRC) as grammar-based representations; and Word2Vec, fastText, Doc2Vec, and SentenceBERT as distributed representations. With the aiNet bio-inspired algorithm used for clustering, the results reveal surprising findings: grammar-based representations demonstrate competitive performance despite their simplicity, while standard vectors exhibit known challenges such as high dimensionality. The study contributes insights into the properties of different text representations, providing a foundation for optimizing their application in clustering tasks with short and informal texts.
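To illustrate the high-dimensionality issue mentioned for standard vectors, the following is a minimal Bag-of-Words sketch in plain Python. It is not the authors' implementation: the `bag_of_words` helper, the whitespace tokenization, and the three example texts are assumptions introduced here for illustration only.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count vectors) for lowercased, whitespace-tokenized texts.

    Hypothetical helper for illustration; the paper's preprocessing is not specified here.
    """
    tokenized = [text.lower().split() for text in texts]
    # One dimension per distinct token seen anywhere in the collection.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical short, informal texts (not taken from the paper's data sets).
texts = ["great phone love it", "battery died again", "love the battery life"]
vocabulary, vectors = bag_of_words(texts)

dimensions = len(vocabulary)  # grows with every new distinct word
nonzero = sum(1 for vec in vectors for value in vec if value)
sparsity = 1 - nonzero / (dimensions * len(vectors))  # fraction of zero entries

print(dimensions, round(sparsity, 3))
```

Even in this toy collection, each text fills only a few of the available dimensions, and the vocabulary keeps growing as more texts are added, which is the behaviour usually meant by high dimensionality and sparsity in short-text settings.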

Supported by CNPq for the research grant PQ 303356/2022-7; CAPES for the projects STIC-AmSud (CAMA) No. 88881.694458/2022-01; Mackenzie-PrInt No. 88887.310281/2018-00; and FAPESP for grant 2021/11905-0.

Author information

Authors and Affiliations

FCI - PPGEEC - Programa de Pós-Graduação em Engenharia Elétrica e Computação, Universidade Presbiteriana Mackenzie, São Paulo, SP, Brazil
Matheus A. Ferraria & Pedro P. Balbi
Graduate Program in Technology, School of Technology, State University of Campinas (Unicamp), Limeira, São Paulo, SP, Brazil
Leandro N. de Castro
Human-Centered Artificial Intelligence and Natural Computing Research Group, Florida Gulf Coast University, Fort Myers, FL, 33965, USA
Leandro N. de Castro

Corresponding author

Correspondence to Matheus A. Ferraria.

Editor information

Editors and Affiliations

Vellore Institute of Technology University, Vellore, Tamil Nadu, India
Ravikumar Chinthaginjala
Kielce University of Technology, Kielce, Poland
Pawel Sitek
Department of Computer Science, Imam Abdulrahman Bin Faisal University (KSA), Dammam, Saudi Arabia
Nasro Min-Allah
Osaka Institute of Technology, Osaka, Japan
Kenji Matsui
University Rey Juan Carlos, Madrid, Spain
Sascha Ossowski
Biotechnology, Intelligent Systems and Educational Technology (BISITE) Research Group, University of Salamanca, Salamanca, Spain
Sara Rodríguez

About this paper

Cite this paper

Ferraria, M.A., Balbi, P.P., de Castro, L.N. (2025). An Analysis of Different Text Representation Schemes for an Immune Clustering Algorithm. In: Chinthaginjala, R., Sitek, P., Min-Allah, N., Matsui, K., Ossowski, S., Rodríguez, S. (eds) Distributed Computing and Artificial Intelligence, 21st International Conference. DCAI 2024. Lecture Notes in Networks and Systems, vol 1259. Springer, Cham. https://doi.org/10.1007/978-3-031-82073-1_25

DOI: https://doi.org/10.1007/978-3-031-82073-1_25
Published: 18 February 2025
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-82072-4
Online ISBN: 978-3-031-82073-1
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
