In this work, we compare two approaches to automating text classification for analyzing the country image of Mexico in selected data sources: a machine learning framework based on a support vector machine (SVM) with fastText embeddings, and a Deep Learning framework that fine-tunes Large Language Models (LLMs) such as Bidirectional Encoder Representations from Transformers (BERT), DistilBERT, and Twitter roBERTa Base. The country image is described using 18 classes based on International Relations theory. To train each model, we use a data set of tweets from selected relevant Twitter accounts and news headlines from The New York Times, all of which were first classified manually. However, the data set suffers from class imbalance and scarce data. We therefore explore a series of text augmentation techniques: gradual augmentation of the eight least represented classes and a uniform augmentation of the whole data set. We also study the impact of hashtags, user names, stopwords, and emojis as additional text features for the SVM model. The experiments indicate that the SVM reacts negatively to all the data augmentation proposals, while the Deep Learning framework shows small benefits from them. The best result, a weighted-average \(F_1\) score of 52.92%, is obtained by fine-tuning the Twitter roBERTa Base model without data augmentation.

Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
The authors would like to thank Universidad Iberoamericana Ciudad de México and Instituto de Investigación Aplicada y Tecnología for their support and for providing access to the Research Laboratory in Advanced Computer Technologies (LITAC).
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the “Instituto de Investigación Aplicada y Tecnología” and the “Universidad Iberoamericana Ciudad de México”.
LNZ-M: writing original draft, investigation, methodology, programming, data visualization. JAG-O: writing review and editing, methodology, supervision. JEQ-I: writing review and editing, methodology, supervision. CVR: writing review and editing, investigation, supervision.
The authors declare that there is no conflict of interest.
Authors’ Google Scholar URLs: Luis N. Zúñiga-Morales, Jorge Ángel González-Ordiano, J. Emilio Quiroz-Ibarra, César Villanueva Rivas.
Appendix: Classic framework train results
This appendix shows the results obtained during the training step of the Classic Framework in the data augmentation experiments. As shown in Table 4, the training results indicate overfitting in the SVM model as the amount of synthetic data increases. The worst case is the AMG All experiment, where all training metrics are above 98% but the evaluation results fall below 46%, as shown in Table 2. This gap between training and evaluation performance further highlights the negative impact of the proposed data augmentation scheme on the SVM model.
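The train-versus-evaluation comparison described in this appendix can be illustrated with a minimal sketch that computes the weighted-average \(F_1\) score on both splits and reports the difference. The function names `weighted_f1` and `overfitting_gap`, and the toy labels `a`, `b`, `c`, are hypothetical and chosen only for illustration; they are not the paper's code, classes, or data, and the numbers produced are unrelated to Tables 2 and 4.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp  # true instances of this class that were missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def overfitting_gap(train_true, train_pred, eval_true, eval_pred):
    """Training score minus evaluation score; a large positive gap suggests overfitting."""
    return weighted_f1(train_true, train_pred) - weighted_f1(eval_true, eval_pred)


# Toy labels (assumption, not the paper's data): perfect fit on the
# training split, poor generalization on the evaluation split.
train_true = ["a", "a", "b", "c"]
train_pred = ["a", "a", "b", "c"]
eval_true = ["a", "a", "b", "c"]
eval_pred = ["a", "b", "b", "b"]

gap = overfitting_gap(train_true, train_pred, eval_true, eval_pred)
print(f"train F1 = {weighted_f1(train_true, train_pred):.3f}")
print(f"eval  F1 = {weighted_f1(eval_true, eval_pred):.3f}")
print(f"gap      = {gap:.3f}")
```

In this toy setting the model reproduces its training labels exactly while misclassifying most evaluation examples, which is the same qualitative pattern the appendix reports for the SVM under heavy augmentation.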
Zúñiga-Morales, L.N., González-Ordiano, J.Á., Quiroz-Ibarra, J.E. et al. Machine learning framework for country image analysis. J Comput Soc Sc 7, 523–547 (2024). https://doi.org/10.1007/s42001-023-00246-3