Construction of a training dataset for a sentiment analysis model of dairy products tweets in Brazil

da Silva Nogueira, Thallys; Siqueira, Kennya Beatriz; Goliatt, Priscila Vanessa Zabala Capriles

doi:10.1007/s13278-024-01254-5

Original Article
Published: 15 April 2024

Volume 14, article number 85, (2024)
Social Network Analysis and Mining

Thallys da Silva Nogueira¹,
Kennya Beatriz Siqueira² &
Priscila Vanessa Zabala Capriles Goliatt¹

Abstract

Creating task-specific datasets for machine learning models is a frequent and challenging task that requires considerable effort in collecting samples and maintaining a balanced representation of each class. In this study, our objective was to create a training dataset for a sentiment analysis model by combining the results of five natural language processing tools through three distinct approaches, in order to automatically label tweets as negative, neutral, or positive. Additionally, we applied data balancing techniques to assess how different methods affect the ability of sentiment analysis models to generalize to previously unseen samples. The results demonstrated that the three approaches to combining tool results, when paired with balancing techniques, yielded significantly better outcomes than models trained on imbalanced datasets. These improvements enabled sentiment analysis models to achieve greater precision and generalization capacity on novel samples. These findings underscore the importance of effective data balancing strategies when creating training datasets for machine learning applications, especially in tasks sensitive to class imbalance, such as sentiment analysis. Such an approach is important for improving the performance and applicability of sentiment analysis models in real-world scenarios, supporting more precise analyses that yield insights for digital marketing.
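The abstract does not say how the tool outputs were combined or which balancing methods were applied, so the following is only a minimal sketch under stated assumptions. It assumes each tool emits one of the three class labels per tweet, combines them by simple majority vote (with ties falling back to neutral), and balances the labeled set by plain random oversampling. The function names `majority_label` and `random_oversample` are hypothetical, and neither step should be read as the authors' actual procedure.

```python
import random
from collections import Counter


def majority_label(votes, fallback="neutral"):
    """Return the label most tools agree on (assumed combination rule).

    If the two most frequent labels are tied, return `fallback` instead.
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

In use, each tweet's five tool outputs would be passed to `majority_label`, and the resulting tweet and label lists to `random_oversample` before training. Random oversampling only repeats existing samples, so it is a simpler stand-in for synthetic methods and may encourage overfitting on small minority classes.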

Notes

Dairy Drinks, Sour Cream, Dulce de Leche, Yogurt, Milk, Condensed Milk, Fermented Milk, Butter, Cheese, and Ice Cream.

Author information

Authors and Affiliations

Programa de Pós-Graduação Em Modelagem Computacional, Universidade Federal de Juiz de Fora, Juiz de Fora, Minas Gerais, Brasil
Thallys da Silva Nogueira & Priscila Vanessa Zabala Capriles Goliatt
Empresa Brasileira de Pesquisa Agropecuária, Embrapa Gado de Leite, Juiz de Fora, Minas Gerais, Brasil
Kennya Beatriz Siqueira

Contributions

All authors actively participated in the manuscript review. Kennya Beatriz Siqueira and Priscila Vanessa Zabala Capriles Goliatt made significant contributions to the review and organization of the text. The implementation and writing of the text were carried out by Thallys da Silva Nogueira.

Corresponding author

Correspondence to Thallys da Silva Nogueira.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

da Silva Nogueira, T., Siqueira, K.B. & Goliatt, P.V.Z.C. Construction of a training dataset for a sentiment analysis model of dairy products tweets in Brazil. Soc. Netw. Anal. Min. 14, 85 (2024). https://doi.org/10.1007/s13278-024-01254-5

Received: 22 November 2023
Accepted: 25 March 2024
Published: 15 April 2024
DOI: https://doi.org/10.1007/s13278-024-01254-5
