Abstract
User-generated content on social media platforms has reached big data scale. Sentiment analysis of this data offers valuable insights in many domains. However, real-world data is often class-imbalanced, which can harm a model's ability to generalize because the model overfits the majority class. An efficient model that handles varied imbalance scenarios is therefore needed in practice. This work proposes several models built on a study of how data quality and transfer learning through pre-trained embeddings affect minority class detection. The proposed models are tested on imbalanced datasets related to social media and education. The experimental results highlight the effectiveness of Word2vec, GloVe, and fastText embeddings with preprocessed data. In contrast, BERT embeddings give better results with data that has not been preprocessed. Furthermore, the best-performing model from this study outperforms other methods, with notable improvements.
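To make the class imbalance problem concrete, the following minimal sketch shows how overall accuracy can hide a complete failure on the minority class. It is an illustration only: the labels, the 90/10 split, and the always-majority predictor are hypothetical and are not taken from the study's datasets or models, and the helper names `per_class_metrics`, `accuracy`, and `macro_f1` are introduced here for the example.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    labels = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against division by zero when a class is never predicted.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(metrics):
    """Unweighted mean of per-class F1, so each class counts equally."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


# Hypothetical imbalanced sample: 90 positive reviews, 10 negative ones.
y_true = ["pos"] * 90 + ["neg"] * 10
# Assumed worst case: a model that has overfit the majority class.
y_pred = ["pos"] * 100

acc = accuracy(y_true, y_pred)
metrics = per_class_metrics(y_true, y_pred)

print(f"accuracy = {acc:.3f}")
for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"macro F1 = {macro_f1(metrics):.3f}")
```

In this toy setting accuracy looks high while recall on the minority class is zero, which is why per-class and macro-averaged scores are usually more informative than accuracy when classes are imbalanced.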
Data availability
The datasets used during the current study are publicly available online at https://data.mendeley.com/datasets/6ndwt6s5ry/1 and https://www.kaggle.com/datasets/septa97/100k-courseras-course-reviews-dataset.
Funding
The authors gratefully acknowledge financial support from “La Direction Générale de la Recherche Scientifique et du Développement Technologique (DGRSDT)” of Algeria.
Author information
Contributions
Conceptualization: A.S.; Methodology: A.S.; Writing—original draft preparation: A.S.; Supervision: F.Z.L.; Supervision: O.K.; Review and editing: H.S.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Siagh, A., Laallam, F.Z., Kazar, O. et al. An improved sentiment classification model based on data quality and word embeddings. J Supercomput 79, 11871–11894 (2023). https://doi.org/10.1007/s11227-023-05099-1
DOI: https://doi.org/10.1007/s11227-023-05099-1