
An improved sentiment classification model based on data quality and word embeddings

The Journal of Supercomputing

Abstract

User-generated content on social media platforms has reached big-data scale, and sentiment analysis of this data can yield valuable insights in many domains. However, real-world data are often class-imbalanced, which can harm a model's generalization ability because the model overfits the majority class. An efficient model that handles imbalanced data across scenarios is therefore needed in practice. To this end, this work proposes several models built on a study of how data quality and transfer learning through pre-trained embeddings affect minority class detection. The proposed models are tested on imbalanced datasets related to social media and education. The experimental results highlight the effectiveness of Word2vec, GloVe, and fastText embeddings with preprocessed data. In contrast, BERT embeddings yield better results with unpreprocessed data. Furthermore, the best-performing model from this study outperforms the other methods it is compared with, showing notable improvements.
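To make the class-imbalance problem concrete, the following minimal Python sketch shows how a classifier that always predicts the majority class can reach high accuracy while detecting no minority-class examples at all. The toy labels (90 "pos", 10 "neg") and the helper names `accuracy` and `per_class_metrics` are illustrative assumptions, not the datasets, metrics, or code used in the article.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical imbalanced toy data: 90 positive and 10 negative examples.
y_true = ["pos"] * 90 + ["neg"] * 10

# A degenerate "model" that always predicts the majority class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

acc = accuracy(y_true, y_pred)
neg_precision, neg_recall, neg_f1 = per_class_metrics(y_true, y_pred, "neg")

print(f"accuracy={acc:.2f}")
print(f"minority recall={neg_recall:.2f}, minority F1={neg_f1:.2f}")
```

In this toy setting the accuracy looks strong even though the minority class is never detected, which is why per-class recall and F1 are the more informative measures when the goal is minority class detection.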


Data availability

The datasets used during the current study are publicly available online at https://data.mendeley.com/datasets/6ndwt6s5ry/1 and https://www.kaggle.com/datasets/septa97/100k-courseras-course-reviews-dataset.

Notes

  1. https://www.kaggle.com/datasets/septa97/100k-courseras-course-reviews-dataset.

  2. https://code.google.com/archive/p/word2vec/.

  3. https://fasttext.cc/docs/en/crawl-vectors.html.

  4. https://huggingface.co/.

  5. https://github.com/google-research/bert.

Funding

The authors gratefully acknowledge financial support from “La Direction Générale de la Recherche Scientifique et du Développement Technologique (DGRSDT)” of Algeria.

Author information

Contributions

Conceptualization: A.S.; Methodology: A.S.; Writing—original draft preparation: A.S.; Supervision: F.Z.L.; Supervision: O.K.; Review and editing: H.S.

Corresponding author

Correspondence to Asma Siagh.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Siagh, A., Laallam, F.Z., Kazar, O. et al. An improved sentiment classification model based on data quality and word embeddings. J Supercomput 79, 11871–11894 (2023). https://doi.org/10.1007/s11227-023-05099-1


  • DOI: https://doi.org/10.1007/s11227-023-05099-1
