On the importance of pre-processing in small-scale analyses of twitter: a case study of the 2019 Indian general election

Multimedia Tools and Applications

Abstract

The main purpose of this paper is to emphasize the role of data pre-processing in the sentiment analysis of Twitter data. The paper provides a detailed analysis and methods for understanding and handling Twitter data when analyzing public views during elections. We argue that, to assess public opinion towards a political party or leader accurately, the analysis should focus on users’ personal tweets rather than tweets from news or media sources. We also argue that emojis, punctuation, stopwords, emphasized words, and certain specific regions inside tweets (Unicode, #, @) play a very significant role in sentiment analysis. In view of this, the paper provides a novel set of pre-processing steps that filter and clean tweets without losing any vital information. For experimentation, we take a small case study comprising 258,891 instances related to the 2019 Indian General Election, collected from Twitter using #LoksabhaElection2019. A pre-trained sentiment analysis model, twitter-xlm-roberta-base-sentiment, is used to analyze the sentiment of public tweets. Results show that tweets from media sources and the specific regions of tweets inject data bias and affect the final sentiment analysis results. We found that only 40% of the collected tweets were useful for determining public sentiment for election analysis, while the rest were irrelevant media tweets. An increase in negative and neutral sentiment outputs is also observed due to the presence of media tweets and the specific regions. Further, an exploratory analysis examines public sentiment towards various political terms inferred using top2vec topic modeling.
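As a rough illustration of the kind of pipeline the abstract describes, the Python sketch below drops tweets from media accounts, cleans the remaining text, and reports the share of each sentiment label. It is not the authors' algorithm: the handle list `MEDIA_HANDLES`, the tweet dictionary layout (`user`, `text`), the cleaning rules, and the `classify` callback are all assumptions made for illustration. In practice `classify` could wrap a pre-trained model such as twitter-xlm-roberta-base-sentiment.

```python
import re
from collections import Counter

# Hypothetical handles standing in for news/media accounts (not from the paper).
MEDIA_HANDLES = {"example_news", "example_tv"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
SPACE_RE = re.compile(r"\s+")


def is_media(tweet):
    """Flag a tweet whose author handle is in the assumed media list."""
    return tweet["user"].lower() in MEDIA_HANDLES


def clean(text):
    """Drop URLs and @mentions, keep hashtag words, emojis, and punctuation."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # keep the word, drop the '#'
    return SPACE_RE.sub(" ", text).strip()


def sentiment_shares(tweets, classify):
    """Return the share of each label over cleaned, non-media tweets."""
    kept = [clean(t["text"]) for t in tweets if not is_media(t)]
    counts = Counter(classify(text) for text in kept)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


if __name__ == "__main__":
    sample = [
        {"user": "example_news", "text": "Breaking: polling begins"},
        {"user": "voter_a", "text": "good turnout today 🙂 #LoksabhaElection2019"},
        {"user": "voter_b", "text": "long queues again @someone"},
    ]
    # Toy stand-in classifier; a real run would call a sentiment model here.
    toy = lambda text: "positive" if "good" in text else "negative"
    print(sentiment_shares(sample, toy))
```

Keeping the classifier as a parameter lets the same filtering and cleaning steps be compared with and without media tweets, which is the comparison the abstract reports.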

Data availability

The dataset generated during and/or analyzed during the current study is available from the corresponding author on reasonable request.

Notes

  1. https://textblob.readthedocs.io/en/dev/

  2. https://github.com/cjhutto/vaderSentiment

  3. http://sentistrength.wlv.ac.uk/

  4. http://wndomains.fbk.eu/wnaffect.html

  5. https://www.liwc.app/

  6. https://osf.io/y6g5b/wiki/anew/

  7. https://unicode.org/emoji/charts-14.0/full-emoji-list.html

  8. https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment

  9. https://about.twitter.com/en

  10. https://boostlabs.com/blog/what-are-word-clouds-value-simple-visualizations/

  11. https://developer.twitter.com/en/products/twitter-api/academic-research


Funding

No funds, grants, or other support were received.

Author information

Corresponding author

Correspondence to Priyavrat Chauhan.

Ethics declarations

Conflicts of interest/competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chauhan, P., Sharma, N. & Sikka, G. On the importance of pre-processing in small-scale analyses of twitter: a case study of the 2019 Indian general election. Multimed Tools Appl 83, 19219–19258 (2024). https://doi.org/10.1007/s11042-023-16158-3

  • DOI: https://doi.org/10.1007/s11042-023-16158-3
