Abstract
The rapid growth of social media has led to gigabytes of information being shared by billions of users worldwide. Researchers widely use sentiment analysis to analyze this information and determine people's attitudes towards different events. Existing studies in Urdu sentiment analysis mostly use traditional n-gram features, which, unlike linguistic features, do not capture the contextual information being discussed. Moreover, no existing study classifies the sentiment of proverbs and idioms, which is challenging because they mostly do not contain sentiment words yet carry strong sentiment. This study exploits linguistic features of the Urdu language for sentence-level sentiment analysis and classifies idioms and proverbs using classical machine learning techniques. We develop a dataset comprising idioms, proverbs, and sentences from the news domain, and extract part-of-speech tag-based features, Boolean features, and numeric features from the dataset after careful linguistic analysis of the Urdu language. Experimental results show that the J48 classifier performs best in sentiment classification, with an accuracy of 90% and an F-measure of 88%.
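
As a rough illustration of the three feature families named in the abstract, the sketch below turns a part-of-speech-tagged sentence into tag-count, Boolean, and numeric features, and computes an F-measure for one class. The tag names (`JJ`, `RB`, `NN`, `VB`, `NEG`), the specific features, and the function names are assumptions made for this example; the abstract does not list the paper's actual feature inventory or tag set.

```python
from collections import Counter

# Hypothetical tag inventory for illustration only; not taken from the paper.
COUNTED_TAGS = ("JJ", "RB", "NN", "VB")
NEGATION_TAG = "NEG"


def extract_features(tagged_sentence):
    """Map a list of (token, pos_tag) pairs to a flat feature dictionary."""
    tags = [tag for _, tag in tagged_sentence]
    counts = Counter(tags)
    length = len(tags)

    features = {}
    # POS tag-based features: how often each tag of interest occurs.
    for tag in COUNTED_TAGS:
        features[f"count_{tag}"] = counts[tag]
    # Boolean features: presence or absence of a cue.
    features["has_negation"] = counts[NEGATION_TAG] > 0
    features["has_adjective"] = counts["JJ"] > 0
    # Numeric features: sentence length and the share of adjectives.
    features["length"] = length
    features["adjective_ratio"] = counts["JJ"] / length if length else 0.0
    return features


def f_measure(gold, predicted, positive):
    """Harmonic mean of precision and recall for the class `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Placeholder tokens stand in for Urdu words.
example = [("w1", "JJ"), ("w2", "NN"), ("w3", "NEG"), ("w4", "VB")]
features = extract_features(example)
score = f_measure(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "pos"], "pos")
```

Feature dictionaries of this kind could then be passed to a decision-tree learner. J48 is Weka's implementation of C4.5, so other toolkits' decision trees would only approximate it.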

Data Availability
Not applicable.
Code Availability
Not applicable.
Funding
No grants, funds, or other support were received.
Author information
Contributions
• Amna Altaf: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Resources, Software, Writing (Original Draft).
• Muhammad Waqas Anwar: Visualization, Supervision, Project Administration, Funding Acquisition, Writing (Review and Editing), Investigation, Validation.
• Muhammad Hasan Jamal: Acquisition, Writing (Review and Editing), Investigation, Validation.
• Usama Ijaz Bajwa: Acquisition, Writing (Review and Editing), Investigation, Validation.
Ethics declarations
Competing Interests
We have no financial or personal relationships with other people or organizations.
Conflict of Interest
The authors declare no conflict of interest related to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Altaf, A., Anwar, M.W., Jamal, M.H. et al. Exploiting Linguistic Features for Effective Sentence-Level Sentiment Analysis in Urdu Language. Multimed Tools Appl 82, 41813–41839 (2023). https://doi.org/10.1007/s11042-023-15216-0
DOI: https://doi.org/10.1007/s11042-023-15216-0