Abstract
We define a new, fully automated, domain-independent method for building feature vectors from a Twitter text corpus for machine-learning sentiment analysis, based on a fuzzy thesaurus and sentiment replacement. The proposed method measures the semantic similarity of Tweets to the features in the feature space instead of using term-presence or term-frequency feature vectors. It therefore accounts for the sentiment of the context rather than just counting sentiment words. We use sentiment replacement to reduce the dimensionality of the feature space and a fuzzy thesaurus to incorporate semantics. Experimental results show that sentiment replacement reduces the dimensionality of the feature space by up to 35%. Moreover, feature vectors built with the fuzzy thesaurus improve sentiment classification performance with multinomial naïve Bayes and support vector machine classifiers, which reach accuracies of 83% and 85%, respectively, on the Stanford testing dataset. Incorporating the fuzzy thesaurus gave the best accuracy, an increase of more than 3% over the baselines. Comparable results were obtained with a larger dataset, STS-Gold, indicating the robustness of the proposed method. Furthermore, a comparison with previous work shows that the proposed method outperforms other methods reported in the literature on the same benchmark data.
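As a rough illustration of the two ideas named in the abstract, the sketch below replaces sentiment words with class tokens and then scores each feature by its fuzzy similarity to a Tweet. It is not the authors' implementation. It assumes the standard fuzzy set information retrieval model, in which the correlation of two terms is the number of documents containing both divided by the number containing either, and the membership of a feature in a Tweet is one minus the product of (1 - correlation) over the Tweet's words. The function names, the toy corpus, and the `POS`/`NEG` lexicon are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def replace_sentiment(tokens, lexicon):
    """Map known sentiment words to a shared class token (assumed scheme)."""
    return [lexicon.get(t, t) for t in tokens]


def build_thesaurus(docs):
    """Return a term-term correlation function built from co-occurrence counts."""
    term_count = Counter()  # number of documents containing each term
    pair_count = Counter()  # number of documents containing both terms of a pair
    for doc in docs:
        terms = set(doc)
        term_count.update(terms)
        pair_count.update(frozenset(p) for p in combinations(sorted(terms), 2))

    def corr(a, b):
        if a == b:
            return 1.0 if term_count[a] else 0.0
        both = pair_count[frozenset((a, b))]
        either = term_count[a] + term_count[b] - both
        return both / either if either else 0.0

    return corr


def fuzzy_vector(tweet, features, corr):
    """Membership of each feature in the tweet: 1 - prod(1 - corr(feature, word))."""
    vec = []
    for f in features:
        miss = 1.0
        for w in tweet:
            miss *= 1.0 - corr(f, w)
        vec.append(1.0 - miss)
    return vec


# Hypothetical toy data, for illustration only.
lexicon = {"good": "POS", "great": "POS", "bad": "NEG"}
docs = [["good", "movie"], ["good", "film"], ["bad", "movie"]]
corr = build_thesaurus(docs)
vector = fuzzy_vector(["good", "film"], ["movie", "bad"], corr)
```

In this toy run the feature `movie` receives a non-zero weight for the Tweet `["good", "film"]` even though the word does not occur in it, because `good` and `movie` co-occur elsewhere in the corpus. A presence or frequency vector would assign it zero.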
Notes
STS-Gold dataset can be requested from the authors at: http://kmi.open.ac.uk/people/member/hassan-saif
Stanford dataset official page: http://help.sentiment140.com/for-students
Stanford testing and training datasets can be downloaded from: https://docs.google.com/file/d/0B04GJPshIjmPRnZManQwWEdTZjg/edit
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Research involving human participants and/or animals
Not applicable.
Informed consent
Not applicable.
Additional information
Communicated by S. Deb, T. Hanne, K.C. Wong.
Cite this article
Ismail, H.M., Belkhouche, B. & Zaki, N. Semantic Twitter sentiment analysis based on a fuzzy thesaurus. Soft Comput 22, 6011–6024 (2018). https://doi.org/10.1007/s00500-017-2994-8
DOI: https://doi.org/10.1007/s00500-017-2994-8