A novel unsupervised ensemble framework using concept-based linguistic methods and machine learning for Twitter sentiment analysis
Introduction
The explosion of user-generated content (UGC) has created an opportunity to discover the sentiments associated with it automatically. The term “sentiment” denotes a positive or negative opinion, emotion, feeling, or thought expressed by a sentiment holder (user). Sentiment analysis examines textual features to extract these sentiments automatically at the word, sentence, or document level. It is now applied in diverse fields, including public-health monitoring [1], election trends [2], prediction of terrorist activities [3], and social network analysis [4].
Social networks provide online platforms that emulate social relationships between people. Twitter is a well-known microblogging platform that allows users to post short real-time messages (limited to 280 characters), called Tweets, about personal and social issues. More than 1 billion new Tweets are posted on Twitter every three days [5]. Researchers have widely explored Twitter data to address diverse research issues, e.g., sentiment analysis [4], [6], [7]. Sentiment analysis of Twitter data is a challenging problem in human computing: because of the 280-character limit, people write in informal language, which makes it difficult to uncover the underlying sentiment of Tweets [6]. Automatic intelligent techniques are therefore crucial for Twitter sentiment analysis. Twitter sentiment analysis is important for many reasons, such as identifying highly valued customers’ opinions on different products and services. It can also highlight a broad range of diseases such as pandemics, election trends including potential candidates, and negative campaigning. Similarly, it can help improve education policies by monitoring students’ performance.
Bag of words (BOWs) is a popular feature-extraction method in natural language processing, used in domains such as sentiment analysis [8] and disease surveillance systems [9]. However, the literature has identified the limited capability of BOWs to extract the underlying semantics of text and recommends the use of Bag of Concepts (BOCs) instead [10]. The BOCs representation is a major shift from the BOWs approach: it performs concept-based sentiment analysis (CBSA) by utilizing the semantic meaning of natural language opinions/text [10]. Concept-based sentiment analysis methods are unsupervised in the sense that pre-labeled data is not mandatory. SenticNet [11], [12] and linguistic rules [13] were developed as part of these methods. Relevant studies have revealed that these approaches cannot assign a sentiment polarity to all kinds of text because their knowledge bases are not rich enough [10], [13]. Researchers have therefore ensembled other techniques with CBSA methods. Among ML techniques, different classifiers have been integrated with unsupervised concept-based sentiment methods for sentiment prediction [10], [14].
The challenge in using classifiers is that the training process requires pre-labeled data. Manually labeling a large amount of unlabeled data is cumbersome, and the labeling process may be prolonged further by the time constraints of domain experts. Unsupervised (clustering) approaches, in contrast, do not require pre-labeled data. These methods accept unlabeled data and generate clusters of similar data instances.
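As an illustration of how clustering groups unlabeled instances, the following is a minimal pure-Python sketch of agglomerative hierarchical clustering with single, complete, and average linkage. The function name `agglomerative`, the use of Euclidean distance, and the toy points are assumptions made for this example, not details taken from the paper; mapping the resulting clusters to sentiment polarities is not shown.

```python
import math


def agglomerative(points, n_clusters=2, linkage="average"):
    """Naive agglomerative clustering; returns one cluster id per point.

    Illustrative only: Euclidean distance is an assumption of this sketch.
    """

    def cluster_distance(c1, c2):
        # All pairwise distances between members of the two clusters.
        pairwise = [math.dist(points[a], points[b]) for a in c1 for b in c2]
        if linkage == "single":      # closest pair of members
            return min(pairwise)
        if linkage == "complete":    # farthest pair of members
            return max(pairwise)
        if linkage == "average":     # mean over all pairs
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unknown linkage: {linkage}")

    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i (j > i, so index i stays valid).
        clusters[i].extend(clusters.pop(j))

    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for m in members:
            labels[m] = cluster_id
    return labels


toy_points = [[0, 0], [0, 1], [5, 5], [5, 6]]
toy_labels = agglomerative(toy_points, n_clusters=2, linkage="single")
```

No labels are supplied at any point: the grouping follows from the distances alone, which is what makes the approach usable when annotated Tweets are unavailable.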
In this paper, we propose a novel unsupervised ensemble framework based on concept-based sentiment analysis methods and hierarchical clustering to perform Twitter sentiment analysis, as shown in Fig. 1. In the proposed framework, both methods work in an unsupervised fashion. Hierarchical clustering has not previously been integrated with concept-based methods. The concept-based analysis module first classifies Tweets using a) a majority voting mechanism, b) tie-breakers based on intensity ranking, and c) linguistic patterns. To the best of our knowledge, concept-based sentiment analysis has not been investigated in this manner before. Tweets that this module cannot classify are then delegated to three popular agglomerative hierarchical clustering algorithms: single-linkage (SL), complete-linkage (CL), and average-linkage (AL). These methods have already been employed in recent relevant studies [15], [16], [17]. We also perform a comparative analysis with previously investigated classifiers, i.e., Naive Bayes and Neural Network. An empirical study is performed on four English-language Twitter datasets. Accuracy is used to evaluate the performance of the proposed unsupervised framework in predicting the polarity of Tweets. Unigrams are used for feature extraction, and Boolean and TF-IDF methods are used to represent the features of delegated Tweets.
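The voting and delegation step described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: it assumes each Tweet has already been reduced to a list of per-concept polarity scores in [-1, 1], it interprets the intensity tie-breaker as a comparison of summed intensities, and it omits linguistic patterns entirely. The names `vote_polarity` and `split_tweets` are invented for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def vote_polarity(concept_scores: Sequence[float]) -> Optional[str]:
    """Label a Tweet from per-concept polarity scores (assumed in [-1, 1]).

    Hypothetical scheme: majority vote on the sign of each score, then an
    intensity tie-breaker. None means the Tweet could not be classified.
    """
    positives = [s for s in concept_scores if s > 0]
    negatives = [s for s in concept_scores if s < 0]

    # Majority vote on the number of positive vs. negative concepts.
    if len(positives) != len(negatives):
        return "positive" if len(positives) > len(negatives) else "negative"

    # Tie (or no polar concepts at all): compare total intensity.
    pos_intensity = sum(positives)
    neg_intensity = -sum(negatives)
    if pos_intensity > neg_intensity:
        return "positive"
    if neg_intensity > pos_intensity:
        return "negative"
    return None


def split_tweets(
    scored: Dict[str, Sequence[float]],
) -> Tuple[Dict[str, str], List[str]]:
    """Separate Tweets the module can label from those to be delegated."""
    labeled: Dict[str, str] = {}
    delegated: List[str] = []
    for tweet_id, scores in scored.items():
        label = vote_polarity(scores)
        if label is None:
            delegated.append(tweet_id)  # handed to the clustering stage
        else:
            labeled[tweet_id] = label
    return labeled, delegated


labeled, delegated = split_tweets({
    "t1": [0.8, 0.3, -0.9],  # two positive concepts vs. one negative
    "t2": [0.2, -0.7],       # tie on counts, negative is more intense
    "t3": [],                # no known concepts
})
```

In this toy run `t1` and `t2` receive labels from the module, while `t3` has no usable concepts and would be passed on to the clustering algorithms.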
The main contributions of this research work are as follows:
- It proposes an unsupervised ensemble/cooperative framework built on concept-based methods and agglomerative hierarchical clustering for Twitter sentiment analysis.
- It presents a performance-based comparative analysis of clustering and classification when integrated with concept-based methods.
- It analyzes the performance of each individual technique under study.
- It employs majority voting, tie-breaker criteria, and linguistic rules in the concept-based sentiment analysis module.
- It also presents the performance of the feature representation methods (Boolean and TF-IDF).
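To make the difference between the two feature representations concrete, here is a small sketch that builds Boolean and TF-IDF unigram vectors. TF-IDF has several variants; this example assumes raw term counts multiplied by `log(N / df)`, whitespace tokenization, and lowercasing, none of which is specified in this section. The function name `unigram_features` is hypothetical.

```python
import math
from collections import Counter


def unigram_features(tweets, scheme="boolean"):
    """Return (vocabulary, rows) for a list of Tweet strings.

    Assumed variant: tf = raw count, idf = log(N / document frequency).
    """
    docs = [t.lower().split() for t in tweets]  # naive whitespace tokens
    vocab = sorted({w for d in docs for w in d})
    # Document frequency: number of Tweets that contain each unigram.
    df = Counter(w for d in docs for w in set(d))
    n_docs = len(docs)

    rows = []
    for d in docs:
        counts = Counter(d)
        if scheme == "boolean":
            # 1.0 if the unigram occurs in the Tweet, else 0.0.
            rows.append([1.0 if w in counts else 0.0 for w in vocab])
        elif scheme == "tfidf":
            # Term count weighted by how rare the unigram is overall.
            rows.append(
                [counts[w] * math.log(n_docs / df[w]) for w in vocab]
            )
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return vocab, rows


sample = ["good good phone", "bad phone"]
vocab, boolean_rows = unigram_features(sample, "boolean")
_, tfidf_rows = unigram_features(sample, "tfidf")
```

Under this variant, a unigram that appears in every Tweet (`phone` in the sample) gets a TF-IDF weight of zero, whereas the Boolean representation still marks it as present.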
Section snippets
Related works
This section presents the literature relevant to Twitter sentiment analysis, clustering algorithms, concept-based sentiment analysis, and feature representation methods.
In [18], Twitter sentiment analysis is performed on English-language Tweets about the COVID-19 pandemic. A logistic regression algorithm is used for experimentation, and better accuracy is reported. In another study [19], Twitter sentiment analysis is performed on twenty-two datasets. Different features are
Proposed ensemble unsupervised framework encompassing concept-based and clustering approaches
The proposed framework is shown in Fig. 1. To address the research contributions, the Twitter datasets are given as input to the concept-level sentiment analysis module after the necessary preprocessing. The module infers the sentiment label of Tweets and delegates the Tweets for which no sentiment label could be discovered to the clustering approaches under study. The classified Tweets from the concept-based module and the delegated Tweets from the clustering algorithms are combined and evaluated. For
Results and discussion
In this section, the experimental results for each contribution are presented in detail. The accuracy (%) of each participating technique is shown. The performance of the proposed unsupervised ensemble, based on the concept-based module and agglomerative hierarchical clustering, and of the earlier investigated classifiers is shown in Figs. 2–3 (average results in Table 6). For this purpose, the classified Tweets from the concept-based modules and the algorithms under study are combined and correct predictions
Conclusion and future work
The ultimate goal of this research is to present an alternative unsupervised framework that avoids the trade-off between the manual effort of labeling data for supervised techniques and the modest accuracy of unsupervised techniques when analyzing Twitter sentiment data.
To address the first contribution, three agglomerative hierarchical clustering algorithms (SL, CL, AL) are ensembled with concept-based methods. Concept-based methods extract BOC and apply majority voting to discover sentiment labels. To meet
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (32)
- et al. Sentic patterns: dependency-based rules for concept-level sentiment analysis. Knowl Based Syst (2014)
- et al. Anaphora and coreference resolution: a review. Information Fusion (2020)
- et al. Social media mining for public health monitoring and surveillance (2016)
- Twitter use in election campaigns: a systematic literature review. Journal of Information Technology and Politics (2016)
- et al. Information control and terrorism: tracking the Mumbai terrorist attack through Twitter. Information Systems Frontiers (2011)
- et al. Sentence-level emotion detection framework using rule-based classification. Cognit Comput (2017)
- et al. Twitter sentiment analysis: a bootstrap ensemble framework (2013)
- et al. Twitter sentiment classification using distant supervision. CS224N Project Report Stanford (2009)
- et al. Heuristic-assisted BERT for Twitter sentiment analysis. Int J Comput Intell Appl (2021)
- et al. Baselines and bigrams: Simple, good sentiment and topic classification (2012)
- Multimodal bag-of-words for cross domains sentiment analysis
- Sentic computing: a common-sense-based framework for concept-level sentiment analysis
- SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings
- SenticNet 6: Ensemble application of symbolic and subsymbolic AI for sentiment analysis
- Intelligent asset allocation via market sentiment views. Computational Intelligence Magazine
- Comparative study of single linkage, complete linkage, and ward method of agglomerative clustering