A novel unsupervised ensemble framework using concept-based linguistic methods and machine learning for twitter sentiment analysis

doi:10.1016/j.patrec.2022.04.004

Pattern Recognition Letters

Volume 158, June 2022, Pages 80-86

https://doi.org/10.1016/j.patrec.2022.04.004

Highlights

• A novel ensemble (cooperative) framework based on concept-based methods and clustering is proposed for Twitter sentiment analysis.
• It employs majority voting, tie-breaker criteria, and linguistic rules in the concept-based module.
• A comparative analysis of clustering and classification, each integrated with concept-based methods, is presented.
• The performance of two feature representation methods (Boolean and TF-IDF) is reported.
• Experimental results on Twitter datasets show better performance for the proposed framework.

Abstract

Concept-based sentiment analysis (CBSA) methods have gained prominence in natural language processing in recent years. These methods consider the underlying semantic meaning of text to perform tasks such as Twitter sentiment analysis (assigning a positive, negative, or neutral sentiment to Tweets). CBSA is superior to traditional statistical methods for accurately discovering sentiment labels. However, because their knowledge bases are limited, CBSA methods cannot identify the sentiment polarity of all kinds of text. Supervised learning techniques are therefore usually ensembled with CBSA methods so that all of the text can be classified. These techniques require labeled data, and manually labeling large datasets (such as Twitter datasets) is tedious and time-consuming. An unsupervised learning mechanism can therefore be a better alternative. In this paper, a novel unsupervised learning framework based on concept-based methods and hierarchical clustering is proposed for Twitter sentiment analysis. Popular hierarchical clustering methods, namely the single-linkage, complete-linkage, and average-linkage algorithms, are ensembled serially. Two feature representation methods, Boolean and term frequency-inverse document frequency (TF-IDF), are investigated. We have also experimented with well-known classifiers (Naive Bayes, Neural Network) for a fair comparison. Accuracy (the proportion of correct predictions) is used to evaluate the performance of the techniques under study. It is empirically shown that the performance of unsupervised learning techniques is comparable with that of supervised learning techniques.

Introduction

The explosion of user-generated content (UGC) has created the opportunity to automatically discover the sentiments associated with it. The term “sentiment” represents a positive/negative opinion, emotion, feeling, or thought expressed by a sentiment holder (user). Sentiment analysis examines textual features to automatically extract these sentiments at the word, sentence, or document level. It is now popular in diverse fields, including public-health monitoring [1], election trends [2], prediction of terrorism activities [3], and social network analysis [4].

Social networks provide online platforms that emulate social relationships between people. Twitter is a well-known microblogging platform that allows users to post short real-time messages (limited to 280 characters), called Tweets, about personal and social issues. More than 1 billion new Tweets are posted on Twitter every three days [5]. Twitter data has been widely explored by researchers to address diverse research issues, e.g., sentiment analysis [4], [6], [7]. Sentiment analysis of Twitter data is a challenging problem in human computing: the 280-character limit and the informal language used by people make it difficult to uncover the underlying sentiment of Tweets [6]. It is therefore crucial to use automatic intelligent techniques to perform Twitter sentiment analysis. Twitter sentiment analysis is important for many reasons, such as identifying highly valued customers’ opinions about different products and services. A broad range of topics, such as diseases and pandemics, election trends including potential candidates, and negative campaigning, can also be highlighted through Twitter sentiment analysis. Similarly, it can help to improve education policies by monitoring students’ performance.

Bag of words (BOW) is a popular feature extraction method in natural language processing, used in different domains, e.g., sentiment analysis [8] and disease surveillance systems [9]. However, the literature has identified the limited ability of BOW to extract the underlying semantics of text and recommends the use of Bag of Concepts (BOC) [10]. The BOC representation is a major shift from the BOW approach. It is intended to support concept-based sentiment analysis (CBSA) by utilizing the semantic meaning of natural language opinions/text [10]. Concept-based sentiment analysis methods are unsupervised in the sense that pre-labeled data is not mandatory. SenticNet [11], [12] and linguistic rules [13] have been developed as part of these methods. Relevant studies have revealed that these approaches cannot assign a sentiment polarity to all kinds of text because their knowledge bases are not rich enough [10], [13]. Researchers have therefore ensembled other techniques with CBSA methods. Among machine learning (ML) techniques, different classifiers have been integrated with unsupervised concept-based sentiment methods for sentiment prediction [10], [14].

The challenge in using classifiers is that the training process requires pre-labeled data. Manually labeling a large amount of unlabeled data is cumbersome, and the labeling process may be prolonged further by the time constraints of domain experts. In contrast, pre-labeled data is not a mandatory requirement for unsupervised (clustering) approaches. These methods accept unlabeled data and generate clusters of similar data instances.

In this paper, we propose a novel unsupervised ensemble framework based on concept-based sentiment analysis methods and hierarchical clustering to perform Twitter sentiment analysis, as shown in Fig. 1. In the proposed framework, both methods work in an unsupervised fashion. Hierarchical clustering has not previously been integrated with concept-based methods. In this framework, the concept-based analysis module first classifies Tweets using (a) a majority voting mechanism, (b) tie-breakers based on intensity ranking, and (c) linguistic patterns. To the best of our knowledge, concept-based sentiment analysis has not been investigated in this manner before. Tweets that are not classified by this module are then delegated to three popular agglomerative hierarchical clustering algorithms: single-linkage (SL), complete-linkage (CL), and average-linkage (AL). These methods have already been employed in some recent relevant research studies [15], [16], [17]. We have also performed a comparative analysis with previously investigated classifiers, i.e., Naive Bayes and Neural Network. An empirical study is performed on four English-language Twitter datasets. Accuracy is used to evaluate the performance of the proposed unsupervised framework in terms of polarity prediction of Tweets. Unigrams are used for feature extraction, and the Boolean and TF-IDF methods are used to represent features for the delegated Tweets.
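The first two steps of the concept-based module (majority voting, then an intensity-based tie-breaker) can be illustrated with a minimal Python sketch. The toy lexicon, its scores, and the function name `label_tweet` are hypothetical and are not taken from the paper or from SenticNet; the exact tie-breaking rule is also an assumption (here the single most intense concept decides), and the linguistic patterns step is not reproduced.

```python
from typing import Dict, List, Optional

# Hypothetical concept lexicon: concept -> polarity intensity in [-1, 1].
# Illustrative values only; these are not SenticNet scores.
TOY_LEXICON: Dict[str, float] = {
    "good": 0.6,
    "love": 0.9,
    "bad": -0.6,
    "awful": -0.9,
    "slow": -0.3,
}


def label_tweet(concepts: List[str],
                lexicon: Dict[str, float] = TOY_LEXICON) -> Optional[str]:
    """Return 'positive', 'negative', or None if the Tweet must be delegated."""
    scores = [lexicon[c] for c in concepts if c in lexicon]
    if not scores:
        return None  # no known concept: delegate to the clustering stage

    positives = sum(1 for s in scores if s > 0)
    negatives = sum(1 for s in scores if s < 0)

    # Step 1: majority voting over the polarities of the known concepts.
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"

    # Step 2 (assumed tie-breaker): the most intense concept decides.
    strongest_pos = max((s for s in scores if s > 0), default=0.0)
    strongest_neg = max((-s for s in scores if s < 0), default=0.0)
    if strongest_pos > strongest_neg:
        return "positive"
    if strongest_neg > strongest_pos:
        return "negative"
    return None  # still tied: delegate
```

In this sketch, a return value of `None` stands for a Tweet that the concept-based module could not label and that would therefore be passed on to the clustering algorithms.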

The main contributions of this research work are as follows:

• It proposes an unsupervised ensemble (cooperative) framework built on concept-based methods and agglomerative hierarchical clustering for Twitter sentiment analysis.
• It presents a performance-based comparative analysis of clustering and classification when integrated with concept-based methods.
• It analyzes the performance of each individual technique under study.
• It employs majority voting, tie-breaker criteria, and linguistic rules in the concept-based sentiment analysis module.
• It also presents the performance of two feature representation methods (Boolean and TF-IDF).
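The difference between the two feature representations can be sketched in a few lines of Python over unigram tokens. The function names are hypothetical, and the TF-IDF weighting shown (term frequency normalized by document length, multiplied by log(N/df)) is one common variant chosen as an assumption; the excerpt does not state which variant the authors used.

```python
import math
from collections import Counter
from typing import List


def build_vocabulary(docs: List[List[str]]) -> List[str]:
    """Sorted list of all distinct unigrams in the corpus."""
    return sorted({tok for doc in docs for tok in doc})


def boolean_vectors(docs: List[List[str]], vocab: List[str]) -> List[List[int]]:
    """1 if the term occurs in the document, 0 otherwise."""
    vectors = []
    for doc in docs:
        present = set(doc)
        vectors.append([1 if term in present else 0 for term in vocab])
    return vectors


def tfidf_vectors(docs: List[List[str]], vocab: List[str]) -> List[List[float]]:
    """Assumed variant: tf = count / doc length, idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / length) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vectors
```

With this weighting, a term that occurs in every document receives a weight of zero under TF-IDF, whereas the Boolean representation still marks it as present.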

Section snippets

Related works

In this section, the literature relevant to Twitter sentiment analysis, clustering algorithms, concept-based sentiment analysis, and feature representation methods is presented in detail.

In [18], Twitter sentiment analysis is performed on English-language Tweets about the COVID-19 pandemic. A logistic regression algorithm is used for experimentation and better accuracy has been reported. In another study [19], Twitter sentiment analysis is performed on twenty-two datasets. Different features are

Proposed ensemble unsupervised framework encompassing concept-based and clustering approaches

The proposed framework is shown in Fig. 1. To address the research contributions, the Twitter datasets are given as input to the concept-level sentiment analysis module after the necessary preprocessing. The module infers the sentiment labels of Tweets and delegates those Tweets whose sentiment labels could not be discovered to the clustering approaches under study. The Tweets classified by the concept-based module and the delegated Tweets labeled by the clustering algorithms are combined and evaluated. For
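The clustering stage for delegated Tweets can be sketched as a naive agglomerative procedure in plain Python. This is an illustration under stated assumptions, not the authors' implementation: Euclidean distance, the function names, and running each linkage independently are all assumptions, and the excerpt does not describe how the three linkages are ensembled serially or how clusters are mapped to sentiment labels.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1: List[int], c2: List[int],
                     points: List[Vector], linkage: str) -> float:
    """Distance between two clusters (lists of point indices)."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of members
        return min(dists)
    if linkage == "complete":    # farthest pair of members
        return max(dists)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points: List[Vector], n_clusters: int,
                  linkage: str = "average") -> List[int]:
    """Merge the two closest clusters until n_clusters remain.

    Returns one cluster id per input point.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels
```

The feature vectors passed as `points` would be the Boolean or TF-IDF representations of the delegated Tweets. The naive search over all cluster pairs is adequate only for small inputs and is used here for readability.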

Results and discussion

In this section, the experimental results for each contribution are presented in detail. The accuracy (%) of each participating technique is shown. The performance of the proposed unsupervised ensemble, based on the concept-based module and agglomerative hierarchical clustering, and of the previously investigated classifiers is shown in Figs. 2–3 (average results in Table 6). For this purpose, the classified Tweets from the concept-based module and the algorithms under study are combined and correct predictions
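The combined accuracy described above can be sketched as follows, assuming that predictions are keyed by a Tweet identifier. The function name and the dictionary-based interface are hypothetical, and counting a Tweet with no prediction as incorrect is an assumption rather than something stated in the text.

```python
from typing import Dict


def combined_accuracy(concept_preds: Dict[int, str],
                      delegated_preds: Dict[int, str],
                      gold: Dict[int, str]) -> float:
    """Accuracy (%) over all gold-labeled Tweets.

    concept_preds: labels assigned by the concept-based module.
    delegated_preds: labels assigned to delegated Tweets by the second stage.
    A Tweet missing from both dictionaries is counted as incorrect (assumption).
    """
    if not gold:
        raise ValueError("gold labels must not be empty")
    # Concept-based labels take precedence if a Tweet appears in both.
    merged = {**delegated_preds, **concept_preds}
    correct = sum(1 for tid, label in gold.items() if merged.get(tid) == label)
    return 100.0 * correct / len(gold)
```

For example, if the concept-based module labels two Tweets correctly and the second stage labels one of two delegated Tweets correctly, the combined accuracy is 75%.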

Conclusion and future work

The ultimate goal of this research is to present an alternative unsupervised framework that avoids the trade-off between the manual effort of labeling data for supervised techniques and the modest accuracy of unsupervised techniques when analyzing Twitter sentiment data.

To address the first contribution, three agglomerative hierarchical clustering algorithms (SL, CL, AL) are ensembled with concept-based methods. Concept-based methods extract BOC and apply majority voting to discover sentiment labels. To meet

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (32)

S. Poria et al., Sentic patterns: dependency-based rules for concept-level sentiment analysis, Knowl Based Syst (2014)

R. Sukthanker et al., Anaphora and coreference resolution: a review, Information Fusion (2020)

M.J. Paul et al., Social media mining for public health monitoring and surveillance (2016)

A. Jungherr, Twitter use in election campaigns: a systematic literature review, Journal of Information Technology and Politics (2016)

O. Oh et al., Information control and terrorism: tracking the Mumbai terrorist attack through Twitter, Information Systems Frontiers (2011)

M.Z. Asghar et al., Sentence-level emotion detection framework using rule-based classification, Cognit Comput (2017)

A. Hassan et al., Twitter sentiment analysis: a bootstrap ensemble framework (2013)

A. Go et al., Twitter sentiment classification using distant supervision, CS224N Project Report Stanford (2009)

G. Yenduri et al., Heuristic-assisted BERT for Twitter sentiment analysis, Int J Comput Intell Appl (2021)

S. Wang et al., Baselines and bigrams: Simple, good sentiment and topic classification (2012)

N. Cummins et al., Multimodal bag-of-words for cross domains sentiment analysis (2018)

E. Cambria et al., Sentic computing: a common-sense-based framework for concept-level sentiment analysis (2015)

E. Cambria et al., SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings (2018)

E. Cambria et al., SenticNet 6: Ensemble application of symbolic and subsymbolic AI for sentiment analysis (2020)

F.Z. Xing et al., Intelligent asset allocation via market sentiment views, Computational Intelligence Magazine (2018)

S. Sharma et al., Comparative study of single linkage, complete linkage, and ward method of agglomerative clustering (2019)

