An image-text consistency driven multimodal sentiment analysis approach for social media

https://doi.org/10.1016/j.ipm.2019.102097

Highlights

  • We propose an image-text consistency measure for image-text posts.

  • We develop a multimodal sentiment analysis approach for image-text posts.

  • The proposed approach achieves superior performance on the Flickr benchmark dataset.

Abstract

Social media users are increasingly using both images and text to express their opinions and share their experiences, instead of text alone as in conventional social media. Consequently, conventional text-based sentiment analysis has evolved into the more complex study of multimodal sentiment analysis. To effectively exploit the information in both the visual content and the textual content of image-text posts, this paper proposes a new image-text consistency driven multimodal sentiment analysis approach. The proposed approach explores the correlation between the image and the text, followed by a multimodal adaptive sentiment analysis method. More specifically, the mid-level visual features extracted by the conventional SentiBank approach are used to represent visual concepts and are integrated with other features, including textual, visual and social features, to develop a machine-learning sentiment analysis approach. Extensive experiments are conducted to demonstrate the superior performance of the proposed approach.

Introduction

With the rapid growth of social media, users tend to share their opinions on platforms such as Twitter, Facebook and Sina Weibo. This user-generated content is diversifying in both content and format, and people increasingly post text embedded with images, namely image-text posts (Soleymani, Garcia, Jou, Schuller, Chang, & Pantic, 2017; Yu, Qiu, Wen, Lin, & Liu, 2016). Unlike conventional text-only posts, such posts are more informative because they contain visual content in addition to text. Sentiment analysis aims to automatically uncover the underlying attitude of a post. Owing to the rich sentiment cues found in images, sentiment analysis of visual content can contribute to extracting user sentiment and to applications such as understanding user behavior, stock market forecasting and predicting votes for politicians (Jiang, Yang, Lv, Tian, Meng, & Yan, 2017; Nie, Peng, Wang, Zhao, & Su, 2017; Peng, Shen, & Fan, 2013). Consider the popular posts illustrated in Fig. 1: some posters record the passage of time and express their expectations for the period ahead. Fig. 1(b) shows a dandelion with the words 'Goodbye November' and a beautiful tree with Chinese lanterns hanging from it; such posts can conjure up a positive sense of confidence about the future. Fig. 1(c), a scene from New York, illustrates how posts can help users record valuable travel experiences in particular cities.

The major challenge of sentiment analysis for social media lies in effective feature extraction and representation for both textual content and visual content. This challenge has drawn attention in the field of computer vision, particularly in image retrieval and emotional semantic image retrieval, which applies computer vision technology to bridge the affective gap between low-level features and the emotional content of an image (Machajdik & Hanbury, 2010). In conventional approaches, low-level visual features, such as color histograms, are used directly for sentiment analysis alongside textual features. This causes a great loss of the emotional information carried by the image, so a large semantic gap remains between low-level features and the emotional content of images. In view of this challenge, Borth, Ji, Chen, Breuel, and Chang (2013) proposed the more principled SentiBank approach, which models mid-level representations based on visual concepts called Adjective Noun Pairs (ANPs), such as “cute cat” and “happy girl”, where both the sentimental strength of the adjective and the detectability of the noun are considered. This approach has proven useful in detecting the emotions depicted in images.
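To make the ANP idea concrete, the following minimal Python sketch scores an image from a bank of ANP detectors by weighting each detector response with the sentiment value attached to its ANP. The labels, responses and sentiment values here are illustrative assumptions, not entries from the actual SentiBank ontology.

```python
import numpy as np

# Hypothetical ANP detector responses for one image (illustrative values only).
anp_scores = {"cute cat": 0.82, "happy girl": 0.10, "dark night": 0.45}

# Hypothetical sentiment values attached to each ANP, in [-2, 2].
anp_sentiment = {"cute cat": 1.7, "happy girl": 1.9, "dark night": -0.8}

def image_sentiment(scores, sentiment):
    """Response-weighted average of the sentiment values of detected ANPs."""
    s = np.array([scores[a] for a in scores])
    v = np.array([sentiment[a] for a in scores])
    return float(np.dot(s, v) / s.sum())

print(image_sentiment(anp_scores, anp_sentiment))  # ~0.89, a positive tendency
```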

To tackle the challenge of analyzing both the textual content and the visual content of image-text posts, an image-text consistency driven multimodal sentiment analysis approach is proposed in this paper. The proposed approach is motivated by two observations. First, low-level visual features such as color-based features have proven simple yet effective for image emotion analysis (Chen, Eldeen, He, Kan, & Lu, 2015). Different colors have different sentiment effects; for example, red, orange and yellow are warm colors that convey positive energy and feelings. In view of this, such low-level visual features should be considered in multimodal sentiment analysis. Second, the relationship between image and text is very important for multimodal sentiment analysis. On open social media platforms, people can publish image-text posts freely, with no requirement that the image and the text be consistent, so there exist fake posts that can mislead sentiment analysis, as seen in Fig. 2(a). Also, to depict certain moods or ideas, people may use satirical expressions to convey strong sentiment. For instance, in Fig. 2(b), the man fared poorly in his exam, yet he says 'what a nice day' while wearing an unhappy expression that reveals his depressed mood. In Fig. 2(c), the poster says 'I am a big fan of red apple', but in this context the word 'apple' refers to a technology brand rather than a fruit; the true meaning is difficult to determine from such a short context. In response to this problem, a new image-text correlation model is developed to examine the relationship between images and text. Furthermore, low-level visual features and different textual features are combined as enriched features to derive a multimodal sentiment analysis approach.
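As a rough illustration of what such a correlation model must decide, the sketch below treats an image-text pair as related when the post text overlaps sufficiently with the nouns of the image's top-detected ANPs, using cosine similarity over bags of words. The representation and threshold are assumptions for illustration only; the paper's actual correlation model is presented in Section 3.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def is_consistent(text_tokens, anp_nouns, threshold=0.2):
    """Call the pair related if text/visual-concept overlap clears the threshold."""
    return cosine(Counter(text_tokens), Counter(anp_nouns)) >= threshold

print(is_consistent(["goodbye", "november", "dandelion"], ["dandelion", "sky"]))  # True
```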

The contributions of this paper are two-fold.

  • First, to effectively exploit the information in both the visual and textual content of image-text posts, the proposed approach explores the correlation between the image and the text, where the mid-level visual features extracted by the conventional SentiBank approach are used to represent visual concepts, integrated with other features, including textual, visual and social features.

  • Second, the proposed approach performs multimodal adaptive sentiment analysis by incorporating the aforementioned image-text correlation model into the conventional SentiBank framework. For related image-text data, four types of features are exploited (basic textual features, social features, OCR features from the image, and Adjective Noun Pair (ANP) features from the image), while for unrelated image-text data, only the conventional ANP features from the image are used. In this way, the proposed approach adaptively adjusts the features used for sentiment analysis, as sketched in the example below.
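A minimal sketch of this adaptive scheme, assuming stub extractors with illustrative dimensionalities (real extractors would be, e.g., TF-IDF or word vectors for text, an OCR engine, and SentiBank ANP detectors):

```python
import numpy as np

# Stub extractors with illustrative dimensionalities (assumptions, not the
# paper's actual extractors).
def text_features(text):   return np.zeros(300)   # basic textual features
def social_features(meta): return np.zeros(5)     # e.g., likes, reposts
def ocr_features(image):   return np.zeros(300)   # text recognized in the image
def anp_features(image):   return np.zeros(1200)  # ANP detector responses

def extract_features(text, image, meta, consistent):
    """All four feature groups for related pairs; ANP features only otherwise."""
    anp = anp_features(image)
    if not consistent:
        return anp
    return np.concatenate([text_features(text), social_features(meta),
                           ocr_features(image), anp])
```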

The rest of this paper is organized as follows. First, a brief literature review is provided in Section 2. The proposed multimodal sentiment analysis approach is then presented in Section 3 and evaluated through extensive experiments in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Related work

Sentiment analysis, sometimes known as opinion mining, aims to judge the emotional orientation (e.g., positive, negative or neutral) of user-generated content (Pang & Lee, 2008). Traditional sentiment analysis concentrates on textual sentiment analysis, whereas visual sentiment analysis has received relatively little attention. In recent years, however, much research has been devoted to visual sentiment analysis due to the exponential growth of Internet use. In this section, we will briefly discuss the …

Proposed image-text consistency driven multimodal sentiment analysis approach

In this section, the proposed image-text consistency driven multimodal sentiment analysis approach is presented. The proposed approach, as illustrated in Fig. 3, consists of four critical components, which are briefly described as follows.

  • Preprocessing: In the preparation stage, standard natural language processing methods, e.g., tokenization, stop-word removal and stemming, are used to process the text data (a minimal sketch follows this list).

  • Feature extraction: Three main types of features, i.e. textual features, visual features, and …
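As referenced in the Preprocessing item above, here is a minimal sketch of that stage, assuming NLTK's Porter stemmer and a small illustrative stop-word list (the snippet does not specify the pipeline's exact tools):

```python
import re
from nltk.stem import PorterStemmer  # example stemmer; any stemmer would do

STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stemmer.stem(t) for t in tokens]              # stemming

print(preprocess("Goodbye November, waiting for the beautiful lanterns"))
# ['goodby', 'novemb', 'wait', 'beauti', 'lantern']
```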

Dataset description

The dataset used in this paper is the benchmark data of the Visual Sentiment Ontology (Borth et al., 2013). It contains 603 images in total, covering a diverse set of over 21 topics, with corresponding sentiment ground-truth values. To further examine the correlation relationship, multiple labels must be created for the dataset. Based on the dataset's characteristics, six categories of labels are created, including (i) Noun, meaning that both the text and the image share the same object, such as …

Conclusions

An image-text consistency driven multimodal sentiment analysis approach has been proposed in this paper for social media. The proposed approach exploits an image-text consistency measure to decide whether the image content and the text content are consistent with each other, and then adaptively merges the textual features with the visual features used in conventional SentiBank to provide more accurate sentiment analysis of image-text posts.

References (48)

  • Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. arXiv:...
  • Colombo, C., et al. (1999). Semantics in visual information retrieval. IEEE MultiMedia.
  • Csurka, G., et al. (2004). Visual categorization with bags of keypoints. European Conf. on Computer Vision.
  • Datta, R., et al. (2006). Studying aesthetics in photographic images using a computational approach. European Conf. on Computer Vision.
  • Dave, K., et al. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. Int. Conf. on World Wide Web.
  • Deng, J., et al. (2009). ImageNet: A large-scale hierarchical image database. IEEE Int. Conf. on Computer Vision and Pattern Recognition.
  • Hayashi, T., et al. (1998). Image query by impression words: The IQI system. IEEE Transactions on Consumer Electronics.
  • He, K., et al. (2016). Deep residual learning for image recognition. IEEE Int. Conf. on Computer Vision and Pattern Recognition.
  • Jou, B., et al. (2015). Visual affect around the world: A large-scale multilingual visual sentiment ontology. ACM Int. Conf. on Multimedia.
  • Ke, Y., et al. (2006). The design of high-level features for photo quality assessment. IEEE Int. Conf. on Computer Vision and Pattern Recognition.
  • Li, B., et al. (2012). Scaring or pleasing: Exploit emotional impact of an image. ACM Int. Conf. on Multimedia.
  • Lilleberg, J., et al. (2015). Support vector machines and Word2vec for text classification with semantic features. IEEE Int. Conf. on Cognitive Informatics & Cognitive Computing.
  • Maas, A. L., et al. (2011). Learning word vectors for sentiment analysis. Annual Meeting of the Association for Computational Linguistics.
  • Machajdik, J., et al. (2010). Affective image classification using features inspired by psychology and art theory. ACM Int. Conf. on Multimedia.