An image-text consistency driven multimodal sentiment analysis approach for social media

https://doi.org/10.1016/j.ipm.2019.102097

Highlights

  • We propose an image-text consistency measure for image-text posts.

  • We develop a multimodal sentiment analysis approach for image-text posts.

  • The proposed approach achieves superior performance on the Flickr benchmark dataset.

Abstract

Social media users are increasingly using both images and text to express their opinions and share their experiences, instead of text alone as in conventional social media. Consequently, conventional text-based sentiment analysis has evolved into the more complex study of multimodal sentiment analysis. To effectively exploit the information in both the visual content and the textual content of image-text posts, this paper proposes a new image-text consistency driven multimodal sentiment analysis approach. The proposed approach explores the correlation between the image and the text, followed by a multimodal adaptive sentiment analysis method. More specifically, the mid-level visual features extracted by the conventional SentiBank approach are used to represent visual concepts and are integrated with other features, including textual, visual and social features, to develop a machine-learning sentiment analysis approach. Extensive experiments are conducted to demonstrate the superior performance of the proposed approach.

Introduction

With the rapid growth of social media, users tend to share their opinions on platforms such as Twitter, Facebook and Sina Weibo. This user-generated content is diversifying in both content and format, and people increasingly post text embedded with images, namely image-text posts (Soleymani, Garcia, Jou, Schuller, Chang, & Pantic, 2017; Yu, Qiu, Wen, Lin, & Liu, 2016). Unlike conventional text-only posts, such posts are more informative because they contain visual content in addition to text. Sentiment analysis aims to automatically uncover the underlying attitude of a post. Owing to the rich sentiment cues found in images, sentiment analysis of visual content can contribute to extracting user sentiment and to applications such as understanding user behavior, stock market forecasting and predicting votes for politicians (Jiang, Yang, Lv, Tian, Meng, & Yan, 2017; Nie, Peng, Wang, Zhao, & Su, 2017; Peng, Shen, & Fan, 2013). Consider the popular posts illustrated in Fig. 1: some posters record the passage of time and express their expectations for the period ahead. Fig. 1(b) shows a dandelion with the words 'Goodbye November' and a beautiful tree with Chinese lanterns hanging from it; such posts can conjure up a positive sense of confidence about the future. Fig. 1(c), a scene from New York, illustrates how posts can help users record valuable travel experiences in particular cities.

The major challenge of sentiment analysis for social media lies in effective feature extraction and representation for both textual content and visual content. This challenge has drawn attention in the field of computer vision, particularly in image retrieval and emotional semantic image retrieval, which applies computer vision technology to bridge the affective gap between low-level features and the emotional content of an image (Machajdik & Hanbury, 2010). In conventional approaches, low-level visual features, such as color histograms, are used directly for sentiment analysis alongside textual features. This causes a great loss of the emotional information carried by the image, so a large semantic gap remains between low-level features and the emotional content of images. In view of this challenge, Borth, Ji, Chen, Breuel, and Chang (2013) proposed the more principled SentiBank approach, which models mid-level representations based on visual concepts called Adjective Noun Pairs (ANPs), such as “cute cat” and “happy girl”, where both the sentimental strength of the adjective and the detectability of the noun are considered. This approach has proven useful in detecting the emotions depicted in images.
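To make the ANP idea concrete, the following minimal Python sketch scores an image from a bank of ANP detectors by weighting each detector response with the sentiment value attached to its ANP. The labels, responses and sentiment values here are illustrative assumptions, not entries from the actual SentiBank ontology.

```python
import numpy as np

# Hypothetical ANP detector responses for one image (illustrative values only).
anp_scores = {"cute cat": 0.82, "happy girl": 0.10, "dark night": 0.45}

# Hypothetical sentiment values attached to each ANP, in [-2, 2].
anp_sentiment = {"cute cat": 1.7, "happy girl": 1.9, "dark night": -0.8}

def image_sentiment(scores, sentiment):
    """Response-weighted average of the sentiment values of detected ANPs."""
    s = np.array([scores[a] for a in scores])
    v = np.array([sentiment[a] for a in scores])
    return float(np.dot(s, v) / s.sum())

print(image_sentiment(anp_scores, anp_sentiment))  # ~0.89, a positive tendency
```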

To tackle the challenge of analyzing both the textual content and the visual content of image-text posts, an image-text consistency driven multimodal sentiment analysis approach is proposed in this paper. The proposed approach is motivated by two observations. First, low-level visual features such as color-based features have proven simple yet effective for image emotion analysis (Chen, Eldeen, He, Kan, & Lu, 2015). Different colors have different sentiment effects; for example, red, orange and yellow are warm colors that convey positive energy and feelings. In view of this, such low-level visual features should be considered in multimodal sentiment analysis. Second, the relationship between image and text is very important for multimodal sentiment analysis. On open social media platforms, people can publish image-text posts freely, with no requirement that the image and the text be consistent, so there exist fake posts that can mislead sentiment analysis, as seen in Fig. 2(a). Also, to depict certain moods or ideas, people may use satirical expressions to convey strong sentiment. For instance, in Fig. 2(b), the man fared poorly in his exam, yet he says 'what a nice day' while wearing an unhappy expression that reveals his depressed mood. In Fig. 2(c), the poster says 'I am a big fan of red apple', but in this context the word 'apple' refers to a technology brand rather than a fruit; the true meaning is difficult to determine from such a short context. In response to this problem, a new image-text correlation model is developed to examine the relationship between images and text. Furthermore, low-level visual features and different textual features are combined as enriched features to derive a multimodal sentiment analysis approach.
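As a rough illustration of what such a correlation model must decide, the sketch below treats an image-text pair as related when the post text overlaps sufficiently with the nouns of the image's top-detected ANPs, using cosine similarity over bags of words. The representation and threshold are assumptions for illustration only; the paper's actual correlation model is presented in Section 3.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def is_consistent(text_tokens, anp_nouns, threshold=0.2):
    """Call the pair related if text/visual-concept overlap clears the threshold."""
    return cosine(Counter(text_tokens), Counter(anp_nouns)) >= threshold

print(is_consistent(["goodbye", "november", "dandelion"], ["dandelion", "sky"]))  # True
```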

The contributions of this paper are two-fold.

  • First, to effectively exploit the information in both the visual and textual content of image-text posts, the proposed approach explores the correlation between the image and the text, where the mid-level visual features extracted by the conventional SentiBank approach are used to represent visual concepts, integrated with other features, including textual, visual and social features.

  • Second, the proposed approach performs multimodal adaptive sentiment analysis by incorporating the aforementioned image-text correlation model into the conventional SentiBank framework. For related image-text data, four types of features are exploited (basic textual features, social features, OCR features from the image, and Adjective Noun Pair (ANP) features from the image), while for unrelated image-text data, only the conventional ANP features from the image are used. In this way, the proposed approach adaptively adjusts the features used for sentiment analysis, as sketched in the example below.
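A minimal sketch of this adaptive scheme, assuming stub extractors with illustrative dimensionalities (real extractors would be, e.g., TF-IDF or word vectors for text, an OCR engine, and SentiBank ANP detectors):

```python
import numpy as np

# Stub extractors with illustrative dimensionalities (assumptions, not the
# paper's actual extractors).
def text_features(text):   return np.zeros(300)   # basic textual features
def social_features(meta): return np.zeros(5)     # e.g., likes, reposts
def ocr_features(image):   return np.zeros(300)   # text recognized in the image
def anp_features(image):   return np.zeros(1200)  # ANP detector responses

def extract_features(text, image, meta, consistent):
    """All four feature groups for related pairs; ANP features only otherwise."""
    anp = anp_features(image)
    if not consistent:
        return anp
    return np.concatenate([text_features(text), social_features(meta),
                           ocr_features(image), anp])
```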

The rest of this paper is organized as follows. First, a brief literature review is provided in Section 2. The proposed multimodal sentiment analysis approach is then presented in Section 3 and evaluated through extensive experiments in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Related work

Sentiment analysis, sometimes known as opinion mining, aims to judge the emotional orientation (e.g., positive, negative or neutral) of user-generated content (Pang & Lee, 2008). Traditional sentiment analysis concentrates on textual sentiment analysis, whereas visual sentiment analysis has received relatively little attention. In recent years, however, much research has been devoted to visual sentiment analysis due to the exponential growth of Internet use. In this section, we will briefly discuss the …

Proposed image-text consistency driven multimodal sentiment analysis approach

In this section, the proposed image-text consistency driven multimodal sentiment analysis approach is presented. The proposed approach, as illustrated in Fig. 3, consists of four critical components, which are briefly described as follows.

  • Preprocessing: In the preparation stage, standard natural language processing methods, e.g., tokenization, stop-word removal and stemming, are used to process the text data (a minimal sketch follows this list).

  • Feature extraction: Three main types of features, i.e. textual features, visual features, and …
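As referenced in the Preprocessing item above, here is a minimal sketch of that stage, assuming NLTK's Porter stemmer and a small illustrative stop-word list (the snippet does not specify the pipeline's exact tools):

```python
import re
from nltk.stem import PorterStemmer  # example stemmer; any stemmer would do

STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stemmer.stem(t) for t in tokens]              # stemming

print(preprocess("Goodbye November, waiting for the beautiful lanterns"))
# ['goodby', 'novemb', 'wait', 'beauti', 'lantern']
```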

Dataset description

The dataset used in this paper is the benchmark data of the Visual Sentiment Ontology (Borth et al., 2013). It contains 603 images in total, covering a diverse set of over 21 topics, with corresponding sentiment ground-truth values. To further examine the correlation relationship, multiple labels must be created for the dataset. Based on the dataset's characteristics, six categories of labels are created, including (i) Noun, meaning that both the text and the image share the same object, such as …

Conclusions

An image-text consistency driven multimodal sentiment analysis approach has been proposed in this paper for social media. The proposed approach exploits an image-text consistency measure to decide whether the image content and the text content are consistent with each other, and then adaptively merges the textual features with the visual features used in conventional SentiBank to provide more accurate sentiment analysis of image-text posts.

References (48)

  • Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. arXiv:...
  • Colombo, C., et al. (1999). Semantics in visual information retrieval. IEEE MultiMedia.
  • Csurka, G., et al. (2004). Visual categorization with bags of keypoints. European Conf. on Computer Vision.
  • Datta, R., et al. (2006). Studying aesthetics in photographic images using a computational approach. European Conf. on Computer Vision.
  • Dave, K., et al. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. Int. Conf. on World Wide Web.
  • Deng, J., et al. (2009). ImageNet: A large-scale hierarchical image database. IEEE Int. Conf. on Computer Vision and Pattern Recognition.
  • Hayashi, T., et al. (1998). Image query by impression words: The IQI system. IEEE Transactions on Consumer Electronics.
  • He, K., et al. (2016). Deep residual learning for image recognition. IEEE Int. Conf. on Computer Vision and Pattern Recognition.
  • Jou, B., et al. (2015). Visual affect around the world: A large-scale multilingual visual sentiment ontology. ACM Int. Conf. on Multimedia.
  • Ke, Y., et al. (2006). The design of high-level features for photo quality assessment. IEEE Int. Conf. on Computer Vision and Pattern Recognition.
  • Li, B., et al. (2012). Scaring or pleasing: Exploit emotional impact of an image. ACM Int. Conf. on Multimedia.
  • Lilleberg, J., et al. (2015). Support vector machines and Word2vec for text classification with semantic features. IEEE Int. Conf. on Cognitive Informatics & Cognitive Computing.
  • Maas, A. L., et al. (2011). Learning word vectors for sentiment analysis. Annual Meeting of the Association for Computational Linguistics.
  • Machajdik, J., et al. (2010). Affective image classification using features inspired by psychology and art theory. ACM Int. Conf. on Multimedia.